Databricks Unity Catalog: A Complete Hands-On Tutorial

1. What Exactly Is Unity Catalog?

Unity Catalog is the unified data governance layer that Databricks announced at the 2021 Data+AI Summit. The core problem it solves: in an environment with multiple workspaces, multiple clouds, and multiple teams, how do you manage the metadata, permissions, auditing, and lineage of all data and AI assets from a single place? [1][2]

Before Unity Catalog, Databricks relied on the Hive Metastore. Each workspace had its own independent metastore, permissions could not span workspaces, and governance depended entirely on third-party tools. Unity Catalog solves all of these problems at once. [1]

Core capabilities in one line each

  • Define once, secure everywhere: permissions are defined once and take effect in every workspace [2]
  • ANSI SQL permission model: manage permissions with standard SQL GRANT/REVOKE [2]
  • Built-in auditing and lineage: automatically records who accessed which data, and where data comes from and flows to [3][4]
  • Data discovery: tag, search, and browse all data assets [2]

2. Architecture and Object Model

2.1 The Three-Level Namespace

The core design of Unity Catalog is a three-level namespace: [5][2]

Metastore
 └── Catalog          (Level 1: unit of logical isolation, similar to a "database cluster")
      └── Schema       (Level 2: also called a Database; a logical grouping)
           ├── Table   (Level 3: structured data)
           ├── View    (Level 3: a saved query)
           ├── Volume  (Level 3: non-tabular data, e.g. files)
           ├── Function(Level 3: UDF)
           └── Model   (Level 3: an ML model registered via MLflow)

How to reference a table

catalog_name.schema_name.table_name

For example, sales_prod.orders.orders_delta is the orders_delta table inside the orders schema of the sales_prod catalog. [6][1]
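The name-resolution rules can be sketched in a few lines of plain Python (an illustrative helper, not a Databricks API): a fully qualified name is used as-is, while shorter names are completed from the session's current catalog and schema, which is exactly what USE CATALOG and USE SCHEMA make possible.

```python
# Illustrative sketch: resolving a possibly partially qualified table name
# against the session's current catalog and schema. Function name and
# signature are hypothetical.

def resolve_table_name(name, current_catalog, current_schema):
    parts = name.split(".")
    if len(parts) == 3:        # catalog.schema.table: already fully qualified
        return tuple(parts)
    if len(parts) == 2:        # schema.table: fill in the current catalog
        return (current_catalog, parts[0], parts[1])
    # bare table name: fill in both the current catalog and schema
    return (current_catalog, current_schema, parts[0])

# resolve_table_name("orders_delta", "sales_prod", "orders")
#   -> ("sales_prod", "orders", "orders_delta")
```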

2.2 Metastore: The Top-Level Container

The metastore is the topmost container in Unity Catalog. It stores all metadata and manages permissions. There should be one metastore per region, and multiple workspaces can share the same metastore. [7][2]

Key difference: a Hive Metastore is one-per-workspace, while a Unity Catalog metastore is shared across workspaces. This sharing is the fundamental source of its governance capabilities.

2.3 Complete Object Hierarchy

┌────────────────────────────────────┐
│            Metastore               │  ← region-level, one per region
│  ┌──────────────────────────────┐  │
│  │  Storage Credentials         │  │  ← cloud storage credentials (IAM Role/SPN)
│  │  External Locations          │  │  ← cloud storage path + credential binding
│  │  Connections (Federation)    │  │  ← external database connections
│  ├──────────────────────────────┤  │
│  │  Catalog: sales_prod         │  │  ← Level 1
│  │   ├── Schema: orders         │  │  ← Level 2
│  │   │    ├── Table (managed)   │  │  ← Level 3
│  │   │    ├── Table (external)  │  │
│  │   │    ├── View              │  │
│  │   │    ├── Volume            │  │
│  │   │    ├── Function (UDF)    │  │
│  │   │    └── Model (MLflow)    │  │
│  │   └── Schema: analytics      │  │
│  ├──────────────────────────────┤  │
│  │  Catalog: dev                │  │
│  │   └── ...                    │  │
│  └──────────────────────────────┘  │
└────────────────────────────────────┘

3. Managed Table vs External Table

This is one of the most fundamental distinctions in Unity Catalog: [8][2]

Aspect           Managed Table                                    External Table
Data storage     Storage location managed by Unity Catalog        External cloud storage path that you specify
File formats     Delta Lake only                                  Delta, CSV, JSON, Parquet, Avro, ORC, Text
Lifecycle        DROP TABLE deletes metadata AND data files       DROP TABLE deletes metadata only; data files remain
Typical use      New tables, managed entirely inside Databricks   Existing data in S3/ADLS, registered for governance
Recommendation   ✅ Recommended for most scenarios                Use when data must also be accessible outside Databricks
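The lifecycle difference in the table above can be made concrete with a small simulation (plain Python, not Databricks code; all names are hypothetical): dropping either kind of table removes its metadata, but only a managed table's data files are deleted with it.

```python
# Illustrative sketch: DROP TABLE semantics for managed vs external tables.

class FakeCloudStorage:
    """Stands in for S3/ADLS: just a set of file paths."""
    def __init__(self):
        self.files = set()

class FakeCatalog:
    def __init__(self, storage):
        self.metadata = {}                     # table name -> (kind, path)
        self.storage = storage

    def create_table(self, name, kind, path):
        self.metadata[name] = (kind, path)
        self.storage.files.add(path)

    def drop_table(self, name):
        kind, path = self.metadata.pop(name)   # metadata is always removed
        if kind == "managed":
            self.storage.files.discard(path)   # managed: data files go too

storage = FakeCloudStorage()
cat = FakeCatalog(storage)
cat.create_table("t_managed", "managed", "uc://managed/t1")
cat.create_table("t_external", "external", "s3://bucket/t2")
cat.drop_table("t_managed")
cat.drop_table("t_external")
# After both drops: the managed data is gone, the external data survives.
```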

4. Hands-On Code

The code below can be run directly in a Databricks notebook.

4.1 Explore the Current Environment

-- Show the current metastore
SELECT CURRENT_METASTORE();

-- Show the current catalog
SELECT CURRENT_CATALOG();

-- Show the current schema
SELECT CURRENT_SCHEMA();

-- List all catalogs
SHOW CATALOGS;

-- List all schemas in the current catalog
SHOW SCHEMAS;

4.2 Create a Catalog and a Schema

-- Create a catalog (requires metastore admin or the CREATE CATALOG privilege)
CREATE CATALOG IF NOT EXISTS logistics_prod
COMMENT 'Logistics production data catalog';

-- Switch to the new catalog
USE CATALOG logistics_prod;

-- Create a schema
CREATE SCHEMA IF NOT EXISTS shipments
COMMENT 'Shipment data module';

-- Create a schema with a custom storage location (physical data isolation)
CREATE SCHEMA IF NOT EXISTS sensitive_data
MANAGED LOCATION 's3://my-secure-bucket/sensitive/'
COMMENT 'Sensitive data, physically isolated storage';

-- Inspect schema details
DESCRIBE SCHEMA EXTENDED shipments;

If you need a catalog with a custom storage location:

CREATE CATALOG IF NOT EXISTS hr_prod
MANAGED LOCATION 's3://mycompany-hr-prod/unity-catalog/'
COMMENT 'HR production data, stored separately for compliance';

4.3 Create a Managed Table

USE CATALOG logistics_prod;
USE SCHEMA shipments;

-- Create a managed Delta table
CREATE TABLE IF NOT EXISTS flight_delays (
    flight_id       STRING,
    airline         STRING,
    origin          STRING,
    destination     STRING,
    delay_minutes   INT,
    flight_date     DATE
)
COMMENT 'Flight delay data'
PARTITIONED BY (flight_date);

-- Insert data
INSERT INTO flight_delays VALUES
    ('LH1234', 'Lufthansa', 'FRA', 'JFK', 15, '2026-01-10'),
    ('UA5678', 'United', 'EWR', 'FRA', -5, '2026-01-10'),
    ('LH9012', 'Lufthansa', 'FRA', 'NRT', 45, '2026-01-11'),
    ('BA3456', 'British Airways', 'LHR', 'FRA', 0, '2026-01-11');

-- Query the table
SELECT * FROM flight_delays;

-- Show detailed table information (storage location, format, etc.)
DESCRIBE TABLE EXTENDED flight_delays;

-- Show physical storage details
DESCRIBE DETAIL flight_delays;

-- Show version history (a Delta Lake feature)
DESCRIBE HISTORY flight_delays;

4.4 Create an External Table

An external table requires a Storage Credential and an External Location to be set up first. [9][2]

-- Step 1: List existing storage credentials
SHOW STORAGE CREDENTIALS;

-- Step 2: Create an external location (requires an existing storage credential)
CREATE EXTERNAL LOCATION IF NOT EXISTS ext_shipment_data
URL 's3://external-data-bucket/shipments/'
WITH (STORAGE CREDENTIAL my_aws_credential)
COMMENT 'External shipment data location';

-- Step 3: Create the external table
CREATE TABLE logistics_prod.shipments.external_shipments
USING DELTA
LOCATION 's3://external-data-bucket/shipments/delta_table/';

-- Parquet also works
CREATE TABLE logistics_prod.shipments.legacy_data
USING PARQUET
LOCATION 's3://legacy-bucket/old-shipments/';

4.5 Working with Unity Catalog from PySpark

# ============================================
# Read and write Unity Catalog tables with PySpark
# ============================================

# Read a table using its three-level name
df = spark.table("logistics_prod.shipments.flight_delays")
df.show()

# Or via SQL
df = spark.sql("SELECT * FROM logistics_prod.shipments.flight_delays WHERE delay_minutes > 10")
df.show()

# Append rows to a managed table
from pyspark.sql import Row

new_data = spark.createDataFrame([
    Row(flight_id="AF7890", airline="Air France", origin="CDG",
        destination="FRA", delay_minutes=30, flight_date="2026-01-12")
])

new_data.write.mode("append").saveAsTable("logistics_prod.shipments.flight_delays")

# Create a new managed table from a DataFrame
df_delayed = df.filter(df.delay_minutes > 20)
df_delayed.write.saveAsTable("logistics_prod.shipments.severely_delayed")

# Overwrite an existing table
df_delayed.write.mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("logistics_prod.shipments.severely_delayed")

4.6 Create and Use Volumes (Managing Non-Tabular Files)

Volumes are how Unity Catalog governs non-tabular data files (uploaded CSVs, images, model files, and so on): [2]

-- Create a managed volume
CREATE VOLUME IF NOT EXISTS logistics_prod.shipments.raw_files
COMMENT 'Raw uploaded files';

-- Create an external volume
CREATE EXTERNAL VOLUME logistics_prod.shipments.external_files
LOCATION 's3://external-data-bucket/raw-files/';

# Read and write files in a volume from Python
# Volume path format: /Volumes/<catalog>/<schema>/<volume>/

import pandas as pd

# Write a file into the volume
pdf = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
pdf.to_csv("/Volumes/logistics_prod/shipments/raw_files/sample.csv", index=False)

# Read a file from the volume
df = spark.read.csv(
    "/Volumes/logistics_prod/shipments/raw_files/sample.csv",
    header=True, inferSchema=True
)
df.show()

# List the files in the volume
files = dbutils.fs.ls("/Volumes/logistics_prod/shipments/raw_files/")
display(files)

5. Access Control

5.1 Permission Model

Unity Catalog permissions follow the ANSI SQL standard and are inherited down the hierarchy: [1][2]

Metastore privileges
    └── inherited by → Catalog privileges
         └── inherited by → Schema privileges
              └── inherited by → Table/View/Volume privileges

Core privileges

Privilege        Purpose
USE CATALOG      Prerequisite for accessing a catalog
USE SCHEMA       Prerequisite for accessing a schema
CREATE CATALOG   Create catalogs in the metastore
CREATE SCHEMA    Create schemas in a catalog
CREATE TABLE     Create tables in a schema
SELECT           Read data from tables/views
MODIFY           INSERT/UPDATE/DELETE data
ALL PRIVILEGES   All privileges
MANAGE           Manage privileges and grant them to others
5.2 Permission Commands

-- ============================================
-- Granting (GRANT)
-- ============================================

-- Let the data_analysts group access the catalog
GRANT USE CATALOG ON CATALOG logistics_prod TO `data_analysts`;

-- Let the data_analysts group access the schema and query the table
GRANT USE SCHEMA ON SCHEMA logistics_prod.shipments TO `data_analysts`;
GRANT SELECT ON TABLE logistics_prod.shipments.flight_delays TO `data_analysts`;

-- Let the data_engineers group create tables
GRANT USE CATALOG ON CATALOG logistics_prod TO `data_engineers`;
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA logistics_prod.shipments TO `data_engineers`;

-- Grant several privileges on the whole catalog at once
GRANT CREATE SCHEMA, CREATE TABLE, USE CATALOG
ON CATALOG logistics_prod TO `account users`;

-- ============================================
-- Inspecting grants
-- ============================================
SHOW GRANTS ON CATALOG logistics_prod;
SHOW GRANTS ON TABLE logistics_prod.shipments.flight_delays;
SHOW GRANTS `data_analysts` ON SCHEMA logistics_prod.shipments;

-- ============================================
-- Revoking (REVOKE)
-- ============================================
REVOKE SELECT ON TABLE logistics_prod.shipments.flight_delays
FROM `data_analysts`;

-- ============================================
-- Transferring ownership
-- ============================================
ALTER TABLE logistics_prod.shipments.flight_delays
SET OWNER TO `platform_team`;

Important caveat: Unity Catalog grants can only target account-level groups, not workspace-level groups. [2]


6. Dynamic Views: Row-Level and Column-Level Security

Dynamic views are Unity Catalog's core mechanism for fine-grained data security. [10][11]

6.1 Column-Level Security

Scenario: the auditors group sees the full email address; everyone else sees REDACTED.

CREATE OR REPLACE VIEW logistics_prod.shipments.customer_view AS
SELECT
    customer_id,
    full_name,
    CASE
        WHEN is_account_group_member('auditors') THEN email
        ELSE 'REDACTED'
    END AS email,
    CASE
        WHEN is_account_group_member('finance_team') THEN phone_number
        ELSE '***-***-****'
    END AS phone_number,
    order_total
FROM logistics_prod.shipments.customers_raw;

6.2 Row-Level Security

Scenario: the managers group sees all rows; regular users only see records with total ≤ 1,000,000.

CREATE OR REPLACE VIEW logistics_prod.shipments.sales_filtered AS
SELECT
    user_id,
    region,
    product,
    total
FROM logistics_prod.shipments.sales_raw
WHERE
    CASE
        WHEN is_account_group_member('managers') THEN TRUE
        ELSE total <= 1000000
    END;

6.3 Advanced Data Masking

-- Non-auditors only see the domain part of the email
CREATE OR REPLACE VIEW logistics_prod.shipments.sales_masked AS
SELECT
    user_id,
    region,
    CASE
        WHEN is_account_group_member('auditors') THEN email
        ELSE regexp_extract(email, '^.*@(.*)$', 1)
    END AS email_domain
FROM logistics_prod.shipments.sales_raw;
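The masking expression in the view above is easy to sanity-check outside Spark. This plain-Python sketch (hypothetical data, not Databricks code) applies the same pattern that Spark's regexp_extract(email, '^.*@(.*)$', 1) uses, which keeps only the domain:

```python
# Illustrative re-implementation of the view's masking logic in plain Python.
import re

def mask_email(email, is_auditor):
    if is_auditor:
        return email                      # auditors see the full address
    m = re.match(r'^.*@(.*)$', email)     # greedy .* stops at the last '@'
    return m.group(1) if m else ''        # everyone else sees the domain only

# mask_email("anna@example.com", is_auditor=False) -> "example.com"
# mask_email("anna@example.com", is_auditor=True)  -> "anna@example.com"
```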

6.4 Row Filters (Set Directly on the Table, No View Required)

Recent Databricks Runtime versions let you attach a row filter directly to a table: [12][13]

-- Create the filter function
CREATE OR REPLACE FUNCTION logistics_prod.shipments.region_filter(region_col STRING)
RETURNS BOOLEAN
RETURN
    is_account_group_member('global_access')
    OR region_col = (SELECT region FROM logistics_prod.shipments.user_regions
                     WHERE user_email = current_user());

-- Attach it to the table
ALTER TABLE logistics_prod.shipments.sales_raw
SET ROW FILTER logistics_prod.shipments.region_filter ON (region);
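The predicate that this row filter evaluates per row can be sketched in plain Python (hypothetical names and data, not Databricks code): members of global_access see everything, everyone else only sees rows whose region matches their assigned region.

```python
# Illustrative sketch of the row-filter predicate above.

USER_REGIONS = {"anna@example.com": "EMEA"}   # stand-in for the user_regions table

def region_filter(region_col, current_user, user_groups):
    if "global_access" in user_groups:
        return True                            # global access: every row passes
    # otherwise, only rows matching the user's assigned region pass
    return region_col == USER_REGIONS.get(current_user)
```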

7. Data Lineage

Unity Catalog captures data lineage automatically at runtime, across all languages (Python, SQL, Scala), down to column level. [4][3]

7.1 What Lineage Covers

  • Table-to-table read/write dependencies
  • Links between tables and notebooks, jobs, and dashboards
  • Cross-workspace lineage (within the same metastore)
  • Lineage data is retained for 1 year [4]

7.2 Viewing Lineage in the UI

In the Databricks workspace, click Catalog in the left navigation bar, find the target table, and open the Lineage tab to see an interactive lineage graph. [3]

7.3 Viewing Lineage via System Tables (Programmatic)

7.3 查看血缘 — System Tables(编程方式)

-- Table-level lineage
SELECT
    source_table_full_name,
    target_table_full_name,
    source_type,
    target_type,
    entity_type,
    event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'logistics_prod.shipments.flight_delays'
ORDER BY event_time DESC
LIMIT 20;

-- Column-level lineage
SELECT
    source_table_full_name,
    source_column_name,
    target_table_full_name,
    target_column_name,
    event_time
FROM system.access.column_lineage
WHERE target_table_full_name = 'logistics_prod.shipments.severely_delayed'
ORDER BY event_time DESC;

7.4 Audit Logs

-- See who accessed your tables
SELECT
    event_time,
    user_identity.email AS user_email,
    action_name,
    request_params.full_name_arg AS table_name
FROM system.access.audit
WHERE action_name IN ('getTable', 'commandSubmit')
    AND request_params.full_name_arg LIKE 'logistics_prod.shipments%'
ORDER BY event_time DESC
LIMIT 50;

8. Storage Credentials and External Locations in Detail

These are the bridge between Unity Catalog and cloud storage. [2]

8.1 How They Relate

Storage Credential (cloud credential)
    │
    ├── External Location A  → s3://bucket-a/path-a/
    ├── External Location B  → s3://bucket-a/path-b/
    └── External Location C  → s3://bucket-c/data/

One storage credential can be referenced by multiple external locations. [2]

8.2 Setup Flow

-- Step 1: Create a storage credential (AWS example)
-- Note: depending on your cloud and platform version, storage credentials
-- may need to be created through Catalog Explorer or the REST API instead of SQL.
CREATE STORAGE CREDENTIAL IF NOT EXISTS aws_logistics_cred
WITH (
    AWS_IAM_ROLE.ROLE_ARN = 'arn:aws:iam::123456789:role/unity-catalog-role'
)
COMMENT 'Storage credential for logistics data';

-- Step 2: Create an external location
CREATE EXTERNAL LOCATION IF NOT EXISTS logistics_external
URL 's3://logistics-data-lake/production/'
WITH (STORAGE CREDENTIAL aws_logistics_cred)
COMMENT 'External location for the logistics data lake';

-- Verify
DESCRIBE STORAGE CREDENTIAL aws_logistics_cred;
DESCRIBE EXTERNAL LOCATION logistics_external;

-- Grant: allow data_engineers to create external tables at this location
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION logistics_external
TO `data_engineers`;

8.3 Managed Storage Resolution

The storage location of a managed table is resolved in this priority order: [2]

Schema-level custom location (highest priority)
    ↓ if not set
Catalog-level custom location
    ↓ if not set
Metastore default location (lowest priority)
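The resolution order above amounts to "take the first location that is set, walking from schema to catalog to metastore." A minimal sketch in plain Python (hypothetical function, not a Databricks API):

```python
# Illustrative sketch of managed-storage location resolution.

def resolve_managed_location(metastore_root, catalog_location=None, schema_location=None):
    # Walk from the most specific level to the least specific one.
    for loc in (schema_location, catalog_location, metastore_root):
        if loc:
            return loc

# resolve_managed_location("s3://uc-root/", schema_location="s3://secure/sensitive/")
#   -> "s3://secure/sensitive/"
# resolve_managed_location("s3://uc-root/")
#   -> "s3://uc-root/"
```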

9. Querying the Information Schema

Unity Catalog provides a complete information_schema for metadata queries:

-- List all tables in the catalog
SELECT table_catalog, table_schema, table_name, table_type
FROM logistics_prod.information_schema.tables
ORDER BY table_schema, table_name;

-- List all columns of a given table
SELECT column_name, data_type, is_nullable, comment
FROM logistics_prod.information_schema.columns
WHERE table_schema = 'shipments'
  AND table_name = 'flight_delays';

-- List all table privileges
SELECT *
FROM logistics_prod.information_schema.table_privileges
WHERE table_name = 'flight_delays';

10. ML Model and Feature Store Integration

Unity Catalog manages ML assets too: [14]

import mlflow

# Point MLflow at Unity Catalog as the model registry
mlflow.set_registry_uri("databricks-uc")

# Register a model in Unity Catalog
model_name = "logistics_prod.ml_models.delay_predictor"

with mlflow.start_run():
    # ... train the model ...
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name=model_name
    )

# Load the model
loaded_model = mlflow.pyfunc.load_model(f"models:/{model_name}/1")

# Feature tables: any Delta table with a primary key can serve as a feature table
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

# Create a feature table
fe.create_table(
    name="logistics_prod.features.flight_features",
    primary_keys=["flight_id"],
    df=feature_df,
    description="Flight feature table"
)

11. Workspace-Catalog Binding (Environment Isolation)

You can bind specific catalogs to specific workspaces to isolate environments: [2]

Workspace: prod-workspace  → sees only the logistics_prod catalog
Workspace: dev-workspace   → sees only the logistics_dev catalog

This is common in enterprises: production data and development environments stay physically separated, so even a user who holds the right privileges cannot see the production catalog from the development workspace.


12. Unity Catalog vs Hive Metastore

Dimension                Hive Metastore              Unity Catalog
Namespace                2 levels (schema.table)     3 levels (catalog.schema.table)
Governance scope         Single workspace            Cross-workspace (same region)
Permission model         Workspace level             Account level
Permission granularity   Table level                 Row/column level
Data lineage             None                        Built in, column level
Audit logs               Limited                     Complete, queryable via system tables
ML model management      Separate Model Registry     Unified inside UC
Multi-cloud support      No                          AWS, Azure, GCP
Delta Sharing            Not supported               Natively integrated

[1]


13. Best Practices

  1. Use catalogs as isolation units: separate catalogs for prod/dev/test, and a dedicated catalog for sensitive data [1]
  2. Make groups, not individuals, the owners: easier to manage, and departures don't break permissions [1]
  3. Exploit permission inheritance: grant at the catalog/schema level and let it propagate downward [1]
  4. Prefer managed tables: unless an external system needs direct access to the data files [2]
  5. Lock down external location privileges: only a handful of administrators should be able to create external locations [1]
  6. Enable system tables: audit data access and lineage regularly [15]
  7. Use dynamic views for fine-grained security: far cheaper to maintain than duplicating tables [10]
  8. Use workspace-catalog binding: hard isolation for production [2]

14. Quick Reference

-- ===== Explore the environment =====
SELECT CURRENT_METASTORE();
SELECT CURRENT_CATALOG();
SELECT CURRENT_SCHEMA();
SHOW CATALOGS;
SHOW SCHEMAS IN my_catalog;
SHOW TABLES IN my_catalog.my_schema;

-- ===== Create objects =====
CREATE CATALOG IF NOT EXISTS my_catalog;
CREATE SCHEMA IF NOT EXISTS my_catalog.my_schema;
CREATE TABLE my_catalog.my_schema.my_table (id INT, name STRING);
CREATE VOLUME my_catalog.my_schema.my_volume;

-- ===== Manage permissions =====
GRANT USE CATALOG ON CATALOG my_catalog TO `group_name`;
GRANT SELECT ON TABLE my_catalog.my_schema.my_table TO `group_name`;
SHOW GRANTS ON TABLE my_catalog.my_schema.my_table;
REVOKE SELECT ON TABLE my_catalog.my_schema.my_table FROM `group_name`;

-- ===== Table operations =====
DESCRIBE TABLE EXTENDED my_catalog.my_schema.my_table;
DESCRIBE DETAIL my_catalog.my_schema.my_table;
DESCRIBE HISTORY my_catalog.my_schema.my_table;
OPTIMIZE my_catalog.my_schema.my_table;

-- ===== Lineage queries =====
SELECT * FROM system.access.table_lineage WHERE target_table_full_name = '...';
SELECT * FROM system.access.column_lineage WHERE target_table_full_name = '...';

This tutorial covers Unity Catalog end to end, from architecture to hands-on code. Combined with your existing Databricks AI Engineer certification background, mastering this material should let you handle the Unity Catalog governance and security questions on the ML Professional exam with confidence.


References

  1. Databricks Unity Catalog 101: An In-Depth Guide (2025)
  2. What is Unity Catalog? (Databricks on AWS)
  3. Get table lineage using... (Databricks documentation on viewing and analyzing data lineage)
  4. View data lineage using Unity Catalog (Azure Databricks)
  5. What is Unity Catalog? (Azure Databricks, Microsoft Learn)
  6. Read table from Unity Catalog and write table to Unity Catalog (PySpark)
  7. A Deep Dive into Unity Catalog's Architecture and Design (CelerData)
  8. Dropping Managed vs... (differences between managed and external tables in Unity Catalog)
  9. Databricks Unity Catalog SQL Commands
  10. Create a dynamic view (Databricks on AWS)
  11. Dynamic Views and Column-Level Security with Unity Catalog
  12. Unity Catalog: Secure Data with Row and Column Level... (implementing row- and column-level security)
  13. 36 Row Level Filters in UC: Filter Sensitive Data in Unity Catalog table using Row Level Security (video)
  14. Technical Deep Dive for Practitioners: Databricks Unity Catalog from A-Z
  15. Lineage system tables reference (Databricks on AWS)