Databricks Unity Catalog: A Complete Hands-On Tutorial
1. What Exactly Is Unity Catalog?
Unity Catalog is the unified data governance layer that Databricks announced at the 2021 Data+AI Summit. The core problem it solves: in a multi-Workspace, multi-cloud, multi-team environment, how do you manage the metadata, permissions, auditing, and lineage of all data and AI assets in one place?[1][2]
Before Unity Catalog, Databricks relied on the Hive Metastore: each Workspace had its own independent metastore, permissions could not span Workspaces, and governance depended entirely on third-party tools. Unity Catalog solves all of these problems at once.[1]
Its core capabilities, one line each:
- Define once, secure everywhere: permissions are defined once and take effect in every Workspace[2]
- ANSI SQL permission model: manage permissions with standard SQL GRANT/REVOKE[2]
- Built-in auditing and lineage: automatically records who accessed which data, and where data comes from and flows to[3][4]
- Data discovery: tag, search, and browse all data assets[2]
2. Architecture and Object Model
2.1 Three-Level Namespace
The core design of Unity Catalog is a three-level namespace:[5][2]
Metastore
└── Catalog (level 1: logical isolation unit, roughly a "database cluster")
    └── Schema (level 2: also called a Database; a logical grouping)
        ├── Table (level 3: structured data)
        ├── View (level 3: a saved query)
        ├── Volume (level 3: non-tabular data, such as files)
        ├── Function (level 3: UDFs)
        └── Model (level 3: MLflow-registered ML models)
A table is referenced as:
catalog_name.schema_name.table_name
For example, sales_prod.orders.orders_delta is the orders_delta table in the orders schema of the sales_prod catalog.[6][1]
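The same three-level name works anywhere a table is referenced. A minimal PySpark sketch (run in a Databricks notebook, where spark is predefined; the names are the example above):
# Fully qualified reference, no USE statements required
df = spark.table("sales_prod.orders.orders_delta")
# Or set defaults first, then use the short name
spark.sql("USE CATALOG sales_prod")
spark.sql("USE SCHEMA orders")
df = spark.table("orders_delta")  # resolves to sales_prod.orders.orders_delta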
2.2 Metastore: The Top-Level Container
The Metastore is the topmost container in Unity Catalog; it stores all metadata and manages permissions. There should be one Metastore per region, and multiple Workspaces can share the same Metastore.[7][2]
Key difference: there is one Hive Metastore per Workspace, while a Unity Catalog Metastore is shared across Workspaces. This is the root source of its governance capabilities.
2.3 The Full Object Hierarchy
┌──────────────────────────────────────┐
│ Metastore                            │ ← region level, one per region
│ ┌──────────────────────────────────┐ │
│ │ Storage Credentials              │ │ ← cloud storage credentials (IAM Role/SPN)
│ │ External Locations               │ │ ← cloud storage path + credential binding
│ │ Connections (Federation)         │ │ ← external database connections
│ ├──────────────────────────────────┤ │
│ │ Catalog: sales_prod              │ │ ← level 1
│ │ ├── Schema: orders               │ │ ← level 2
│ │ │   ├── Table (managed)          │ │ ← level 3
│ │ │   ├── Table (external)         │ │
│ │ │   ├── View                     │ │
│ │ │   ├── Volume                   │ │
│ │ │   ├── Function (UDF)           │ │
│ │ │   └── Model (MLflow)           │ │
│ │ └── Schema: analytics            │ │
│ ├──────────────────────────────────┤ │
│ │ Catalog: dev                     │ │
│ │ └── ...                          │ │
│ └──────────────────────────────────┘ │
└──────────────────────────────────────┘
3. Managed Table vs External Table
This is one of the most fundamental distinctions in Unity Catalog (a short verification sketch follows the table):[8][2]
| Aspect | Managed Table | External Table |
|---|---|---|
| Data storage | A storage location managed by Unity Catalog | An external cloud storage path you specify |
| File format | Delta Lake only | Delta, CSV, JSON, Parquet, Avro, ORC, Text |
| Lifecycle | DROP TABLE deletes metadata and data files | DROP TABLE deletes only metadata; data files remain |
| Typical use | New tables managed entirely within Databricks | Existing data in S3/ADLS, registered for governance |
| Recommendation | ✅ Recommended for most scenarios | Use when data must also be accessible outside Databricks |
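The lifecycle difference is easy to verify from a notebook. A minimal sketch; demo_cat, demo_schema, and the S3 path are hypothetical, and the external path must already be covered by an External Location you can write to (see section 4.4):
# Managed table: Unity Catalog owns both the metadata and the data files
spark.sql("CREATE TABLE demo_cat.demo_schema.managed_t (id INT)")
spark.sql("DROP TABLE demo_cat.demo_schema.managed_t")   # removes metadata AND data files
# External table: Unity Catalog owns only the metadata
spark.sql("""
    CREATE TABLE demo_cat.demo_schema.external_t (id INT)
    USING DELTA
    LOCATION 's3://my-demo-bucket/external_t/'
""")
spark.sql("DROP TABLE demo_cat.demo_schema.external_t")  # removes metadata; files stay in S3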
4. Hands-On Code
The code below can be run directly in a Databricks notebook.
4.1 Explore the Current Environment
-- Show the current Metastore
SELECT CURRENT_METASTORE();
-- Show the current Catalog
SELECT CURRENT_CATALOG();
-- Show the current Schema
SELECT CURRENT_SCHEMA();
-- List all Catalogs
SHOW CATALOGS;
-- List all Schemas in the current Catalog
SHOW SCHEMAS;
4.2 Create a Catalog and Schema
-- Create a Catalog (requires metastore admin or the CREATE CATALOG privilege)
CREATE CATALOG IF NOT EXISTS logistics_prod
COMMENT 'Logistics production data catalog';
-- Switch to the new Catalog
USE CATALOG logistics_prod;
-- Create a Schema
CREATE SCHEMA IF NOT EXISTS shipments
COMMENT 'Shipment data module';
-- Create a Schema with a custom storage location (physical data isolation)
CREATE SCHEMA IF NOT EXISTS sensitive_data
MANAGED LOCATION 's3://my-secure-bucket/sensitive/'
COMMENT 'Sensitive data, physically isolated storage';
-- Inspect the Schema
DESCRIBE SCHEMA EXTENDED shipments;
If you need a Catalog with its own storage location:
CREATE CATALOG IF NOT EXISTS hr_prod
MANAGED LOCATION 's3://mycompany-hr-prod/unity-catalog/'
COMMENT 'HR production data, stored separately for compliance';
4.3 Create a Managed Table
USE CATALOG logistics_prod;
USE SCHEMA shipments;
-- Create a managed Delta table
CREATE TABLE IF NOT EXISTS flight_delays (
    flight_id STRING,
    airline STRING,
    origin STRING,
    destination STRING,
    delay_minutes INT,
    flight_date DATE
)
PARTITIONED BY (flight_date)
COMMENT 'Flight delay data';
-- Insert data (DATE literals match the flight_date column type)
INSERT INTO flight_delays VALUES
    ('LH1234', 'Lufthansa', 'FRA', 'JFK', 15, DATE'2026-01-10'),
    ('UA5678', 'United', 'EWR', 'FRA', -5, DATE'2026-01-10'),
    ('LH9012', 'Lufthansa', 'FRA', 'NRT', 45, DATE'2026-01-11'),
    ('BA3456', 'British Airways', 'LHR', 'FRA', 0, DATE'2026-01-11');
-- Query the table
SELECT * FROM flight_delays;
-- Show table details (storage location, format, etc.)
DESCRIBE TABLE EXTENDED flight_delays;
-- Show physical storage details
DESCRIBE DETAIL flight_delays;
-- Show version history (a Delta Lake feature)
DESCRIBE HISTORY flight_delays;
4.4 Create an External Table
An external table requires a Storage Credential and an External Location to be set up first.[9][2]
-- Step 1: List existing Storage Credentials
SHOW STORAGE CREDENTIALS;
-- Step 2: Create an External Location (requires an existing Storage Credential)
CREATE EXTERNAL LOCATION IF NOT EXISTS ext_shipment_data
URL 's3://external-data-bucket/shipments/'
WITH (STORAGE CREDENTIAL my_aws_credential)
COMMENT 'External location for shipment data';
-- Step 3: Create the External Table
CREATE TABLE logistics_prod.shipments.external_shipments
USING DELTA
LOCATION 's3://external-data-bucket/shipments/delta_table/';
-- Parquet works as well
CREATE TABLE logistics_prod.shipments.legacy_data
USING PARQUET
LOCATION 's3://legacy-bucket/old-shipments/';
4.5 Working with Unity Catalog from PySpark
# ============================================
# Reading and writing Unity Catalog tables in PySpark
# ============================================
# Read a table using its three-level name
df = spark.table("logistics_prod.shipments.flight_delays")
df.show()
# Or via SQL
df = spark.sql("SELECT * FROM logistics_prod.shipments.flight_delays WHERE delay_minutes > 10")
df.show()
# Append rows to the managed table
import datetime
from pyspark.sql import Row
new_data = spark.createDataFrame([
    Row(flight_id="AF7890", airline="Air France", origin="CDG",
        destination="FRA", delay_minutes=30,
        flight_date=datetime.date(2026, 1, 12))  # a date object, matching the DATE column
])
new_data.write.mode("append").saveAsTable("logistics_prod.shipments.flight_delays")
# Create a new managed table from a DataFrame
df_delayed = df.filter(df.delay_minutes > 20)
df_delayed.write.saveAsTable("logistics_prod.shipments.severely_delayed")
# Overwrite an existing table
df_delayed.write.mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("logistics_prod.shipments.severely_delayed")
4.6 Create and Use Volumes (Managing Non-Tabular Files)
Volumes are how Unity Catalog manages non-tabular data files (uploaded CSVs, images, model files, and so on):[2]
-- Create a managed Volume
CREATE VOLUME IF NOT EXISTS logistics_prod.shipments.raw_files
COMMENT 'Raw uploaded files';
-- Create an external Volume
CREATE EXTERNAL VOLUME logistics_prod.shipments.external_files
LOCATION 's3://external-data-bucket/raw-files/';
# Read and write Volume files from Python
# Volume path format: /Volumes/<catalog>/<schema>/<volume>/
import pandas as pd
# Write a file into the Volume
pdf = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
pdf.to_csv("/Volumes/logistics_prod/shipments/raw_files/sample.csv", index=False)
# Read a file from the Volume
df = spark.read.csv(
    "/Volumes/logistics_prod/shipments/raw_files/sample.csv",
    header=True, inferSchema=True
)
df.show()
# List files in the Volume
files = dbutils.fs.ls("/Volumes/logistics_prod/shipments/raw_files/")
display(files)
5. Access Control
5.1 The Permission Model
Unity Catalog permissions are based on the ANSI SQL standard and use inheritance (see the sketch after the diagram):[1][2]
Metastore privileges
└── inherited by → Catalog privileges
    └── inherited by → Schema privileges
        └── inherited by → Table/View/Volume privileges
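Inheritance means one grant high in the hierarchy covers every child object. A minimal sketch, assuming an account-level group named analysts (hypothetical):
# One schema-level grant covers every current and future table in the schema
spark.sql("GRANT USE CATALOG ON CATALOG logistics_prod TO `analysts`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA logistics_prod.shipments TO `analysts`")
# No per-table GRANT SELECT needed: flight_delays, severely_delayed, etc. are all readable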
Core privileges:
| Privilege | Purpose |
|---|---|
| USE CATALOG | Prerequisite for accessing a Catalog |
| USE SCHEMA | Prerequisite for accessing a Schema |
| CREATE CATALOG | Create Catalogs under the Metastore |
| CREATE SCHEMA | Create Schemas under a Catalog |
| CREATE TABLE | Create tables under a Schema |
| SELECT | Read data from tables/views |
| MODIFY | INSERT/UPDATE/DELETE data |
| ALL PRIVILEGES | All privileges |
| MANAGE | Administer the object, including granting privileges to others |
5.2 Permission Commands
-- ============================================
-- Granting (GRANT)
-- ============================================
-- Let the data_analysts group access the Catalog
GRANT USE CATALOG ON CATALOG logistics_prod TO `data_analysts`;
-- Let the data_analysts group access the Schema and query a table
GRANT USE SCHEMA ON SCHEMA logistics_prod.shipments TO `data_analysts`;
GRANT SELECT ON TABLE logistics_prod.shipments.flight_delays TO `data_analysts`;
-- Let the data_engineers group create tables
GRANT USE CATALOG ON CATALOG logistics_prod TO `data_engineers`;
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA logistics_prod.shipments TO `data_engineers`;
-- Grant several privileges across the whole Catalog at once
GRANT CREATE SCHEMA, CREATE TABLE, USE CATALOG
ON CATALOG logistics_prod TO `account users`;
-- ============================================
-- Inspecting grants
-- ============================================
SHOW GRANTS ON CATALOG logistics_prod;
SHOW GRANTS ON TABLE logistics_prod.shipments.flight_delays;
SHOW GRANTS `data_analysts` ON SCHEMA logistics_prod.shipments;
-- ============================================
-- Revoking (REVOKE)
-- ============================================
REVOKE SELECT ON TABLE logistics_prod.shipments.flight_delays
FROM `data_analysts`;
-- ============================================
-- Transferring ownership
-- ============================================
ALTER TABLE logistics_prod.shipments.flight_delays
SET OWNER TO `platform_team`;
Important caveat: Unity Catalog can only grant privileges to account-level groups, not workspace-level groups.[2]
6. Dynamic Views: Row- and Column-Level Security
Dynamic views are Unity Catalog's core mechanism for fine-grained data security.[10][11]
6.1 Column-Level Security
Scenario: the auditors group sees the full email; everyone else sees REDACTED.
CREATE OR REPLACE VIEW logistics_prod.shipments.customer_view AS
SELECT
    customer_id,
    full_name,
    CASE
        WHEN is_account_group_member('auditors') THEN email
        ELSE 'REDACTED'
    END AS email,
    CASE
        WHEN is_account_group_member('finance_team') THEN phone_number
        ELSE '***-***-****'
    END AS phone_number,
    order_total
FROM logistics_prod.shipments.customers_raw;
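For the view to actually protect anything, users need SELECT on the view but no privileges on the underlying raw table. A minimal sketch (data_analysts reused from section 5; I'm assuming the VIEW securable keyword here, so check your runtime's GRANT reference):
# Analysts can query the masked view...
spark.sql("GRANT SELECT ON VIEW logistics_prod.shipments.customer_view TO `data_analysts`")
# ...but deliberately receive no grant on logistics_prod.shipments.customers_raw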
6.2 Row-Level Security
Scenario: the managers group sees all rows; everyone else sees only records with a total of 1,000,000 or less.
CREATE OR REPLACE VIEW logistics_prod.shipments.sales_filtered AS
SELECT
    user_id,
    region,
    product,
    total
FROM logistics_prod.shipments.sales_raw
WHERE
    CASE
        WHEN is_account_group_member('managers') THEN TRUE
        ELSE total <= 1000000
    END;
6.3 Advanced Data Masking
-- Users outside the auditors group see only the domain part of the email
CREATE OR REPLACE VIEW logistics_prod.shipments.sales_masked AS
SELECT
    user_id,
    region,
    CASE
        WHEN is_account_group_member('auditors') THEN email
        ELSE regexp_extract(email, '^.*@(.*)$', 1)
    END AS email_domain
FROM logistics_prod.shipments.sales_raw;
6.4 Row Filters (Set Directly on the Table, No View Needed)
On recent Databricks Runtime versions, a row filter can be applied directly to a table:[12][13]
-- Create the filter function
CREATE OR REPLACE FUNCTION logistics_prod.shipments.region_filter(region_col STRING)
RETURNS BOOLEAN
RETURN
    is_account_group_member('global_access')
    OR region_col = (SELECT region FROM logistics_prod.shipments.user_regions
                     WHERE user_email = current_user());
-- Apply it to the table
ALTER TABLE logistics_prod.shipments.sales_raw
SET ROW FILTER logistics_prod.shipments.region_filter ON (region);
7. Data Lineage
Unity Catalog automatically captures runtime data lineage across all languages (Python, SQL, Scala), down to the column level.[4][3]
7.1 What Lineage Covers
- Table-to-table read/write dependencies
- Associations between tables and Notebooks, Jobs, and Dashboards
- Cross-Workspace lineage (within the same Metastore)
- Lineage data is retained for 1 year[4]
7.2 Viewing Lineage in the UI
In the Databricks Workspace, click Catalog in the left navigation bar, find the target table, and open the Lineage tab to see an interactive lineage graph.[3]
7.3 Querying Lineage via System Tables (Programmatic)
-- Table-level lineage
SELECT
    source_table_full_name,
    target_table_full_name,
    source_type,
    target_type,
    entity_type,
    event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'logistics_prod.shipments.flight_delays'
ORDER BY event_time DESC
LIMIT 20;
-- Column-level lineage
SELECT
    source_table_full_name,
    source_column_name,
    target_table_full_name,
    target_column_name,
    event_time
FROM system.access.column_lineage
WHERE target_table_full_name = 'logistics_prod.shipments.severely_delayed'
ORDER BY event_time DESC;
7.4 Audit Logs
-- See who accessed your tables
SELECT
    event_time,
    user_identity.email AS user_email,
    action_name,
    request_params.full_name_arg AS table_name
FROM system.access.audit
WHERE action_name IN ('getTable', 'commandSubmit')
    AND request_params.full_name_arg LIKE 'logistics_prod.shipments%'
ORDER BY event_time DESC
LIMIT 50;
8. Storage Credentials and External Locations in Detail
These are the bridge between Unity Catalog and cloud storage.[2]
8.1 How They Relate
Storage Credential (cloud credential)
│
├── External Location A → s3://bucket-a/path-a/
├── External Location B → s3://bucket-a/path-b/
└── External Location C → s3://bucket-c/data/
One Storage Credential can be referenced by multiple External Locations.[2]
8.2 Setup Flow
-- Step 1: Create a Storage Credential (AWS example)
CREATE STORAGE CREDENTIAL IF NOT EXISTS aws_logistics_cred
WITH (
    AWS_IAM_ROLE.ROLE_ARN = 'arn:aws:iam::123456789:role/unity-catalog-role'
)
COMMENT 'Storage credential for logistics data';
-- Step 2: Create an External Location
CREATE EXTERNAL LOCATION IF NOT EXISTS logistics_external
URL 's3://logistics-data-lake/production/'
WITH (STORAGE CREDENTIAL aws_logistics_cred)
COMMENT 'External location for the logistics data lake';
-- Verify
DESCRIBE STORAGE CREDENTIAL aws_logistics_cred;
DESCRIBE EXTERNAL LOCATION logistics_external;
-- Grant: allow data_engineers to create external tables at this location
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION logistics_external
TO `data_engineers`;
8.3 The Managed Storage Hierarchy
The storage location of a managed table is resolved in this order of precedence (an example follows the diagram):[2]
Schema-level managed location (highest precedence)
↓ if not set
Catalog-level managed location
↓ if not set
Metastore default location (lowest precedence)
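In other words, the most specific MANAGED LOCATION wins. A minimal sketch reusing the hr_prod example from section 4.2; the payroll bucket is hypothetical, and both paths must already be covered by External Locations:
# Catalog-level default for every schema in hr_prod
spark.sql("""
    CREATE CATALOG IF NOT EXISTS hr_prod
    MANAGED LOCATION 's3://mycompany-hr-prod/unity-catalog/'
""")
# Schema-level override: managed tables in this schema land in the payroll bucket instead
spark.sql("""
    CREATE SCHEMA IF NOT EXISTS hr_prod.payroll
    MANAGED LOCATION 's3://mycompany-hr-payroll/unity-catalog/'
""")
# hr_prod.payroll tables  -> s3://mycompany-hr-payroll/unity-catalog/...
# other hr_prod tables    -> s3://mycompany-hr-prod/unity-catalog/...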
9. Querying the Information Schema
Unity Catalog ships a complete information_schema for metadata queries:
-- All tables in a Catalog
SELECT table_catalog, table_schema, table_name, table_type
FROM logistics_prod.information_schema.tables
ORDER BY table_schema, table_name;
-- All columns of a given table
SELECT column_name, data_type, is_nullable, comment
FROM logistics_prod.information_schema.columns
WHERE table_schema = 'shipments'
    AND table_name = 'flight_delays';
-- All privileges on a table
SELECT *
FROM logistics_prod.information_schema.table_privileges
WHERE table_name = 'flight_delays';
10. ML Model and Feature Store Integration
Unity Catalog manages ML assets as well:[14]
import mlflow
# Point MLflow's Model Registry at Unity Catalog
mlflow.set_registry_uri("databricks-uc")
# Register a model into Unity Catalog
model_name = "logistics_prod.ml_models.delay_predictor"
with mlflow.start_run():
    # ... train the model ...
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name=model_name
    )
# Load the model
loaded_model = mlflow.pyfunc.load_model(f"models:/{model_name}/1")
# Feature tables: any Delta table with a primary key can be a feature table
from databricks.feature_engineering import FeatureEngineeringClient
fe = FeatureEngineeringClient()
# Create a feature table
fe.create_table(
    name="logistics_prod.features.flight_features",
    primary_keys=["flight_id"],
    df=feature_df,
    description="Flight feature table"
)
11. Workspace-Catalog Binding (Environment Isolation)
You can bind specific Catalogs to specific Workspaces to isolate environments (a sketch follows):[2]
Workspace: prod-workspace → sees only the logistics_prod catalog
Workspace: dev-workspace → sees only the logistics_dev catalog
This is common in enterprises: production data and development environments are kept strictly apart, so even a user who holds permissions cannot see the production Catalog from the development Workspace.
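Binding is configured through the Catalog Explorer UI or the Unity Catalog REST API; there is no SQL command for it. A minimal Python sketch against what I believe is the bindings endpoint; the endpoint path, payload shape, workspace URL, and workspace ID are all assumptions to verify against the current API reference:
import requests

HOST = "https://myworkspace.cloud.databricks.com"  # hypothetical workspace URL
TOKEN = "<personal-access-token>"                  # keep real tokens in a secret scope
# Prerequisite: set the catalog's isolation mode to ISOLATED (via UI or the catalogs API)
# Then allow only the prod workspace (hypothetical ID) to bind the catalog
resp = requests.patch(
    f"{HOST}/api/2.1/unity-catalog/bindings/catalog/logistics_prod",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"add": [{"workspace_id": 1234567890, "binding_type": "BINDING_TYPE_READ_WRITE"}]},
)
resp.raise_for_status()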
12. Unity Catalog vs Hive Metastore
How the two compare:[1]
| Dimension | Hive Metastore | Unity Catalog |
|---|---|---|
| Namespace | 2 levels (schema.table) | 3 levels (catalog.schema.table) |
| Governance scope | Single Workspace | Across Workspaces (same region) |
| Permission model | Workspace level | Account level |
| Permission granularity | Table level | Row/column level |
| Data lineage | None | Built in, column level |
| Audit logs | Limited | Complete, queryable via System Tables |
| ML model management | Separate Model Registry | Unified inside UC |
| Multi-cloud | Not supported | AWS, Azure, GCP |
| Delta Sharing | Not supported | Native integration |
13. Best Practices
- Use Catalogs as isolation units: separate Catalogs for production/development/test, and a dedicated Catalog for sensitive data[1]
- Make groups, not individuals, the owners: easier to manage, and departures don't break permissions[1]
- Exploit permission inheritance: grant at the Catalog/Schema level and let it propagate downward[1]
- Prefer managed tables: unless an external system needs direct access to the data files[2]
- Lock down External Location privileges: only a few admins should be able to create External Locations[1]
- Enable System Tables: audit data access and lineage regularly[15]
- Use dynamic views for fine-grained security: far cheaper to maintain than multiple copies of a table[10]
- Use Workspace-Catalog Binding: hard isolation for production[2]
14. Quick Reference
-- ===== Environment exploration =====
SELECT CURRENT_METASTORE();
SELECT CURRENT_CATALOG();
SELECT CURRENT_SCHEMA();
SHOW CATALOGS;
SHOW SCHEMAS IN my_catalog;
SHOW TABLES IN my_catalog.my_schema;
-- ===== Object creation =====
CREATE CATALOG IF NOT EXISTS my_catalog;
CREATE SCHEMA IF NOT EXISTS my_catalog.my_schema;
CREATE TABLE my_catalog.my_schema.my_table (id INT, name STRING);
CREATE VOLUME my_catalog.my_schema.my_volume;
-- ===== Permissions =====
GRANT USE CATALOG ON CATALOG my_catalog TO `group_name`;
GRANT SELECT ON TABLE my_catalog.my_schema.my_table TO `group_name`;
SHOW GRANTS ON TABLE my_catalog.my_schema.my_table;
REVOKE SELECT ON TABLE my_catalog.my_schema.my_table FROM `group_name`;
-- ===== Table operations =====
DESCRIBE TABLE EXTENDED my_catalog.my_schema.my_table;
DESCRIBE DETAIL my_catalog.my_schema.my_table;
DESCRIBE HISTORY my_catalog.my_schema.my_table;
OPTIMIZE my_catalog.my_schema.my_table;
-- ===== Lineage queries =====
SELECT * FROM system.access.table_lineage WHERE target_table_full_name = '...';
SELECT * FROM system.access.column_lineage WHERE target_table_full_name = '...';
This tutorial has covered Unity Catalog end to end, from architecture to hands-on code. Combined with an existing Databricks AI Engineer certification background, mastering this material should let you handle the Unity Catalog governance and security questions on the ML Professional certification exam with confidence.
References
1. Databricks Unity Catalog 101: An In-Depth Guide (2025)
2. What is Unity Catalog? | Databricks on AWS
3. Get table lineage using...
4. View data lineage using Unity Catalog - Azure Databricks
5. What is Unity Catalog? - Azure Databricks - Microsoft Learn
6. Read table from Unity Catalog and write table to Unity Catalog
7. A Deep Dive into Unity Catalog's Architecture and Design - CelerData
8. Dropping Managed vs...
9. Databricks Unity Catalog SQL Commands
10. Create a dynamic view | Databricks on AWS
11. Dynamic Views and Column-Level Security with Unity Catalog
12. Unity Catalog: Secure Data with Row and Column Level ...
13. 36 Row Level Filters in UC | Filter Sensitive Data in Unity Catalog table using Row Level Security
14. Technical Deep Dive for Practitioners: Databricks Unity Catalog from A-Z
15. Lineage system tables reference | Databricks on AWS