SoloLakehouse → Open-Source Databricks Platform: A Complete Plan

1. Current State and Gap Analysis

✅ What you already have

Databricks capability | Existing component | Status
Object Storage (DBFS) | MinIO | ✅ Done
Metastore | Hive Metastore + PostgreSQL | ✅ Done
SQL Warehouse | Trino | ✅ Done
SQL Editor | CloudBeaver | ✅ Done
Experiment Tracking | MLflow | ✅ Done
Model Registry | MLflow | ✅ Done
Monitoring | Beszel + Prometheus + Grafana | ✅ Done
Reverse Proxy | Nginx Proxy Manager | ✅ Done
Docs | MkDocs | ✅ Done

❌ Core capabilities still missing

Databricks capability | Missing component | Priority
Spark compute engine | Apache Spark (the core!) | 🔴 P0
Interactive notebook development | JupyterHub / JupyterLab | 🔴 P0
Lakehouse table format | Apache Iceberg / Delta Lake | 🔴 P0
Workflow orchestration (Jobs) | Apache Airflow / Dagster | 🔴 P0
Data catalog (Unity Catalog) | Apache Gravitino / Polaris | 🟡 P1
Stream processing (Structured Streaming) | Spark Streaming + Kafka/Redpanda | 🟡 P1
Data ingestion (Auto Loader) | Custom Spark ingestion | 🟡 P1
Feature Store | Feast | 🟢 P2
Model serving | Seldon / BentoML / Ray Serve | 🟢 P2
Delta Sharing | delta-sharing-server | 🟢 P2
Secrets management | HashiCorp Vault | 🟢 P2

2. Target Architecture Overview

┌──────────────────────────────────────────────────────────────────────┐
│                         Nginx Proxy Manager                          │
│         (*.sololake.space → per-service reverse proxy + TLS)         │
└──────┬───────┬───────┬───────┬───────┬───────┬───────┬───────────────┘
       │       │       │       │       │       │       │
  Jupyter  CloudBeaver Airflow  MLflow  Grafana Spark-UI MkDocs
       │       │       │       │       │       │       │
┌──────┴───────┴───────┴───────┴───────┴───────┴───────┴───────────────┐
│                  Docker internal network (lake_net)                  │
│                                                                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │
│  │ Spark Master │  │ Spark Worker │  │ Spark Worker │  (compute)     │
│  │   + Thrift   │  │    ×1~N      │  │    ×1~N      │                │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘                │
│         │                 │                 │                        │
│  ┌──────┴─────────────────┴─────────────────┴───────┐                │
│  │          Apache Iceberg (table format)           │  (format)      │
│  │         Hive Metastore  ←→  PostgreSQL           │  (catalog)     │
│  └──────────────────┬──────────────────────────────┘                 │
│                     │                                                 │
│  ┌──────────────────┴──────────────────────────────┐                 │
│  │              MinIO (S3-compatible)               │  (storage)     │
│  │     s3://warehouse/  s3://mlflow/  s3://raw/     │                 │
│  └─────────────────────────────────────────────────┘                 │
│                                                                       │
│  ┌────────┐ ┌─────────┐ ┌──────────┐ ┌───────────┐ ┌────────────┐    │
│  │ Trino  │ │ Airflow │ │  MLflow  │ │Prometheus │ │  Grafana   │    │
│  │(ad-hoc)│ │ (jobs)  │ │(tracking)│ │ (metrics) │ │(dashboards)│    │
│  └────────┘ └─────────┘ └──────────┘ └───────────┘ └────────────┘    │
└───────────────────────────────────────────────────────────────────────┘

3. Phased Implementation Plan

Phase 1: Core Compute and the Lakehouse (P0, weeks 1-2)

1.1 Apache Spark (Master + Worker)

Spark is the core of Databricks. Add a Spark standalone cluster to docker-compose:

  # =========================
  # Spark Master
  # =========================
  spark-master:
    image: bitnami/spark:3.5
    container_name: sololakehouse-spark-master
    restart: unless-stopped
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_MASTER_WEBUI_PORT=8081
      # Iceberg + S3 configuration
      - SPARK_CONF_spark.sql.catalog.lakehouse=org.apache.iceberg.spark.SparkCatalog
      - SPARK_CONF_spark.sql.catalog.lakehouse.type=hive
      - SPARK_CONF_spark.sql.catalog.lakehouse.uri=thrift://hive-metastore:9083
      - SPARK_CONF_spark.sql.catalog.lakehouse.warehouse=s3a://warehouse/
      - SPARK_CONF_spark.hadoop.fs.s3a.endpoint=http://minio:9000
      - SPARK_CONF_spark.hadoop.fs.s3a.access.key=${MINIO_ROOT_USER}
      - SPARK_CONF_spark.hadoop.fs.s3a.secret.key=${MINIO_ROOT_PASSWORD}
      - SPARK_CONF_spark.hadoop.fs.s3a.path.style.access=true
      - SPARK_CONF_spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
      - SPARK_CONF_spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
      - SPARK_CONF_spark.sql.defaultCatalog=lakehouse
    ports:
      - "127.0.0.1:8081:8081"   # Spark Master Web UI
      - "127.0.0.1:7077:7077"   # Spark Master Port
    volumes:
      - ./spark/jars:/opt/bitnami/spark/ivy-jars   # extra JARs
      - ./data/spark/events:/tmp/spark-events       # Event logs
    networks:
      - lake_net

  # =========================
  # Spark Worker(s)
  # =========================
  spark-worker-1:
    image: bitnami/spark:3.5
    container_name: sololakehouse-spark-worker-1
    restart: unless-stopped
    depends_on:
      - spark-master
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_CORES=4       # adjust to your machine
      - SPARK_WORKER_MEMORY=8G     # adjust to your machine
      - SPARK_WORKER_WEBUI_PORT=8082
    ports:
      - "127.0.0.1:8082:8082"
    volumes:
      - ./spark/jars:/opt/bitnami/spark/ivy-jars
      - ./data/spark/events:/tmp/spark-events
    networks:
      - lake_net

Extra JARs required (place them under ./spark/jars/):

mkdir -p spark/jars && cd spark/jars

# Iceberg runtime for Spark 3.5
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.7.1/iceberg-spark-runtime-3.5_2.12-1.7.1.jar

# AWS S3 bundle (for MinIO)
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar

1.2 Apache Iceberg (Lakehouse table format)

Iceberg is a better fit than Delta Lake for a multi-engine setup (Spark and Trino can read and write the same tables).

Trino side — edit trino/etc/catalog/iceberg.properties

connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=${ENV:MINIO_ROOT_USER}
hive.s3.aws-secret-key=${ENV:MINIO_ROOT_PASSWORD}
hive.s3.path-style-access=true
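
To sanity-check the cross-engine wiring, a small Python sketch is shown below. It rests on assumptions not in the plan: the trino client package is installed in Jupyter, the Trino coordinator is reachable as trino:8080 on lake_net, and the catalog file above is named iceberg.properties so the catalog is called "iceberg"; adjust host and user to your setup.

from trino.dbapi import connect

# List the schemas Trino exposes through the Iceberg catalog configured above.
conn = connect(host="trino", port=8080, user="lakehouse", catalog="iceberg")
cur = conn.cursor()
cur.execute("SHOW SCHEMAS FROM iceberg")
print(cur.fetchall())  # bronze/silver/gold should appear once the namespaces exist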

MinIO bucket initialization (add an init container or run the commands manually):

  # =========================
  # MinIO Init (create buckets)
  # =========================
  minio-init:
    image: minio/mc:latest
    container_name: sololakehouse-minio-init
    depends_on:
      - minio
    entrypoint: /bin/sh
    command:
      - -c
      - |
        sleep 5
        mc alias set local http://minio:9000 ${MINIO_ROOT_USER} ${MINIO_ROOT_PASSWORD}
        mc mb --ignore-existing local/warehouse
        mc mb --ignore-existing local/raw
        mc mb --ignore-existing local/mlflow
        mc mb --ignore-existing local/airflow-logs
        echo "Buckets created successfully"
    networks:
      - lake_net

1.3 JupyterLab (notebook environment)

  # =========================
  # JupyterLab (Notebook IDE)
  # =========================
  jupyter:
    image: jupyter/pyspark-notebook:spark-3.5.0
    container_name: sololakehouse-jupyter
    restart: unless-stopped
    depends_on:
      - spark-master
    environment:
      - JUPYTER_TOKEN=${JUPYTER_TOKEN}
      - SPARK_MASTER=spark://spark-master:7077
      # S3/MinIO
      - AWS_ACCESS_KEY_ID=${MINIO_ROOT_USER}
      - AWS_SECRET_ACCESS_KEY=${MINIO_ROOT_PASSWORD}
      - AWS_DEFAULT_REGION=us-east-1
      # MLflow
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - MLFLOW_S3_ENDPOINT_URL=http://minio:9000
    ports:
      - "127.0.0.1:8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work        # persist user notebooks
      - ./spark/jars:/home/jovyan/extra-jars # Iceberg JAR
      - ./data/jupyter:/home/jovyan/.local   # pip cache
    networks:
      - lake_net
    # user Spark sessions connect to the standalone cluster (see the template below)
    command: >
      start-notebook.sh
      --ServerApp.token='${JUPYTER_TOKEN}'
      --ServerApp.root_dir=/home/jovyan/work

Notebook template — create notebooks/templates/spark_init.py

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("sololakehouse-notebook")
    .master("spark://spark-master:7077")
    .config("spark.jars", "/home/jovyan/extra-jars/iceberg-spark-runtime-3.5_2.12-1.7.1.jar")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hive")
    .config("spark.sql.catalog.lakehouse.uri", "thrift://hive-metastore:9083")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3a://warehouse/")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.defaultCatalog", "lakehouse")
    .getOrCreate()
)

# Quick test
spark.sql("CREATE NAMESPACE IF NOT EXISTS lakehouse.bronze")
spark.sql("SHOW NAMESPACES IN lakehouse").show()

1.4 Apache Airflow (workflow orchestration = Databricks Jobs)

  # =========================
  # Airflow (Workflow Orchestration)
  # =========================
  airflow-init:
    image: apache/airflow:2.10-python3.11
    container_name: sololakehouse-airflow-init
    environment: &airflow-env
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://metastore:${PG_PASSWORD}@postgres:5432/airflow
      AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY}
      AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW_SECRET_KEY}
      AIRFLOW__CORE__LOAD_EXAMPLES: "false"
      AIRFLOW__WEBSERVER__EXPOSE_CONFIG: "false"
      # S3 connection
      AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER}
      AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD}
    entrypoint: /bin/bash
    command:
      - -c
      - |
        airflow db migrate
        airflow users create \
          --username admin \
          --password ${AIRFLOW_ADMIN_PASSWORD} \
          --firstname Admin \
          --lastname User \
          --role Admin \
          --email admin@sololake.space
    networks:
      - lake_net

  airflow-webserver:
    image: apache/airflow:2.10-python3.11
    container_name: sololakehouse-airflow-webserver
    restart: unless-stopped
    depends_on:
      - postgres
    environment: *airflow-env
    ports:
      - "127.0.0.1:8085:8080"
    volumes:
      - ./airflow/dags:/opt/airflow/dags
      - ./airflow/plugins:/opt/airflow/plugins
      - ./data/airflow/logs:/opt/airflow/logs
    command: airflow webserver
    networks:
      - lake_net

  airflow-scheduler:
    image: apache/airflow:2.10-python3.11
    container_name: sololakehouse-airflow-scheduler
    restart: unless-stopped
    depends_on:
      - postgres
    environment: *airflow-env
    volumes:
      - ./airflow/dags:/opt/airflow/dags
      - ./airflow/plugins:/opt/airflow/plugins
      - ./data/airflow/logs:/opt/airflow/logs
    command: airflow scheduler
    networks:
      - lake_net
Note: Airflow needs a dedicated airflow database in PostgreSQL (the SQL_ALCHEMY_CONN above points at it). Add something like CREATE DATABASE airflow OWNER metastore; to your postgres init script.
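
For reference, a minimal "Databricks Jobs"-style DAG is sketched below. Assumptions not in the plan: the file lives under airflow/dags/, spark-submit is available to Airflow (e.g. via the Apache Spark provider or a custom image), and the job script path is illustrative.

# airflow/dags/bronze_ingest_example.py
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="bronze_ingest_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Submit a PySpark job to the standalone cluster (command is a placeholder).
    ingest = BashOperator(
        task_id="ingest_raw_to_bronze",
        bash_command=(
            "spark-submit --master spark://spark-master:7077 "
            "/opt/airflow/dags/jobs/ingest_bronze.py"
        ),
    )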

Phase 2: Data Engineering Enhancements (P1, weeks 2-4)

2.1 Medallion Architecture (Bronze → Silver → Gold)

Organize the data layers in MinIO and Iceberg:

s3://warehouse/
├── bronze/          ← raw data (append-only, kept as ingested)
│   ├── orders/
│   └── events/
├── silver/          ← cleaned data (dedup, standardization, SCD)
│   ├── dim_customers/
│   └── fact_orders/
└── gold/            ← business aggregates (reporting layer)
    ├── daily_revenue/
    └── user_cohorts/

s3://raw/            ← landing zone (raw CSV/JSON/Parquet files)

Matching Iceberg namespaces:

-- Run in Spark or Trino
CREATE NAMESPACE IF NOT EXISTS lakehouse.bronze;
CREATE NAMESPACE IF NOT EXISTS lakehouse.silver;
CREATE NAMESPACE IF NOT EXISTS lakehouse.gold;
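
The silver layer typically uses Iceberg's MERGE INTO for dedup and SCD handling, as referenced in the data-flow example later. A hedged sketch, run through spark.sql with the session from spark_init.py (table and column names are illustrative):

# Upsert bronze orders into the silver fact table (Iceberg MERGE INTO).
spark.sql("""
    MERGE INTO lakehouse.silver.fact_orders AS t
    USING lakehouse.bronze.orders AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")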

2.2 Stream Processing (Redpanda + Spark Structured Streaming)

  # =========================
  # Redpanda (Kafka-compatible streaming)
  # =========================
  redpanda:
    image: redpandadata/redpanda:latest
    container_name: sololakehouse-redpanda
    restart: unless-stopped
    command:
      - redpanda start
      - --smp 1
      - --memory 1G
      - --overprovisioned
      - --node-id 0
      - --kafka-addr PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092
      - --pandaproxy-addr 0.0.0.0:8082
      - --advertise-pandaproxy-addr redpanda:8082
    expose:
      - "29092"
      - "9092"
    volumes:
      - ./data/redpanda:/var/lib/redpanda/data
    networks:
      - lake_net

  # Redpanda Console (optional Kafka UI)
  redpanda-console:
    image: redpandadata/console:latest
    container_name: sololakehouse-redpanda-console
    restart: unless-stopped
    depends_on:
      - redpanda
    environment:
      KAFKA_BROKERS: redpanda:29092
    ports:
      - "127.0.0.1:8084:8080"
    networks:
      - lake_net

Spark Structured Streaming example (writing to Iceberg):

# Read from Kafka/Redpanda → write to an Iceberg bronze table
stream_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "redpanda:29092")
    .option("subscribe", "raw_events")
    .load()
)

parsed = stream_df.selectExpr(
    "CAST(key AS STRING)",
    "CAST(value AS STRING)",
    "topic", "partition", "offset", "timestamp"
)

(parsed.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3a://warehouse/_checkpoints/raw_events")
    .toTable("lakehouse.bronze.raw_events")
)
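
Note: the Kafka source above needs the Spark-Kafka connector on the classpath (it is not bundled with Spark). One option, sketched here under the assumption that the driver has internet access, is to pull it via spark.jars.packages when building the session; alternatively, drop the spark-sql-kafka and kafka-clients JARs into ./spark/jars/.

from pyspark.sql import SparkSession

# Same Iceberg/S3 settings as spark_init.py, plus the Kafka connector package.
spark = (SparkSession.builder
    .appName("sololakehouse-streaming")
    .master("spark://spark-master:7077")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0")
    # ... add the lakehouse catalog / s3a settings from spark_init.py here ...
    .getOrCreate()
)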

2.3 Data Catalog Upgrade (optional: Apache Gravitino)

If you want a Unity Catalog-like unified catalog across engines:

  # =========================
  # Apache Gravitino (Unity Catalog alternative)
  # =========================
  gravitino:
    image: apache/gravitino:0.7.0
    container_name: sololakehouse-gravitino
    restart: unless-stopped
    ports:
      - "127.0.0.1:8090:8090"   # 注意与 beszel 端口冲突,需调整
    environment:
      - GRAVITINO_SERVER_WEBUI_ENABLE=true
    volumes:
      - ./data/gravitino:/opt/gravitino/data
    networks:
      - lake_net
Recommendation: stick with Hive Metastore in Phase 1; it is enough for Spark + Trino. Revisit Gravitino once the platform grows.

Phase 3: ML/AI Platform Enhancements (P2, weeks 4-6)

3.1 Feature Store (Feast)

  # Feast does not need a standalone service; install it inside Jupyter
  # pip install feast[aws]
  # then point feature_store.yaml at MinIO + PostgreSQL

feast/feature_store.yaml

project: sololakehouse
provider: local
registry:
  registry_type: sql
  path: postgresql://metastore:${PG_PASSWORD}@postgres:5432/feast
offline_store:
  type: file      # or a custom Spark offline store
online_store:
  type: sqlite
  path: /data/feast/online_store.db
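
On top of that configuration, feature definitions are plain Python applied with feast apply. A minimal sketch follows; it assumes a recent Feast release, the entity, source path, and field names are illustrative, and reading from s3:// additionally needs s3fs plus MinIO endpoint settings.

# feast/features.py
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

customer = Entity(name="customer", join_keys=["customer_id"])

features_source = FileSource(
    path="s3://warehouse/gold/customer_features.parquet",  # hypothetical export
    timestamp_field="event_ts",
)

customer_features = FeatureView(
    name="customer_features",
    entities=[customer],
    schema=[
        Field(name="order_count_30d", dtype=Int64),
        Field(name="avg_order_value", dtype=Float32),
    ],
    source=features_source,
)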

3.2 Model Serving

  # =========================
  # MLflow Model Serving (simple option)
  # =========================
  mlflow-serve:
    image: ghcr.io/mlflow/mlflow:latest
    container_name: sololakehouse-mlflow-serve
    restart: unless-stopped
    depends_on:
      - mlflow
      - minio
    environment:
      MLFLOW_TRACKING_URI: http://mlflow:5000
      MLFLOW_S3_ENDPOINT_URL: http://minio:9000
      AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER}
      AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD}
    ports:
      - "127.0.0.1:5001:5001"
    command:
      - mlflow
      - models
      - serve
      - --model-uri=models:/production-model/latest
      - --host=0.0.0.0
      - --port=5001
      - --env-manager=local
    networks:
      - lake_net
For more complex serving scenarios, swap in BentoML or Ray Serve.
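
Once the server is up, clients call MLflow's standard /invocations REST endpoint. A sketch using the MLflow 2.x dataframe_split payload format (feature names are illustrative; the URL uses the host port published above):

import requests

# Score one row against the served model via the MLflow scoring endpoint.
payload = {
    "dataframe_split": {
        "columns": ["feature_a", "feature_b"],
        "data": [[1.0, 2.0]],
    }
}
resp = requests.post(
    "http://127.0.0.1:5001/invocations",
    json=payload,
    timeout=10,
)
print(resp.json())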

4. New Environment Variables (append to .env)

# ---- Spark ----
# (reuses the existing MINIO_ROOT_USER / MINIO_ROOT_PASSWORD)

# ---- Jupyter ----
JUPYTER_TOKEN=your-secure-jupyter-token

# ---- Airflow ----
AIRFLOW_FERNET_KEY=   # python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
AIRFLOW_SECRET_KEY=   # openssl rand -hex 32
AIRFLOW_ADMIN_PASSWORD=your-airflow-password

5. New NPM Reverse-Proxy Subdomains

Subdomain | Internal target | Purpose
solo-spark.sololake.space | spark-master:8081 | Spark Master UI
solo-jupyter.sololake.space | jupyter:8888 | Notebooks
solo-airflow.sololake.space | airflow-webserver:8080 | Workflow orchestration
solo-redpanda.sololake.space | redpanda-console:8080 | Kafka UI
solo-minio.sololake.space | minio:9001 | MinIO Console

6. Directory Layout

sololakehouse/
├── docker-compose.yml
├── .env
│
├── spark/
│   └── jars/                    # Iceberg, S3 JARs
│
├── trino/
│   └── etc/
│       └── catalog/
│           ├── iceberg.properties    # new
│           └── hive.properties       # existing
│
├── airflow/
│   ├── dags/                    # DAG definitions
│   └── plugins/
│
├── notebooks/
│   ├── templates/               # Spark init template
│   ├── bronze/                  # ingestion notebooks
│   ├── silver/                  # data-cleaning notebooks
│   └── gold/                    # aggregation / analysis notebooks
│
├── hive/
│   ├── conf/
│   └── lib/
│
├── monitoring/
│   └── prometheus/
│
├── docs/
│
└── data/                        # all persisted data
    ├── minio/
    ├── postgres/
    ├── spark/
    │   └── events/
    ├── airflow/
    │   └── logs/
    ├── jupyter/
    ├── redpanda/
    └── ...

7. End-to-End Data Flow Example

The following walks through a full Databricks-style workflow: raw data → cleaning → aggregation → ML training → model registration → serving.

          ┌──────────────┐
          │ Data sources │  CSV / API / Kafka
          └──────┬───────┘
                 │
          ① Airflow DAG triggers a Spark job
                 │
          ┌──────▼───────┐
          │ Bronze layer │  Spark writes to Iceberg (append-only)
          │ lakehouse.   │  s3://warehouse/bronze/
          │ bronze.*     │
          └──────┬───────┘
                 │
          ② Airflow DAG triggers the cleaning job
                 │
          ┌──────▼───────┐
          │ Silver layer │  Spark MERGE INTO (dedup, SCD Type 2)
          │ lakehouse.   │  s3://warehouse/silver/
          │ silver.*     │
          └──────┬───────┘
                 │
          ③ Airflow DAG triggers the aggregation job
                 │
          ┌──────▼───────┐
          │ Gold layer   │  Spark/Trino aggregate metrics
          │ lakehouse.   │  s3://warehouse/gold/
          │ gold.*       │
          └──────┬───────┘
                 │
        ┌────────┴────────┐
        │                 │
   ④ Trino SQL         ⑤ Jupyter Notebook
   (ad-hoc analysis)   (ML training + MLflow tracking)
        │                 │
   CloudBeaver        MLflow Model Registry
                          │
                     ⑥ MLflow Serve
                     (REST API inference)

8. Resource Sizing Estimate

Spec | Minimum | Recommended
CPU | 8 cores | 16+ cores
RAM | 32 GB | 64+ GB
Disk | 100 GB SSD | 500 GB+ NVMe
Use case | Development / learning | Small-team production

Per-service memory allocation reference (32 GB machine)

Service | Memory
Spark Master | 1 GB
Spark Worker ×1 | 8 GB
Jupyter | 4 GB
Trino | 4 GB
PostgreSQL | 1 GB
Hive Metastore | 1 GB
Airflow (webserver + scheduler) | 2 GB
MinIO | 1 GB
Redpanda | 1 GB
Other (NPM, Grafana, etc.) | 3 GB
OS headroom | 6 GB

9. Implementation Checklist

Phase 1 (weeks 1-2)

  • [ ] Download the Iceberg / S3 JARs into spark/jars/
  • [ ] Create the airflow database in PostgreSQL
  • [ ] Add Spark Master + Worker to docker-compose
  • [ ] Add JupyterLab to docker-compose
  • [ ] Configure the Trino Iceberg catalog
  • [ ] Add the MinIO init container to create the buckets
  • [ ] Add Airflow (webserver + scheduler) to docker-compose
  • [ ] Generate the Airflow Fernet key and secret key
  • [ ] Configure the new NPM subdomains
  • [ ] Test the Spark → Iceberg → MinIO path from Jupyter
  • [ ] Verify Trino can read the Iceberg tables
  • [ ] Write the first Airflow DAG (Bronze ingestion)

Phase 2 (weeks 2-4)

  • [ ] Create the Medallion namespaces
  • [ ] Add Redpanda to docker-compose
  • [ ] Implement Spark Structured Streaming → Iceberg
  • [ ] Write the Silver-layer MERGE INTO DAG
  • [ ] Write the Gold-layer aggregation DAG

Phase 3 (weeks 4-6)

  • [ ] Install and configure the Feast feature store
  • [ ] Build an end-to-end ML pipeline in Jupyter
  • [ ] Configure MLflow model serving
  • [ ] Add a Spark metrics dashboard to Grafana
  • [ ] Update the MkDocs documentation

10. Final Databricks Feature Mapping

Databricks | SoloLakehouse | Notes
Spark Compute | Spark Standalone (Bitnami) | optional upgrade to K8s
Databricks Notebooks | JupyterLab | PySpark + MLflow integration
Unity Catalog | Hive Metastore (→ Gravitino) | Hive is enough for Phase 1
Delta Lake | Apache Iceberg | friendlier to multiple engines
Databricks SQL | Trino + CloudBeaver | close to the same experience
Workflows / Jobs | Apache Airflow | more flexible
MLflow (managed) | MLflow (self-hosted) | functionally identical
Auto Loader | Airflow + Spark ingest | you write it yourself
Structured Streaming | Spark SS + Redpanda | full stream processing
Feature Store | Feast | offline / online
Model Serving | MLflow Serve / BentoML | REST API
DBFS | MinIO | S3-compatible
Monitoring | Prometheus + Grafana | more flexible
Secrets | .env (→ Vault) | upgrade when needed