SoloLakehouse v2 - Complete Build Specification

Purpose: This document is a fully executable specification for a code agent to build SoloLakehouse v2 from scratch. Every file that must be created is listed with its exact path and complete contents. Follow the sections in order. Do not skip steps.

Base: The existing docker-compose.yml (v1) contains: Nginx Proxy Manager, MinIO, Postgres 16, Hive Metastore 3.1.3, Trino 455, CloudBeaver, MLflow (sqlite), Beszel, Prometheus, Grafana, Node Exporter, cAdvisor, MkDocs. This spec upgrades and extends that base into a complete Databricks-equivalent open-source AI Lakehouse.

Table of Contents

  1. Architecture Overview
  2. Repository Layout
  3. Environment Variables — .env
  4. Postgres Initialization
  5. Nessie Catalog Configuration
  6. Trino Configuration
  7. Spark Configuration
  8. Airflow DAGs and Configuration
  9. MLflow — Upgrade to Postgres Backend
  10. Feast Feature Store
  11. BentoML Model Serving
  12. Qdrant Vector Database
  13. Marquez — Data Lineage
  14. Evidently AI — ML Monitoring
  15. Jupyter Lab — Development Environment
  16. Prometheus Scrape Configuration
  17. Grafana Provisioning
  18. Complete docker-compose.yml v2
  19. MinIO Bucket Bootstrap Script
  20. Airflow Bootstrap DAG — Frankfurt Airport Taxi Pipeline
  21. Sample Notebook — End-to-End Platform Smoke Test
  22. Startup Sequence
  23. Verification Checklist

1. Architecture Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                        SoloLakehouse v2                                  │
│                                                                          │
│  INGESTION          STORAGE            CATALOG         QUERY             │
│  ─────────          ───────            ───────         ─────             │
│  Airflow ──────▶   MinIO (S3)   ◀────  Nessie   ◀──── Trino             │
│  (ETL/ELT)          (Iceberg            (REST            (SQL)           │
│                      Parquet)            Catalog)                        │
│                                                                          │
│  COMPUTE            ML PLATFORM         AI LAYER        GOVERNANCE       │
│  ───────            ───────────         ────────        ──────────       │
│  Spark ──────▶     MLflow        ──▶   Qdrant    ──▶   Marquez           │
│  (Batch/Stream)     (Tracking +         (Vector          (Lineage)       │
│                      Registry)           Search)                         │
│                     Feast               BentoML          Evidently       │
│                     (Feature Store)     (Serving)        (ML Monitor)    │
│                                                                          │
│  OBSERVABILITY      DEVELOPER UX                                         │
│  ─────────────      ────────────                                         │
│  Prometheus         Jupyter Lab                                          │
│  Grafana            CloudBeaver                                          │
│  Beszel             MkDocs                                               │
│  Node Exporter                                                           │
│  cAdvisor                                                                │
│                                                                          │
│  REVERSE PROXY: Nginx Proxy Manager (NPM) — all services via subdomain   │
└─────────────────────────────────────────────────────────────────────────┘

Service Port Map

Service           Internal Port   Localhost Binding   NPM Subdomain (example)
NPM Admin         81              127.0.0.1:81
MinIO API         9000            expose only
MinIO Console     9001            expose only         solo-minio.sololake.space
Trino             8080            127.0.0.1:8080      solo-trino.sololake.space
CloudBeaver       8978            expose only         solo-dbeaver.sololake.space
MLflow            5000            127.0.0.1:5000      solo-mlflow.sololake.space
Airflow           8080            127.0.0.1:8083      solo-airflow.sololake.space
Spark Master UI   8080            127.0.0.1:4040      solo-spark.sololake.space
Jupyter Lab       8888            127.0.0.1:8888      solo-jupyter.sololake.space
Feast             6566            127.0.0.1:6566      solo-feast.sololake.space
BentoML           3000            127.0.0.1:3001      solo-bentoml.sololake.space
Qdrant REST       6333            127.0.0.1:6333      solo-qdrant.sololake.space
Marquez UI        3000            127.0.0.1:3003      solo-marquez.sololake.space
Evidently         8085            127.0.0.1:8085      solo-evidently.sololake.space
Nessie            19120           expose only         solo-nessie.sololake.space
Grafana           3000            127.0.0.1:3000      solo-grafana.sololake.space
Prometheus        9090            127.0.0.1:9090      solo-prometheus.sololake.space
Beszel            8090            127.0.0.1:8090      solo-beszel.sololake.space
MkDocs            8000            127.0.0.1:8000      solo-docs.sololake.space

2. Repository Layout

The agent must create the following directory and file structure before writing any file contents.

sololakehouse/
├── .env                                   # All secrets and config vars
├── docker-compose.yml                     # Complete v2 compose file
├── scripts/
│   └── bootstrap.sh                       # One-shot init script
│
├── postgres/
│   ├── pg_hba.conf                        # KEEP existing file unchanged
│   └── init/
│       └── 01_init_databases.sql          # CREATE all required databases
│
├── nessie/
│   └── application.properties            # Nessie Quarkus config (optional override)
│
├── trino/
│   └── etc/
│       ├── config.properties
│       ├── jvm.config
│       ├── node.properties
│       └── catalog/
│           ├── iceberg.properties         # NEW: Iceberg + Nessie REST catalog
│           └── tpch.properties            # Keep for testing
│
├── spark/
│   └── conf/
│       ├── spark-defaults.conf            # Iceberg + MinIO S3A config
│       └── log4j2.properties
│
├── airflow/
│   ├── dags/
│   │   ├── fra_taxi_bronze_ingestion.py   # Demo pipeline DAG
│   │   └── fra_taxi_silver_transform.py
│   └── plugins/
│       └── openlineage_plugin.py          # Optional; see Section 13.2 (openlineage-airflow covers this)
│
├── feast/
│   ├── feature_store.yaml                 # Feast project config
│   └── features/
│       └── taxi_features.py               # Feature definitions
│
├── bentoml/
│   └── services/
│       └── taxi_predictor.py              # Example BentoML service
│
├── marquez/
│   └── marquez.yml                        # Marquez server config
│
├── evidently/
│   └── config.yml                         # Evidently service config
│
├── monitoring/
│   └── prometheus/
│       └── prometheus.yml                 # UPDATED: adds all new scrape targets
│
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml
│       └── dashboards/
│           ├── dashboard.yml
│           └── sololakehouse_overview.json
│
├── notebooks/
│   └── 00_platform_smoke_test.ipynb       # End-to-end verification notebook
│
├── data/                                  # Runtime data (gitignored)
└── logs/                                  # Runtime logs (gitignored)

3. Environment Variables — .env

File: .env

The agent must create this file. All ${VAR} references in docker-compose.yml resolve here. Replace placeholder values with real secrets before running.

# ============================================================
# SoloLakehouse v2 — Environment Configuration
# ============================================================

# --- Object Storage (MinIO) ---
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=CHANGE_ME_minio_secret

# --- Postgres Master Password ---
# Used for ALL database users (nessie, mlflow, airflow, feast, marquez)
PG_PASSWORD=CHANGE_ME_postgres_secret

# --- Grafana ---
GRAFANA_ADMIN_PASSWORD=CHANGE_ME_grafana_secret

# --- Airflow ---
# Generate: python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
AIRFLOW_FERNET_KEY=CHANGE_ME_run_above_command
# Generate: openssl rand -hex 32
AIRFLOW_SECRET_KEY=CHANGE_ME_run_above_command
AIRFLOW_ADMIN_PASSWORD=CHANGE_ME_airflow_secret

# --- Jupyter ---
JUPYTER_TOKEN=CHANGE_ME_jupyter_token

# --- Beszel (keep existing values) ---
BESZEL_TOKEN=a69c9561-adc8-4513-a4fa-92b109a80b6c
BESZEL_KEY=ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJisOmj8sQ1+uOmGjfF2mV1VEI5NurR4DI1BNXU3by2N

# --- Domain (used in MLflow allowed-hosts and NPM) ---
BASE_DOMAIN=sololake.space

4. Postgres Initialization

File: postgres/init/01_init_databases.sql

Postgres runs this file automatically on first container start (it is mounted into /docker-entrypoint-initdb.d/). It creates all required databases and users, each using the single PG_PASSWORD value; replace the CHANGE_ME placeholders below with the same secret as PG_PASSWORD in .env. The existing metastore database is created by the Postgres POSTGRES_DB env var — do NOT recreate it here.

-- ============================================================
-- SoloLakehouse v2 — Database Initialization
-- Runs once on first `docker compose up`
-- ============================================================

-- Nessie catalog metadata
CREATE DATABASE nessie;
CREATE USER nessie WITH PASSWORD 'CHANGE_ME_postgres_secret';
GRANT ALL PRIVILEGES ON DATABASE nessie TO nessie;

-- MLflow experiment tracking (replaces sqlite)
CREATE DATABASE mlflow;
CREATE USER mlflow WITH PASSWORD 'CHANGE_ME_postgres_secret';
GRANT ALL PRIVILEGES ON DATABASE mlflow TO mlflow;

-- Airflow metadata
CREATE DATABASE airflow;
CREATE USER airflow WITH PASSWORD 'CHANGE_ME_postgres_secret';
GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;

-- Feast feature registry
CREATE DATABASE feast;
CREATE USER feast WITH PASSWORD 'CHANGE_ME_postgres_secret';
GRANT ALL PRIVILEGES ON DATABASE feast TO feast;

-- Marquez data lineage
CREATE DATABASE marquez;
CREATE USER marquez WITH PASSWORD 'CHANGE_ME_postgres_secret';
GRANT ALL PRIVILEGES ON DATABASE marquez TO marquez;

IMPORTANT: The agent must also update postgres/pg_hba.conf to allow md5 auth for all new users. Add the following lines to the existing pg_hba.conf, preserving the original content:

# SoloLakehouse v2 additional users
host    nessie          nessie          0.0.0.0/0               md5
host    mlflow          mlflow          0.0.0.0/0               md5
host    airflow         airflow         0.0.0.0/0               md5
host    feast           feast           0.0.0.0/0               md5
host    marquez         marquez         0.0.0.0/0               md5

5. Nessie Catalog Configuration

Project Nessie is the Git-like Iceberg REST catalog that replaces Hive Metastore. It stores metadata in Postgres and exposes an Iceberg REST endpoint consumed by Trino and Spark.

File: nessie/application.properties

# Nessie version store — backed by Postgres via JDBC
nessie.version.store.type=JDBC
quarkus.datasource.jdbc.url=jdbc:postgresql://postgres:5432/nessie
quarkus.datasource.username=nessie
quarkus.datasource.password=CHANGE_ME_postgres_secret

# HTTP
quarkus.http.port=19120
quarkus.http.host=0.0.0.0

# CORS — allow Trino, Spark, Jupyter from within docker network
quarkus.http.cors=true
quarkus.http.cors.origins=*

# Logging
quarkus.log.level=INFO
quarkus.log.category."org.projectnessie".level=INFO

Note: Nessie 0.95.0+ reads application.properties from the classpath or from a volume-mounted /deployments/config/application.properties. Mount path: ./nessie/application.properties:/deployments/config/application.properties
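
A quick reachability check from any container on lake_net (for example the Jupyter container); a minimal sketch using the requests library against Nessie's native REST API:

import requests

# The config endpoint reports the default branch of the version store.
cfg = requests.get("http://nessie:19120/api/v1/config", timeout=10).json()
print(cfg.get("defaultBranch"))   # expect "main"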

6. Trino Configuration

6.1 trino/etc/config.properties

Keep the existing file. No changes required unless you want to increase memory.
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=4GB
query.max-memory-per-node=2GB
discovery.uri=http://localhost:8080

6.2 trino/etc/jvm.config

-server
-Xmx6G
-XX:InitialRAMPercentage=80
-XX:MaxRAMPercentage=80
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:ReservedCodeCacheSize=512M
-Djdk.attach.allowAttachSelf=true
-Dfile.encoding=UTF-8

6.3 trino/etc/node.properties

node.environment=sololakehouse
node.id=trino-node-1
node.data-dir=/data/trino

6.4 trino/etc/catalog/iceberg.properties ← NEW FILE

# Iceberg connector backed by Nessie REST catalog
connector.name=iceberg

# Catalog type: REST (Nessie implements the Iceberg REST spec)
iceberg.catalog.type=rest
# Nessie exposes the Iceberg REST API under /iceberg (not under its native /api/v1 path)
iceberg.rest-catalog.uri=http://nessie:19120/iceberg/
iceberg.rest-catalog.warehouse=s3://lakehouse/

# S3-compatible storage via MinIO
fs.native-s3.enabled=true
s3.endpoint=http://minio:9000
s3.region=us-east-1
s3.path-style-access=true
s3.aws-access-key=${ENV:MINIO_ROOT_USER}
s3.aws-secret-key=${ENV:MINIO_ROOT_PASSWORD}

# Performance
iceberg.file-format=PARQUET
iceberg.compression-codec=ZSTD
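
Once Trino is up, the catalog wiring can be confirmed with the trino Python client (the same client the Airflow DAGs in Section 8 use); a minimal sketch run from the host against the published 127.0.0.1:8080 port:

import trino

# Any user name is accepted; Trino has no authentication configured in this setup.
conn = trino.dbapi.connect(host="localhost", port=8080, user="admin", catalog="iceberg")
cur = conn.cursor()
cur.execute("SHOW SCHEMAS FROM iceberg")
print(cur.fetchall())   # schemas come back as single-element rows, e.g. [['information_schema'], ...]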

6.5 trino/etc/catalog/tpch.properties — keep existing

connector.name=tpch

7. Spark Configuration

7.1 spark/conf/spark-defaults.conf

# ── Iceberg Runtime ──────────────────────────────────────────────────────────
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.uri=http://nessie:19120/api/v1
spark.sql.catalog.nessie.ref=main
spark.sql.catalog.nessie.warehouse=s3://lakehouse/
spark.sql.catalog.nessie.authentication.type=NONE

# ── MinIO / S3A ──────────────────────────────────────────────────────────────
spark.hadoop.fs.s3a.endpoint=http://minio:9000
spark.hadoop.fs.s3a.access.key=${MINIO_ROOT_USER}
spark.hadoop.fs.s3a.secret.key=${MINIO_ROOT_PASSWORD}
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.connection.ssl.enabled=false

# ── MLflow Integration ───────────────────────────────────────────────────────
spark.mlflow.trackingUri=http://mlflow:5000

# ── OpenLineage (Marquez) ────────────────────────────────────────────────────
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type=http
spark.openlineage.transport.url=http://marquez:5000
spark.openlineage.namespace=sololakehouse

# ── JARs (must be present in spark image or mounted) ────────────────────────
# These are fetched by the custom Spark image defined in docker-compose
spark.jars.packages=org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,\
  org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.95.0,\
  org.apache.hadoop:hadoop-aws:3.3.4,\
  com.amazonaws:aws-java-sdk-bundle:1.12.262,\
  io.openlineage:openlineage-spark_2.12:1.22.0
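
For reference, a minimal PySpark sketch that exercises the nessie catalog defined above (run it from the Jupyter container, which mounts this same spark-defaults.conf; the namespace and table names are illustrative):

from pyspark.sql import SparkSession

# spark-defaults.conf supplies the Iceberg extension, nessie catalog and S3A settings.
spark = (
    SparkSession.builder
    .appName("nessie-smoke")
    .master("local[*]")   # or spark://spark-master:7077 to use the standalone cluster
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.demo")
spark.sql("CREATE TABLE IF NOT EXISTS nessie.demo.events (id BIGINT, note STRING) USING iceberg")
spark.sql("INSERT INTO nessie.demo.events VALUES (1, 'hello lakehouse')")
spark.sql("SELECT * FROM nessie.demo.events").show()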

7.2 spark/conf/log4j2.properties

rootLogger.level=WARN
rootLogger.appenderRef.console.ref=ConsoleAppender

appender.console.type=Console
appender.console.name=ConsoleAppender
appender.console.layout.type=PatternLayout
appender.console.layout.pattern=%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n

logger.iceberg.name=org.apache.iceberg
logger.iceberg.level=INFO

logger.spark.name=org.apache.spark
logger.spark.level=WARN

8. Airflow DAGs and Configuration

8.1 Airflow Environment Notes

Airflow runs with LocalExecutor backed by Postgres. The webserver and scheduler are separate containers sharing the same DAG volume mount. Both containers run from the same apache/airflow:2.10.0 image with additional pip packages installed via environment variable.

Required additional pip packages (set via _PIP_ADDITIONAL_REQUIREMENTS env var):

apache-airflow-providers-apache-spark==4.9.0
apache-airflow-providers-trino==5.7.0
openlineage-airflow==1.22.0
boto3==1.34.0
trino==0.329.0

8.2 airflow/dags/fra_taxi_bronze_ingestion.py

This DAG simulates the Bronze ingestion layer for the Frankfurt Airport taxi time prediction project. In production, replace the synthetic data generator with a real ADS-B feed or with CSV uploads to MinIO.

"""
DAG: fra_taxi_bronze_ingestion
Layer: Bronze (raw ingestion)
Schedule: Daily at 02:00 UTC
Purpose: Ingest raw Frankfurt Airport taxi records into Iceberg Bronze table.
Lineage: Emits OpenLineage events to Marquez.
"""

from __future__ import annotations

import json
import random
from datetime import datetime, timedelta

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

# ── Constants ────────────────────────────────────────────────────────────────
MINIO_ENDPOINT = "http://minio:9000"
BRONZE_BUCKET  = "lakehouse"
BRONZE_PREFIX  = "bronze/fra_taxi_raw/"


def _minio_client():
    """
    Build a boto3 S3 client for MinIO.

    Credentials come from the Airflow Variables minio_access_key / minio_secret_key
    (seeded by scripts/bootstrap.sh or set in the Airflow Variables UI). Jinja
    templates such as "{{ var.value.minio_access_key }}" are only rendered in
    templated operator fields, not in plain Python callables, so the Variables
    are resolved here at task runtime instead.
    """
    from airflow.models import Variable

    return boto3.client(
        "s3",
        endpoint_url=MINIO_ENDPOINT,
        aws_access_key_id=Variable.get("minio_access_key"),
        aws_secret_access_key=Variable.get("minio_secret_key"),
    )

DEFAULT_ARGS = {
    "owner": "sololakehouse",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": False,
}


def generate_synthetic_taxi_records(**context) -> None:
    """
    Generate synthetic FRA taxi time records and upload as newline-delimited JSON to MinIO.
    Each record represents one aircraft taxi event from gate to runway or reverse.
    """
    execution_date = context["ds"]
    n_records = 200  # ~200 movements per day at FRA

    gates   = ["A01","A10","A22","B12","B20","C04","Z12","Z14","Z60"]
    runways = ["07C","07L","07R","25C","25L","25R"]

    records = []
    for i in range(n_records):
        record = {
            "event_id":         f"FRA-{execution_date}-{i:04d}",
            "execution_date":   execution_date,
            "aircraft_type":    random.choice(["A320","B737","A380","B777","E190","A350"]),
            "gate":             random.choice(gates),
            "runway":           random.choice(runways),
            "taxi_direction":   random.choice(["outbound", "inbound"]),
            "taxi_time_sec":    random.randint(120, 1800),
            "wind_speed_kt":    round(random.uniform(0, 40), 1),
            "wind_dir_deg":     random.randint(0, 359),
            "temperature_c":    round(random.uniform(-15, 38), 1),
            "visibility_m":     random.choice([800, 1500, 3000, 5000, 9999]),
            "hour_utc":         random.randint(5, 23),
            "ingested_at":      datetime.utcnow().isoformat(),
        }
        records.append(json.dumps(record))

    payload = "\n".join(records)
    key     = f"{BRONZE_PREFIX}{execution_date}/records.jsonl"

    s3 = _minio_client()
    s3.put_object(Bucket=BRONZE_BUCKET, Key=key, Body=payload.encode("utf-8"))
    print(f"[bronze] Uploaded {n_records} records to s3://{BRONZE_BUCKET}/{key}")


def register_iceberg_table_if_missing(**context) -> None:
    """
    Create Iceberg Bronze table in Nessie catalog via Trino if it does not already exist.
    This is idempotent — safe to run on every DAG execution.
    """
    import trino

    conn = trino.dbapi.connect(
        host="trino",
        port=8080,
        user="airflow",
        catalog="iceberg",
        schema="bronze",
    )
    cur = conn.cursor()

    cur.execute("CREATE SCHEMA IF NOT EXISTS iceberg.bronze WITH (location = 's3://lakehouse/bronze/')")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS iceberg.bronze.fra_taxi_raw (
            event_id        VARCHAR,
            execution_date  VARCHAR,
            aircraft_type   VARCHAR,
            gate            VARCHAR,
            runway          VARCHAR,
            taxi_direction  VARCHAR,
            taxi_time_sec   INTEGER,
            wind_speed_kt   DOUBLE,
            wind_dir_deg    INTEGER,
            temperature_c   DOUBLE,
            visibility_m    INTEGER,
            hour_utc        INTEGER,
            ingested_at     VARCHAR
        )
        WITH (
            format            = 'PARQUET',
            partitioning      = ARRAY['execution_date'],
            location          = 's3://lakehouse/bronze/fra_taxi_raw/'
        )
    """)
    print("[bronze] Iceberg table iceberg.bronze.fra_taxi_raw is ready.")
    conn.close()


def load_jsonl_to_iceberg(**context) -> None:
    """
    Read the JSONL file from MinIO and INSERT rows into the Iceberg Bronze table via Trino.
    """
    import trino
    import boto3

    execution_date = context["ds"]
    key            = f"{BRONZE_PREFIX}{execution_date}/records.jsonl"

    s3 = _minio_client()
    body    = s3.get_object(Bucket=BRONZE_BUCKET, Key=key)["Body"].read().decode()
    records = [json.loads(line) for line in body.splitlines() if line.strip()]

    conn = trino.dbapi.connect(host="trino", port=8080, user="airflow",
                               catalog="iceberg", schema="bronze")
    cur  = conn.cursor()

    batch = []
    for r in records:
        batch.append(
            f"('{r['event_id']}','{r['execution_date']}','{r['aircraft_type']}',"
            f"'{r['gate']}','{r['runway']}','{r['taxi_direction']}',"
            f"{r['taxi_time_sec']},{r['wind_speed_kt']},{r['wind_dir_deg']},"
            f"{r['temperature_c']},{r['visibility_m']},{r['hour_utc']},'{r['ingested_at']}')"
        )

    values_sql = ",\n".join(batch)
    cur.execute(f"""
        INSERT INTO iceberg.bronze.fra_taxi_raw VALUES
        {values_sql}
    """)
    print(f"[bronze] Inserted {len(records)} rows into iceberg.bronze.fra_taxi_raw "
          f"partition execution_date={execution_date}")
    conn.close()


with DAG(
    dag_id="fra_taxi_bronze_ingestion",
    default_args=DEFAULT_ARGS,
    description="Bronze layer: ingest FRA taxi raw records into Iceberg",
    schedule_interval="0 2 * * *",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["bronze", "fra", "taxi", "sololakehouse"],
) as dag:

    t1 = PythonOperator(task_id="generate_records",      python_callable=generate_synthetic_taxi_records)
    t2 = PythonOperator(task_id="ensure_iceberg_table",  python_callable=register_iceberg_table_if_missing)
    t3 = PythonOperator(task_id="load_to_iceberg",       python_callable=load_jsonl_to_iceberg)

    t1 >> t2 >> t3

8.3 airflow/dags/fra_taxi_silver_transform.py

"""
DAG: fra_taxi_silver_transform
Layer: Silver (cleaned, feature-enriched)
Schedule: Daily at 04:00 UTC (after bronze)
Purpose: Clean bronze data, add derived features, write to Silver Iceberg table.
"""

from __future__ import annotations

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

DEFAULT_ARGS = {
    "owner": "sololakehouse",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": False,
}

SILVER_DDL = """
    CREATE TABLE IF NOT EXISTS iceberg.silver.fra_taxi_features (
        event_id            VARCHAR,
        execution_date      VARCHAR,
        aircraft_type       VARCHAR,
        gate                VARCHAR,
        runway              VARCHAR,
        taxi_direction      VARCHAR,
        taxi_time_sec       INTEGER,
        taxi_time_min       DOUBLE,
        wind_speed_kt       DOUBLE,
        wind_dir_deg        INTEGER,
        temperature_c       DOUBLE,
        visibility_m        INTEGER,
        hour_utc            INTEGER,
        is_peak_hour        BOOLEAN,
        is_low_visibility   BOOLEAN,
        crosswind_component DOUBLE,
        ingested_at         VARCHAR,
        transformed_at      VARCHAR
    )
    WITH (
        format       = 'PARQUET',
        partitioning = ARRAY['execution_date'],
        location     = 's3://lakehouse/silver/fra_taxi_features/'
    )
"""

TRANSFORM_SQL = """
    INSERT INTO iceberg.silver.fra_taxi_features
    SELECT
        event_id,
        execution_date,
        aircraft_type,
        gate,
        runway,
        taxi_direction,
        taxi_time_sec,
        CAST(taxi_time_sec AS DOUBLE) / 60.0                           AS taxi_time_min,
        wind_speed_kt,
        wind_dir_deg,
        temperature_c,
        visibility_m,
        hour_utc,
        hour_utc BETWEEN 6 AND 10 OR hour_utc BETWEEN 16 AND 20       AS is_peak_hour,
        visibility_m < 1500                                            AS is_low_visibility,
        -- Crosswind component approximation (simplified)
        ABS(wind_speed_kt * SIN(RADIANS(wind_dir_deg - 70)))          AS crosswind_component,
        ingested_at,
        CAST(NOW() AS VARCHAR)                                         AS transformed_at
    FROM iceberg.bronze.fra_taxi_raw
    WHERE execution_date = '{execution_date}'
      AND taxi_time_sec > 0
      AND taxi_time_sec < 7200
"""


def ensure_silver_table(**context) -> None:
    import trino
    conn = trino.dbapi.connect(host="trino", port=8080, user="airflow",
                               catalog="iceberg", schema="silver")
    cur  = conn.cursor()
    cur.execute("CREATE SCHEMA IF NOT EXISTS iceberg.silver WITH (location = 's3://lakehouse/silver/')")
    cur.execute(SILVER_DDL)
    conn.close()


def transform_to_silver(**context) -> None:
    import trino
    execution_date = context["ds"]
    conn = trino.dbapi.connect(host="trino", port=8080, user="airflow",
                               catalog="iceberg", schema="silver")
    cur  = conn.cursor()
    cur.execute(TRANSFORM_SQL.format(execution_date=execution_date))
    conn.close()
    print(f"[silver] Transformed records for {execution_date}")


with DAG(
    dag_id="fra_taxi_silver_transform",
    default_args=DEFAULT_ARGS,
    description="Silver layer: clean + feature engineer FRA taxi data",
    schedule_interval="0 4 * * *",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["silver", "fra", "taxi", "sololakehouse"],
) as dag:

    t1 = PythonOperator(task_id="ensure_silver_table",  python_callable=ensure_silver_table)
    t2 = PythonOperator(task_id="transform_to_silver",  python_callable=transform_to_silver)

    t1 >> t2
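
After both DAGs have run for a given date, the Silver table can be spot-checked with the trino client from any container on lake_net; a small sketch (the aggregates are chosen for illustration):

import trino

conn = trino.dbapi.connect(host="trino", port=8080, user="analyst",
                           catalog="iceberg", schema="silver")
cur = conn.cursor()
cur.execute("""
    SELECT execution_date, count(*) AS n_rows, avg(taxi_time_min) AS avg_taxi_min
    FROM iceberg.silver.fra_taxi_features
    GROUP BY execution_date
    ORDER BY execution_date DESC
    LIMIT 7
""")
for row in cur.fetchall():
    print(row)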

9. MLflow — Upgrade to Postgres Backend

No separate config file is needed. The upgrade is done entirely via the docker-compose service definition (see Section 18). The key change is:

  • --backend-store-uri switches from sqlite:////mlflow/mlflow.db to postgresql://mlflow:${PG_PASSWORD}@postgres:5432/mlflow
  • The existing ./data/mlflow volume is kept but the sqlite file is no longer used.
  • Add pip install psycopg2-binary boto3 to the startup command.
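
Once the upgraded service is running, a quick check that tracking lands in the Postgres backend and artifacts land in MinIO; a minimal sketch (experiment, parameter and metric names are illustrative):

import mlflow

mlflow.set_tracking_uri("http://localhost:5000")   # or http://mlflow:5000 from inside lake_net
mlflow.set_experiment("platform-smoke-test")

with mlflow.start_run():
    mlflow.log_param("layer", "bronze")
    mlflow.log_metric("rows_ingested", 200)
# Run metadata is stored in the Postgres "mlflow" database; artifacts go to s3://mlflow.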

10. Feast Feature Store

10.1 feast/feature_store.yaml

project: sololakehouse
registry:
  registry_type: sql
  path: postgresql+psycopg2://feast:CHANGE_ME_postgres_secret@postgres:5432/feast
  cache_ttl_seconds: 60

provider: local

online_store:
  type: redis
  connection_string: "redis:6379,db=0"

offline_store:
  type: trino
  host: trino
  port: 8080
  catalog: iceberg
  connector:
    type: iceberg

entity_key_serialization_version: 2

10.2 feast/features/taxi_features.py

"""
Feast Feature Definitions for FRA Taxi Time Prediction.
Run `feast apply` from the /feast directory to register these features.
"""

from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float64, Int64, Bool, String

# ── Entity ───────────────────────────────────────────────────────────────────
taxi_event = Entity(
    name="event_id",
    description="Unique identifier for a single aircraft taxi event at FRA",
)

# ── Source (Iceberg table via Trino offline store) ────────────────────────────
fra_silver_source = FileSource(
    # In production: use TrinoSource pointing to iceberg.silver.fra_taxi_features
    # For local development, this points to a parquet export
    path="s3://lakehouse/silver/fra_taxi_features/",
    timestamp_field="ingested_at",
    s3_endpoint_override="http://minio:9000",
)

# ── Feature View ──────────────────────────────────────────────────────────────
fra_taxi_feature_view = FeatureView(
    name="fra_taxi_features",
    entities=[taxi_event],
    ttl=timedelta(days=90),
    schema=[
        Field(name="taxi_time_sec",         dtype=Int64),
        Field(name="taxi_time_min",         dtype=Float64),
        Field(name="wind_speed_kt",         dtype=Float64),
        Field(name="wind_dir_deg",          dtype=Int64),
        Field(name="temperature_c",         dtype=Float64),
        Field(name="visibility_m",          dtype=Int64),
        Field(name="hour_utc",              dtype=Int64),
        Field(name="is_peak_hour",          dtype=Bool),
        Field(name="is_low_visibility",     dtype=Bool),
        Field(name="crosswind_component",   dtype=Float64),
        Field(name="aircraft_type",         dtype=String),
        Field(name="taxi_direction",        dtype=String),
    ],
    source=fra_silver_source,
    tags={"team": "ml-platform", "project": "fra-taxi-prediction"},
)
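
After `feast apply` and a materialization run (`feast materialize-incremental <end_date>`), online features can be read back roughly as follows; a sketch that assumes it runs in the directory containing feature_store.yaml and that the view above has been materialized to Redis:

from feast import FeatureStore

store = FeatureStore(repo_path=".")
features = store.get_online_features(
    features=[
        "fra_taxi_features:taxi_time_min",
        "fra_taxi_features:is_peak_hour",
    ],
    entity_rows=[{"event_id": "FRA-2024-01-01-0001"}],   # illustrative entity key
).to_dict()
print(features)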

11. BentoML Model Serving

11.1 bentoml/services/taxi_predictor.py

"""
BentoML Service: FRA Taxi Time Predictor
Loads the latest production model from MLflow Model Registry and serves predictions.

Deploy with:
    bentoml serve taxi_predictor:svc --port 3000
"""

from __future__ import annotations

import numpy as np
import pandas as pd
import bentoml
from bentoml.io import JSON
from pydantic import BaseModel


class TaxiPredictInput(BaseModel):
    wind_speed_kt:      float
    wind_dir_deg:       int
    temperature_c:      float
    visibility_m:       int
    hour_utc:           int
    is_peak_hour:       bool
    is_low_visibility:  bool
    crosswind_component: float
    aircraft_type:      str    # encoded as int in model; mapping below
    taxi_direction:     str    # "inbound" -> 0, "outbound" -> 1


class TaxiPredictOutput(BaseModel):
    predicted_taxi_time_sec: float
    predicted_taxi_time_min: float
    model_version:           str


AIRCRAFT_TYPE_MAP  = {"A320": 0, "B737": 1, "A380": 2, "B777": 3, "E190": 4, "A350": 5}
TAXI_DIRECTION_MAP = {"inbound": 0, "outbound": 1}

# ── Load model from the BentoML model store ──────────────────────────────────
# The model must first be registered in MLflow (Models > fra_taxi_predictor) and then
# imported into the local BentoML store, e.g.:
#   bentoml.mlflow.import_model("fra_taxi_predictor", "models:/fra_taxi_predictor/Production")
runner = bentoml.mlflow.get("fra_taxi_predictor:latest").to_runner()

svc = bentoml.Service("FRATaxiPredictor", runners=[runner])


@svc.api(input=JSON(pydantic_model=TaxiPredictInput),
         output=JSON(pydantic_model=TaxiPredictOutput))
async def predict(data: TaxiPredictInput) -> TaxiPredictOutput:
    features = pd.DataFrame([{
        "wind_speed_kt":       data.wind_speed_kt,
        "wind_dir_deg":        data.wind_dir_deg,
        "temperature_c":       data.temperature_c,
        "visibility_m":        data.visibility_m,
        "hour_utc":            data.hour_utc,
        "is_peak_hour":        int(data.is_peak_hour),
        "is_low_visibility":   int(data.is_low_visibility),
        "crosswind_component": data.crosswind_component,
        "aircraft_type_enc":   AIRCRAFT_TYPE_MAP.get(data.aircraft_type, -1),
        "taxi_direction_enc":  TAXI_DIRECTION_MAP.get(data.taxi_direction, 0),
    }])

    result = await runner.predict.async_run(features)
    pred_sec = float(np.array(result).flatten()[0])

    return TaxiPredictOutput(
        predicted_taxi_time_sec=pred_sec,
        predicted_taxi_time_min=pred_sec / 60.0,
        model_version="latest",
    )
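
Once the bentoml container from Section 18 is serving, the endpoint can be exercised over HTTP (BentoML exposes each @svc.api function as a POST route, here /predict; 3001 is the host port mapping from the compose file):

import requests

payload = {
    "wind_speed_kt": 12.5, "wind_dir_deg": 240, "temperature_c": 8.0,
    "visibility_m": 5000, "hour_utc": 7, "is_peak_hour": True,
    "is_low_visibility": False, "crosswind_component": 6.2,
    "aircraft_type": "A320", "taxi_direction": "outbound",
}
resp = requests.post("http://localhost:3001/predict", json=payload, timeout=10)
print(resp.json())   # {"predicted_taxi_time_sec": ..., "predicted_taxi_time_min": ..., ...}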

12. Qdrant Vector Database

No custom config file is needed for a basic Qdrant deployment; the image defaults are used, and any overrides can be supplied via environment variables in docker-compose. The data volume is ./data/qdrant.

The Qdrant REST API is published at http://localhost:6333. The gRPC API listens on port 6334, which is only exposed on the internal docker network (reach it as qdrant:6334 from other containers).

To create a collection for RAG over FRA SOP documents (run after startup):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(host="localhost", port=6333)
client.recreate_collection(
    collection_name="fra_sop_documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
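
And a minimal upsert/search round trip against that collection, continuing with the client from the snippet above (the vector is a random placeholder standing in for a real 1024-dimensional embedding):

import numpy as np
from qdrant_client.models import PointStruct

vec = np.random.rand(1024).tolist()
client.upsert(
    collection_name="fra_sop_documents",
    points=[PointStruct(id=1, vector=vec, payload={"doc": "SOP-001 de-icing procedure"})],
)
hits = client.search(collection_name="fra_sop_documents", query_vector=vec, limit=3)
print(hits)   # the point just inserted should come back as the top hit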

13. Marquez — Data Lineage

13.1 marquez/marquez.yml

server:
  applicationConnectors:
    - type: http
      port: 5000
  adminConnectors:
    - type: http
      port: 5001

db:
  driverClass: org.postgresql.Driver
  url: jdbc:postgresql://postgres:5432/marquez
  user: marquez
  password: CHANGE_ME_postgres_secret

migrateOnStartup: true

tags: []

13.2 Airflow OpenLineage Integration

Add the following to the environment of the airflow-webserver and airflow-scheduler services in docker-compose (these three variables are already included in the shared x-airflow-env anchor in Section 18):

OPENLINEAGE_URL: http://marquez:5000
OPENLINEAGE_NAMESPACE: sololakehouse
AIRFLOW__LINEAGE__BACKEND: openlineage

No plugin file is needed — the openlineage-airflow pip package handles integration automatically.
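
To confirm lineage events are arriving, the Marquez API can be queried after a DAG run; a small sketch against the host port mapping from Section 18 (the Marquez API is published on 127.0.0.1:3002):

import requests

resp = requests.get("http://localhost:3002/api/v1/namespaces", timeout=10)
print([ns["name"] for ns in resp.json()["namespaces"]])   # expect "sololakehouse" after a DAG run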


14. Evidently AI — ML Monitoring

14.1 evidently/config.yml

# Evidently UI Server configuration
service:
  host: 0.0.0.0
  port: 8085

# Workspace directory (mapped to ./data/evidently inside container)
workspace_path: /app/workspace

# Enable all report types
ui:
  show_all_reports: true

14.2 How to Add a Monitor (run after first model deployment)

# Run this script once to set up the FRA Taxi monitoring project
# Execute from within the jupyter container or locally with proper env vars

from evidently.ui.remote import RemoteWorkspace
from evidently.report import Report                                    # used when adding report
from evidently.metric_preset import DataDriftPreset, RegressionPreset  # snapshots to the project

# RemoteWorkspace talks to the running UI service over HTTP; a plain Workspace(path)
# would treat the URL as a local filesystem path.
ws = RemoteWorkspace("http://evidently:8085")
project = ws.create_project("FRA Taxi Time Predictor")
project.description = "Monitor data drift and regression quality for taxi time model"
project.save()

15. Jupyter Lab — Development Environment

15.1 Custom Startup Script

File: notebooks/startup.sh (mounted into Jupyter container)

#!/bin/bash
# Install additional packages not in the base pyspark-notebook image
pip install --quiet \
    mlflow==2.16.0 \
    feast==0.40.0 \
    pyiceberg[s3]==0.7.1 \
    trino[sqlalchemy]==0.329.0 \
    qdrant-client==1.11.0 \
    evidently==0.4.33 \
    openlineage-python==1.22.0 \
    bentoml==1.3.0 \
    boto3==1.34.0 \
    pandas==2.2.0 \
    scikit-learn==1.5.0 \
    xgboost==2.1.0

echo "SoloLakehouse v2 packages installed."

16. Prometheus Scrape Configuration

File: monitoring/prometheus/prometheus.yml

This replaces the existing file entirely. All previous targets are preserved and new ones added.

global:
  scrape_interval:     15s
  evaluation_interval: 15s
  external_labels:
    cluster: sololakehouse

scrape_configs:

  # ── Infrastructure ──────────────────────────────────────────────────────────
  - job_name: node-exporter
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: sololakehouse-host

  - job_name: cadvisor
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          instance: sololakehouse-containers

  # ── Query Engine ─────────────────────────────────────────────────────────────
  - job_name: trino
    metrics_path: /metrics
    static_configs:
      - targets: ['trino:8080']
        labels:
          component: trino

  # ── Spark ────────────────────────────────────────────────────────────────────
  - job_name: spark-master
    metrics_path: /metrics/prometheus
    static_configs:
      - targets: ['spark-master:8080']
        labels:
          component: spark-master

  - job_name: spark-worker
    metrics_path: /metrics/prometheus
    static_configs:
      - targets: ['spark-worker:8081']
        labels:
          component: spark-worker

  # ── ML Platform ──────────────────────────────────────────────────────────────
  - job_name: mlflow
    metrics_path: /metrics
    static_configs:
      - targets: ['mlflow:5000']
        labels:
          component: mlflow

  # ── Vector DB ────────────────────────────────────────────────────────────────
  - job_name: qdrant
    metrics_path: /metrics
    static_configs:
      - targets: ['qdrant:6333']
        labels:
          component: qdrant

  # ── Nessie ───────────────────────────────────────────────────────────────────
  - job_name: nessie
    metrics_path: /q/metrics
    static_configs:
      - targets: ['nessie:19120']
        labels:
          component: nessie

17. Grafana Provisioning

17.1 grafana/provisioning/datasources/prometheus.yml

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

17.2 grafana/provisioning/dashboards/dashboard.yml

apiVersion: 1
providers:
  - name: SoloLakehouse
    folder: SoloLakehouse
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards

17.3 grafana/provisioning/dashboards/sololakehouse_overview.json

{
  "__inputs": [],
  "__requires": [
    { "type": "datasource", "id": "prometheus", "version": "1.0.0" }
  ],
  "annotations": { "list": [] },
  "description": "SoloLakehouse v2 — Platform Overview",
  "editable": true,
  "graphTooltip": 1,
  "id": null,
  "panels": [
    {
      "datasource": "Prometheus",
      "fieldConfig": { "defaults": { "unit": "percent" } },
      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
      "id": 1,
      "title": "Host CPU Usage",
      "type": "stat",
      "targets": [{
        "expr": "100 - (avg(rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)",
        "legendFormat": "CPU %"
      }]
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": { "defaults": { "unit": "bytes" } },
      "gridPos": { "h": 4, "w": 6, "x": 6, "y": 0 },
      "id": 2,
      "title": "Host Memory Used",
      "type": "stat",
      "targets": [{
        "expr": "node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes",
        "legendFormat": "Memory Used"
      }]
    },
    {
      "datasource": "Prometheus",
      "gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
      "id": 3,
      "title": "Running Containers",
      "type": "stat",
      "targets": [{
        "expr": "count(container_last_seen{container!='',image!=''})",
        "legendFormat": "Containers"
      }]
    }
  ],
  "schemaVersion": 38,
  "title": "SoloLakehouse v2 Overview",
  "uid": "sololakehouse-overview",
  "version": 1
}

18. Complete docker-compose.yml v2

This is the complete replacement for the existing docker-compose.yml. Every service from v1 is preserved. New services are added. Changed services are noted inline.

# ============================================================
# SoloLakehouse v2 — Complete Docker Compose
# ============================================================

x-airflow-env: &airflow-env
  AIRFLOW__CORE__EXECUTOR: LocalExecutor
  AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${PG_PASSWORD}@postgres:5432/airflow
  AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY}
  AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW_SECRET_KEY}
  AIRFLOW__CORE__DAGS_FOLDER: /opt/airflow/dags
  AIRFLOW__CORE__LOAD_EXAMPLES: "false"
  AIRFLOW__LINEAGE__BACKEND: openlineage
  OPENLINEAGE_URL: http://marquez:5000
  OPENLINEAGE_NAMESPACE: sololakehouse
  MLFLOW_TRACKING_URI: http://mlflow:5000
  MLFLOW_S3_ENDPOINT_URL: http://minio:9000
  AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER}
  AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD}
  _PIP_ADDITIONAL_REQUIREMENTS: >-
    apache-airflow-providers-apache-spark==4.9.0
    apache-airflow-providers-trino==5.7.0
    openlineage-airflow==1.22.0
    boto3==1.34.0
    trino==0.329.0

services:

  # =========================
  # Nginx Proxy Manager
  # (UNCHANGED from v1)
  # =========================
  npm:
    image: jc21/nginx-proxy-manager:latest
    container_name: sololakehouse-npm
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
      - "127.0.0.1:81:81"
    volumes:
      - ./data/npm/data:/data
      - ./data/npm/letsencrypt:/etc/letsencrypt
      - ./logs/npm:/var/log/nginx
    networks:
      - lake_net

  # =========================
  # MinIO Object Storage
  # (UNCHANGED from v1)
  # =========================
  minio:
    image: minio/minio:RELEASE.2024-11-07T00-52-20Z
    container_name: sololakehouse-minio
    restart: unless-stopped
    expose:
      - "9000"
      - "9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
    command: server /data --console-address ":9001"
    volumes:
      - ./data/minio:/data
    networks:
      - lake_net

  # =========================
  # MinIO Init (one-shot bucket creation)
  # NEW in v2
  # =========================
  minio-init:
    image: minio/mc:latest
    container_name: sololakehouse-minio-init
    restart: "no"
    depends_on:
      - minio
    entrypoint: >
      /bin/sh -c "
        sleep 5;
        mc alias set local http://minio:9000 ${MINIO_ROOT_USER} ${MINIO_ROOT_PASSWORD};
        mc mb -p local/lakehouse;
        mc mb -p local/mlflow;
        mc mb -p local/airflow-logs;
        mc mb -p local/feast-offline;
        echo 'MinIO buckets created.';
      "
    networks:
      - lake_net

  # =========================
  # Postgres
  # CHANGED in v2: added init SQL directory
  # =========================
  postgres:
    image: postgres:16-alpine
    container_name: sololakehouse-postgres
    restart: unless-stopped
    command:
      [
        "postgres",
        "-c", "password_encryption=md5",
        "-c", "hba_file=/etc/postgres/pg_hba.conf"
      ]
    environment:
      POSTGRES_DB: metastore
      POSTGRES_USER: metastore
      POSTGRES_PASSWORD: ${PG_PASSWORD}
    volumes:
      - ./data/postgres:/var/lib/postgresql/data
      - ./postgres/pg_hba.conf:/etc/postgres/pg_hba.conf:ro
      - ./postgres/init:/docker-entrypoint-initdb.d:ro
    expose:
      - "5432"
    networks:
      - lake_net

  # =========================
  # Hive Metastore
  # KEPT for backward compatibility (Trino hive catalog still works)
  # Primary catalog is now Nessie — see iceberg.properties
  # =========================
  hive-metastore:
    image: apache/hive:3.1.3
    container_name: sololakehouse-hive-metastore
    restart: unless-stopped
    depends_on:
      - postgres
    environment:
      - HADOOP_CLIENT_OPTS=-Xmx1G
    volumes:
      - ./hive/conf:/opt/hive/conf
      - ./hive/lib:/opt/hive/lib/custom
      - ./data/hive-metastore/warehouse:/warehouse
    expose:
      - "9083"
    networks:
      - lake_net
    entrypoint: /bin/bash
    command:
      - -c
      - >
        export HADOOP_CLASSPATH=/opt/hive/lib/custom/*:$HADOOP_CLASSPATH;
        /opt/hive/bin/schematool -dbType postgres -initSchema --verbose || true;
        /opt/hive/bin/hive --service metastore

  # =========================
  # Project Nessie (Git-like Iceberg REST Catalog)
  # NEW in v2 — replaces Hive as primary catalog
  # =========================
  nessie:
    image: projectnessie/nessie:0.95.0
    container_name: sololakehouse-nessie
    restart: unless-stopped
    depends_on:
      - postgres
    environment:
      nessie.version.store.type: JDBC
      quarkus.datasource.jdbc.url: jdbc:postgresql://postgres:5432/nessie
      quarkus.datasource.username: nessie
      quarkus.datasource.password: ${PG_PASSWORD}
      quarkus.http.port: "19120"
      quarkus.http.host: "0.0.0.0"
      quarkus.http.cors: "true"
      quarkus.http.cors.origins: "*"
    expose:
      - "19120"
    networks:
      - lake_net

  # =========================
  # Trino (SQL Query Engine)
  # CHANGED in v2: now uses Iceberg + Nessie catalog
  # =========================
  trino:
    image: trinodb/trino:455
    container_name: sololakehouse-trino
    restart: unless-stopped
    depends_on:
      - nessie
      - minio
      - hive-metastore
    ports:
      - "127.0.0.1:8080:8080"
    volumes:
      - ./trino/etc:/etc/trino
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
    networks:
      - lake_net

  # =========================
  # Apache Spark (Standalone Cluster)
  # NEW in v2 — Databricks compute layer equivalent
  # =========================
  spark-master:
    image: bitnami/spark:3.5.3
    container_name: sololakehouse-spark-master
    restart: unless-stopped
    environment:
      SPARK_MODE: master
      SPARK_MASTER_HOST: spark-master
      SPARK_RPC_AUTHENTICATION_ENABLED: "no"
      SPARK_RPC_ENCRYPTION_ENABLED: "no"
      AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER}
      AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD}
    expose:
      - "7077"
    ports:
      - "127.0.0.1:4040:8080"
    volumes:
      - ./spark/conf/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf:ro
      - ./spark/conf/log4j2.properties:/opt/bitnami/spark/conf/log4j2.properties:ro
      - ./data/spark-events:/tmp/spark-events
    networks:
      - lake_net

  spark-worker:
    image: bitnami/spark:3.5.3
    container_name: sololakehouse-spark-worker
    restart: unless-stopped
    depends_on:
      - spark-master
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: spark://spark-master:7077
      SPARK_WORKER_CORES: "4"
      SPARK_WORKER_MEMORY: 8G
      AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER}
      AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD}
    expose:
      - "8081"
    volumes:
      - ./spark/conf/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf:ro
      - ./data/spark-events:/tmp/spark-events
    networks:
      - lake_net

  # =========================
  # CloudBeaver (Web SQL IDE)
  # (UNCHANGED from v1)
  # =========================
  cloudbeaver:
    image: dbeaver/cloudbeaver:25.3.3
    container_name: sololakehouse-cloudbeaver
    restart: unless-stopped
    depends_on:
      - trino
    expose:
      - "8978"
    volumes:
      - ./data/cloudbeaver:/opt/cloudbeaver/workspace
    networks:
      - lake_net

  # =========================
  # MLflow
  # CHANGED in v2: Postgres backend instead of SQLite
  # =========================
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.16.0
    container_name: sololakehouse-mlflow
    restart: unless-stopped
    depends_on:
      - postgres
      - minio
    ports:
      - "127.0.0.1:5000:5000"
    environment:
      MLFLOW_S3_ENDPOINT_URL: http://minio:9000
      AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER}
      AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD}
      AWS_DEFAULT_REGION: us-east-1
      MLFLOW_S3_IGNORE_TLS: "true"
    volumes:
      - ./data/mlflow:/mlflow
    command:
      - /bin/sh
      - -lc
      - >
        pip install --quiet psycopg2-binary boto3 &&
        mlflow server
        --host 0.0.0.0
        --port 5000
        --backend-store-uri postgresql://mlflow:${PG_PASSWORD}@postgres:5432/mlflow
        --default-artifact-root s3://mlflow
        --allowed-hosts solo-mlflow.sololake.space,localhost,127.0.0.1
    networks:
      - lake_net

  # =========================
  # Airflow Webserver
  # NEW in v2 — Databricks Workflows equivalent
  # =========================
  airflow-webserver:
    image: apache/airflow:2.10.0
    container_name: sololakehouse-airflow-web
    restart: unless-stopped
    depends_on:
      - postgres
      - minio
    environment:
      <<: *airflow-env
      _AIRFLOW_WWW_USER_CREATE: "true"
      _AIRFLOW_WWW_USER_USERNAME: admin
      _AIRFLOW_WWW_USER_PASSWORD: ${AIRFLOW_ADMIN_PASSWORD}
    ports:
      - "127.0.0.1:8083:8080"
    volumes:
      - ./airflow/dags:/opt/airflow/dags
      - ./airflow/plugins:/opt/airflow/plugins
      - ./data/airflow:/opt/airflow/logs
    command: webserver
    networks:
      - lake_net

  airflow-scheduler:
    image: apache/airflow:2.10.0
    container_name: sololakehouse-airflow-scheduler
    restart: unless-stopped
    depends_on:
      - airflow-webserver
    environment:
      <<: *airflow-env
    volumes:
      - ./airflow/dags:/opt/airflow/dags
      - ./airflow/plugins:/opt/airflow/plugins
      - ./data/airflow:/opt/airflow/logs
    command: scheduler
    networks:
      - lake_net

  airflow-init:
    image: apache/airflow:2.10.0
    container_name: sololakehouse-airflow-init
    restart: "no"
    depends_on:
      - postgres
    environment:
      <<: *airflow-env
    command: db migrate
    networks:
      - lake_net

  # =========================
  # Redis (Online store for Feast)
  # NEW in v2
  # =========================
  redis:
    image: redis:7.4-alpine
    container_name: sololakehouse-redis
    restart: unless-stopped
    expose:
      - "6379"
    volumes:
      - ./data/redis:/data
    command: redis-server --appendonly yes
    networks:
      - lake_net

  # =========================
  # Feast Feature Server
  # NEW in v2 — Databricks Feature Engineering equivalent
  # =========================
  feast:
    image: feastdev/feature-server:0.40.0
    container_name: sololakehouse-feast
    restart: unless-stopped
    depends_on:
      - postgres
      - redis
      - trino
    environment:
      FEAST_USAGE: "False"
    volumes:
      - ./feast/feature_store.yaml:/app/feature_store.yaml:ro
      - ./feast/features:/app/features:ro
    ports:
      - "127.0.0.1:6566:6566"
    command: feast serve --host 0.0.0.0 --port 6566
    networks:
      - lake_net

  # =========================
  # BentoML Model Serving
  # NEW in v2 — Databricks Model Serving equivalent
  # =========================
  bentoml:
    image: python:3.11-slim
    container_name: sololakehouse-bentoml
    restart: unless-stopped
    depends_on:
      - mlflow
    environment:
      MLFLOW_TRACKING_URI: http://mlflow:5000
      MLFLOW_S3_ENDPOINT_URL: http://minio:9000
      AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER}
      AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD}
      BENTOML_HOME: /bentoml
    volumes:
      - ./data/bentoml:/bentoml
      - ./bentoml/services:/services
    ports:
      - "127.0.0.1:3001:3000"
    command: >
      /bin/sh -c "
        pip install --quiet bentoml==1.3.0 mlflow==2.16.0 boto3 scikit-learn pandas numpy &&
        cd /services &&
        bentoml serve taxi_predictor:svc --host 0.0.0.0 --port 3000 --reload
      "
    networks:
      - lake_net

  # =========================
  # Qdrant Vector Database
  # NEW in v2 — Databricks Vector Search equivalent
  # =========================
  qdrant:
    image: qdrant/qdrant:v1.11.0
    container_name: sololakehouse-qdrant
    restart: unless-stopped
    expose:
      - "6333"
      - "6334"
    ports:
      - "127.0.0.1:6333:6333"
    volumes:
      - ./data/qdrant:/qdrant/storage
    networks:
      - lake_net

  # =========================
  # Marquez (Data Lineage)
  # NEW in v2 — Unity Catalog Lineage equivalent
  # =========================
  marquez:
    image: marquezproject/marquez:0.50.0
    container_name: sololakehouse-marquez
    restart: unless-stopped
    depends_on:
      - postgres
    volumes:
      - ./marquez/marquez.yml:/opt/marquez/marquez.yml:ro
    expose:
      - "5000"
      - "5001"
    ports:
      - "127.0.0.1:3002:5000"
    command: ["--config", "/opt/marquez/marquez.yml"]
    networks:
      - lake_net

  marquez-web:
    image: marquezproject/marquez-web:0.50.0
    container_name: sololakehouse-marquez-web
    restart: unless-stopped
    depends_on:
      - marquez
    environment:
      MARQUEZ_HOST: marquez
      MARQUEZ_PORT: "5000"
    expose:
      - "3000"
    ports:
      - "127.0.0.1:3003:3000"
    networks:
      - lake_net

  # =========================
  # Evidently AI (ML Monitoring)
  # NEW in v2 — Lakehouse Monitoring equivalent
  # =========================
  evidently:
    image: evidently/evidently-service:latest
    container_name: sololakehouse-evidently
    restart: unless-stopped
    volumes:
      - ./data/evidently:/app/workspace
      - ./evidently/config.yml:/app/config.yml:ro
    ports:
      - "127.0.0.1:8085:8085"
    networks:
      - lake_net

  # =========================
  # Jupyter Lab (Development Environment)
  # NEW in v2
  # =========================
  jupyter:
    image: jupyter/pyspark-notebook:spark-3.5.0
    container_name: sololakehouse-jupyter
    restart: unless-stopped
    depends_on:
      - spark-master
      - mlflow
      - minio
      - trino
    environment:
      JUPYTER_ENABLE_LAB: "yes"
      JUPYTER_TOKEN: ${JUPYTER_TOKEN}
      MLFLOW_TRACKING_URI: http://mlflow:5000
      MLFLOW_S3_ENDPOINT_URL: http://minio:9000
      AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER}
      AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD}
      SPARK_MASTER: spark://spark-master:7077
      OPENLINEAGE_URL: http://marquez:5000
    volumes:
      - ./notebooks:/home/jovyan/work
      - ./spark/conf/spark-defaults.conf:/usr/local/spark/conf/spark-defaults.conf:ro
      - ./notebooks/startup.sh:/usr/local/bin/before-notebook.d/startup.sh:ro
    ports:
      - "127.0.0.1:8888:8888"
    networks:
      - lake_net

  # =========================
  # Beszel Hub
  # (UNCHANGED from v1)
  # =========================
  beszel:
    image: henrygd/beszel:latest
    container_name: sololakehouse-beszel
    restart: unless-stopped
    ports:
      - "127.0.0.1:8090:8090"
    environment:
      APP_URL: http://localhost:8090
    volumes:
      - ./data/beszel/hub:/beszel_data
      - ./data/beszel/socket:/beszel_socket
    networks:
      - lake_net

  beszel-agent:
    image: henrygd/beszel-agent:latest
    container_name: sololakehouse-beszel-agent
    restart: unless-stopped
    network_mode: host
    volumes:
      - ./data/beszel/agent:/var/lib/beszel-agent
      - ./data/beszel/socket:/beszel_socket
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      LISTEN: /beszel_socket/beszel.sock
      HUB_URL: http://localhost:8090
      TOKEN: ${BESZEL_TOKEN}
      KEY: ${BESZEL_KEY}

  # =========================
  # Prometheus
  # CHANGED in v2: new prometheus.yml with more scrape targets
  # =========================
  prometheus:
    image: prom/prometheus:latest
    container_name: sololakehouse-prometheus
    restart: unless-stopped
    ports:
      - "127.0.0.1:9090:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./data/prometheus:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
    networks:
      - lake_net

  # =========================
  # Grafana
  # CHANGED in v2: provisioning mounts added
  # =========================
  grafana:
    image: grafana/grafana:latest
    container_name: sololakehouse-grafana
    restart: unless-stopped
    depends_on:
      - prometheus
    ports:
      - "127.0.0.1:3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_SERVER_SERVE_FROM_SUB_PATH: "false"
    volumes:
      - ./data/grafana:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    networks:
      - lake_net

  # =========================
  # Node Exporter
  # (UNCHANGED from v1)
  # =========================
  node-exporter:
    image: prom/node-exporter:latest
    container_name: sololakehouse-node-exporter
    restart: unless-stopped
    expose:
      - "9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
    networks:
      - lake_net

  # =========================
  # cAdvisor
  # (UNCHANGED from v1)
  # =========================
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: sololakehouse-cadvisor
    restart: unless-stopped
    expose:
      - "8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - lake_net

  # =========================
  # MkDocs Material
  # (UNCHANGED from v1)
  # =========================
  mkdocs:
    image: squidfunk/mkdocs-material:latest
    container_name: sololakehouse-mkdocs
    restart: unless-stopped
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - ./docs:/docs
      - ./docs/data/mkdocs-cache:/root/.cache
    environment:
      - PYTHONUNBUFFERED=1
    command:
      - serve
      - --dev-addr=0.0.0.0:8000
      - --livereload
      - --dirty
    networks:
      - lake_net


networks:
  lake_net:
    name: sololakehouse_lake_net
    driver: bridge

19. MinIO Bucket Bootstrap Script

File: scripts/bootstrap.sh

This script is idempotent. Run it once after docker compose up -d to verify the platform is healthy. It does NOT replace the minio-init container (which runs automatically); it adds post-startup verifications and Airflow variable seeding.

#!/usr/bin/env bash
set -euo pipefail

# Load .env
set -a; source .env; set +a

echo "======================================================"
echo " SoloLakehouse v2 — Bootstrap & Verification Script"
echo "======================================================"

# ── 1. Wait for Postgres ──────────────────────────────────────────────────────
echo "[1/7] Waiting for Postgres..."
until docker exec sololakehouse-postgres pg_isready -U metastore > /dev/null 2>&1; do
  sleep 2
done
echo "      Postgres: OK"

# ── 2. Wait for Nessie ───────────────────────────────────────────────────────
echo "[2/7] Waiting for Nessie REST catalog..."
# Nessie is only exposed on the internal docker network, so the host cannot curl it
# directly; confirm the container is running and let step 4 verify the catalog via Trino.
for _ in $(seq 1 30); do
  if [ "$(docker inspect -f '{{.State.Running}}' sololakehouse-nessie 2>/dev/null)" = "true" ]; then
    echo "      Nessie: OK (reachable as nessie:19120 inside the docker network)"
    break
  fi
  sleep 3
done

# ── 3. Wait for Trino ────────────────────────────────────────────────────────
echo "[3/7] Waiting for Trino..."
until curl -sf http://localhost:8080/v1/info > /dev/null 2>&1; do
  sleep 3
done
echo "      Trino: OK"

# ── 4. Verify Iceberg catalog in Trino ───────────────────────────────────────
echo "[4/7] Verifying Iceberg catalog via Trino..."
docker exec sololakehouse-trino \
  trino --execute "SHOW CATALOGS" 2>/dev/null | grep -q "iceberg" \
  && echo "      iceberg catalog: OK" \
  || echo "      WARNING: iceberg catalog not yet available — check nessie logs"

# ── 5. Wait for MLflow ───────────────────────────────────────────────────────
echo "[5/7] Waiting for MLflow..."
until curl -sf http://localhost:5000/health > /dev/null 2>&1; do
  sleep 3
done
echo "      MLflow: OK"

# ── 6. Wait for Airflow ──────────────────────────────────────────────────────
echo "[6/7] Waiting for Airflow Webserver..."
until curl -sf http://localhost:8083/health > /dev/null 2>&1; do
  sleep 5
done
echo "      Airflow: OK"

# Seed Airflow Variables for DAGs
echo "      Seeding Airflow Variables..."
docker exec sololakehouse-airflow-web \
  airflow variables set minio_access_key "${MINIO_ROOT_USER}" || true
docker exec sololakehouse-airflow-web \
  airflow variables set minio_secret_key "${MINIO_ROOT_PASSWORD}" || true

# ── 7. Create Iceberg schemas ────────────────────────────────────────────────
echo "[7/7] Creating Iceberg Bronze/Silver/Gold schemas..."
docker exec sololakehouse-trino trino --execute \
  "CREATE SCHEMA IF NOT EXISTS iceberg.bronze WITH (location = 's3://lakehouse/bronze/')" \
  2>/dev/null && echo "      bronze: OK" || true
docker exec sololakehouse-trino trino --execute \
  "CREATE SCHEMA IF NOT EXISTS iceberg.silver WITH (location = 's3://lakehouse/silver/')" \
  2>/dev/null && echo "      silver: OK" || true
docker exec sololakehouse-trino trino --execute \
  "CREATE SCHEMA IF NOT EXISTS iceberg.gold WITH (location = 's3://lakehouse/gold/')" \
  2>/dev/null && echo "      gold:   OK" || true

echo ""
echo "======================================================"
echo " Bootstrap complete. SoloLakehouse v2 is ready."
echo "======================================================"
echo ""
echo "  MLflow:    http://localhost:5000"
echo "  Airflow:   http://localhost:8083"
echo "  Trino:     http://localhost:8080"
echo "  Jupyter:   http://localhost:8888"
echo "  Grafana:   http://localhost:3000"
echo "  Qdrant:    http://localhost:6333"
echo "  Marquez:   http://localhost:3002"
echo "  Evidently: http://localhost:8085"
echo ""

20. Airflow Bootstrap DAG — Frankfurt Airport Taxi Pipeline

The two DAG files are specified in Section 8. Place both files in ./airflow/dags/. They will be auto-discovered by the scheduler.
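
To confirm the scheduler has actually discovered both DAGs, running airflow dags list inside the sololakehouse-airflow-web container works from the CLI; the sketch below does the same through the stable REST API, assuming the basic-auth API backend is enabled and that AIRFLOW_ADMIN_USER / AIRFLOW_ADMIN_PASSWORD stand in for whichever admin credentials .env defines:

# Check DAG discovery via the Airflow 2.x stable REST API (GET /api/v1/dags).
import os

import requests

# Export these before running; the variable names are placeholders.
resp = requests.get(
    "http://localhost:8083/api/v1/dags",
    auth=(os.getenv("AIRFLOW_ADMIN_USER", "admin"), os.environ["AIRFLOW_ADMIN_PASSWORD"]),
    timeout=10,
)
resp.raise_for_status()
dag_ids = {d["dag_id"] for d in resp.json()["dags"]}
missing = {"fra_taxi_bronze_ingestion", "fra_taxi_silver_transform"} - dag_ids
print("Missing DAGs:" if missing else "Both FRA taxi DAGs discovered.", missing or "")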

21. Sample Notebook — End-to-End Platform Smoke Test

File: notebooks/00_platform_smoke_test.ipynb

Create this as a standard Jupyter notebook with the following cells; each cell corresponds to one platform component. A sketch for assembling these cells into the .ipynb file follows the cell listing.
# Cell 1 — Title
# SoloLakehouse v2 — End-to-End Platform Smoke Test
# Run this notebook to verify all platform components are working together.

# Cell 2 — MinIO connectivity
import os

import boto3

# Credentials: read from the environment if the Jupyter container exports
# MINIO_ROOT_USER / MINIO_ROOT_PASSWORD, otherwise fall back to the
# placeholders below (replace them with the values from .env).
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id=os.getenv("MINIO_ROOT_USER", "minioadmin"),
    aws_secret_access_key=os.getenv("MINIO_ROOT_PASSWORD", "CHANGE_ME"),
)
buckets = [b["Name"] for b in s3.list_buckets()["Buckets"]]
print("MinIO buckets:", buckets)
assert "lakehouse" in buckets, "lakehouse bucket missing"
assert "mlflow" in buckets, "mlflow bucket missing"
print("✅ MinIO: OK")

# Cell 3 — Trino + Iceberg
import trino

conn = trino.dbapi.connect(host="trino", port=8080, user="smoke-test")
cur = conn.cursor()
cur.execute("SHOW CATALOGS")
catalogs = [row[0] for row in cur.fetchall()]
print("Trino catalogs:", catalogs)
assert "iceberg" in catalogs, "iceberg catalog not found in Trino"
print("✅ Trino + Iceberg catalog: OK")

# Cell 4 — Write an Iceberg table via Trino
cur.execute("""
    CREATE TABLE IF NOT EXISTS iceberg.bronze.smoke_test (
        id INTEGER, value VARCHAR
    ) WITH (format = 'PARQUET', location = 's3://lakehouse/bronze/smoke_test/')
""")
cur.execute("INSERT INTO iceberg.bronze.smoke_test VALUES (1, 'hello_sololakehouse')")
cur.execute("SELECT * FROM iceberg.bronze.smoke_test")
rows = cur.fetchall()
print("Iceberg write/read:", rows)
assert rows[0][1] == "hello_sololakehouse"
print("✅ Iceberg ACID write via Trino: OK")

# Cell 5 — PySpark with Iceberg + Nessie
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("smoke-test")
    .master("spark://spark-master:7077")
    .getOrCreate())

df = spark.createDataFrame([(1, "spark-test")], ["id", "name"])
df.writeTo("nessie.bronze.spark_smoke_test").createOrReplace()
result = spark.table("nessie.bronze.spark_smoke_test").collect()
print("Spark + Nessie write:", result)
print("✅ PySpark + Nessie Iceberg: OK")
spark.stop()

# Cell 6 — MLflow
import mlflow

mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("smoke-test")

with mlflow.start_run(run_name="platform-verify"):
    mlflow.log_param("component", "sololakehouse-v2")
    mlflow.log_metric("smoke_test_passed", 1.0)

print("✅ MLflow experiment tracking: OK")

# Cell 7 — Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

qc = QdrantClient(host="qdrant", port=6333)
qc.recreate_collection(
    collection_name="smoke_test",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE)
)
collections = [c.name for c in qc.get_collections().collections]
print("Qdrant collections:", collections)
assert "smoke_test" in collections
print("✅ Qdrant: OK")

# Cell 8 — Summary
print("")
print("══════════════════════════════════")
print("  SoloLakehouse v2 Smoke Test")
print("  ALL COMPONENTS: PASSED ✅")
print("══════════════════════════════════")

22. Startup Sequence

The agent must execute these commands in order on the target host.

# Step 1: Clone / copy the repository
cd /opt/sololakehouse   # or wherever you keep the project

# Step 2: Create .env and fill in all CHANGE_ME values
cp .env.template .env
# Edit .env manually or via sed for CI environments

# Step 3: Create all required data directories
mkdir -p data/{npm/{data,letsencrypt},minio,postgres,hive-metastore/warehouse}
mkdir -p data/{mlflow,cloudbeaver,grafana,prometheus,beszel/{hub,agent,socket}}
mkdir -p data/{redis,qdrant,evidently,bentoml,airflow}
mkdir -p data/{spark-events,nessie}
mkdir -p logs/npm

# Step 4: Pull all images in parallel (saves startup time)
docker compose pull

# Step 5: Start the storage tier first (postgres, minio); nessie comes up in Step 7
docker compose up -d postgres minio
sleep 10

# Step 6: Run minio-init and airflow-init
docker compose up -d minio-init airflow-init
sleep 15

# Step 7: Start catalog and compute tier
docker compose up -d nessie hive-metastore
sleep 15

# Step 8: Start all remaining services
docker compose up -d

# Step 9: Run bootstrap script
chmod +x scripts/bootstrap.sh
./scripts/bootstrap.sh

# Step 10: Trigger the first Airflow DAG manually to verify E2E pipeline
docker exec sololakehouse-airflow-web \
  airflow dags trigger fra_taxi_bronze_ingestion
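
Step 10 only submits the run; to watch it from the host until it finishes, one option is to poll the DAG-runs endpoint of the same REST API used in the Section 20 sketch, under the same assumptions (basic-auth API backend enabled; the credential variable names are placeholders):

# Poll the triggered DAG run until it reaches a terminal state.
import os
import time

import requests

AUTH = (os.getenv("AIRFLOW_ADMIN_USER", "admin"), os.environ["AIRFLOW_ADMIN_PASSWORD"])
URL = "http://localhost:8083/api/v1/dags/fra_taxi_bronze_ingestion/dagRuns"

# Step 10 triggers the first-ever run of this DAG, so the single returned
# entry is the one we just started.
for _ in range(240):  # up to ~1 hour
    runs = requests.get(URL, auth=AUTH, timeout=10).json()["dag_runs"]
    state = runs[0]["state"] if runs else "queued"
    print("fra_taxi_bronze_ingestion run state:", state)
    if state in ("success", "failed"):
        break
    time.sleep(15)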

23. Verification Checklist

After running the startup sequence, verify each item manually:

| # | Check | Command / URL | Expected Result |
|---|-------|---------------|-----------------|
| 1 | Postgres databases exist | docker exec sololakehouse-postgres psql -U metastore -c "\l" | Lists: metastore, nessie, mlflow, airflow, feast, marquez |
| 2 | MinIO buckets | http://localhost:9001 (Console) | Buckets: lakehouse, mlflow, airflow-logs, feast-offline |
| 3 | Nessie REST | curl http://nessie:19120/api/v1/config from inside a container on lake_net (e.g. the Trino container) | JSON response with defaultBranch: main |
| 4 | Trino Iceberg catalog | docker exec sololakehouse-trino trino --execute "SHOW CATALOGS" | Includes iceberg |
| 5 | Iceberg schemas | docker exec sololakehouse-trino trino --execute "SHOW SCHEMAS IN iceberg" | Includes bronze, silver, gold |
| 6 | Spark cluster | http://localhost:4040 | 1 worker registered, status ALIVE |
| 7 | MLflow Postgres | http://localhost:5000 | UI loads; no sqlite errors in logs |
| 8 | Airflow | http://localhost:8083 | DAGs fra_taxi_bronze_ingestion and fra_taxi_silver_transform visible |
| 9 | Feast | curl http://localhost:6566/health | {"status": "up"} |
| 10 | Qdrant | http://localhost:6333/dashboard | Dashboard loads |
| 11 | Marquez | http://localhost:3002/api/v1/namespaces | JSON with namespaces list |
| 12 | Evidently | http://localhost:8085 | UI loads |
| 13 | Jupyter | http://localhost:8888 | Lab interface loads with token |
| 14 | Grafana | http://localhost:3000 | Login works; SoloLakehouse dashboard visible |
| 15 | E2E Pipeline | Run smoke test notebook | All cells pass with ✅ |
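
Most of the HTTP rows above can also be swept in one pass from the host; a small convenience sketch (it only checks that each endpoint answers and does not replace the manual checks):

# Quick sweep of the HTTP endpoints from the checklist above.
import requests

CHECKS = {
    "Trino": "http://localhost:8080/v1/info",
    "MLflow": "http://localhost:5000",
    "Airflow": "http://localhost:8083/health",
    "Feast": "http://localhost:6566/health",
    "Qdrant": "http://localhost:6333/dashboard",
    "Marquez": "http://localhost:3002/api/v1/namespaces",
    "Evidently": "http://localhost:8085",
    "Jupyter": "http://localhost:8888",
    "Grafana": "http://localhost:3000",
}

for name, url in CHECKS.items():
    try:
        ok = requests.get(url, timeout=5).status_code < 400
    except requests.RequestException:
        ok = False
    print(f"{'OK  ' if ok else 'FAIL'}  {name:<10} {url}")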

Appendix A — Key Architecture Decisions (ADR Summary)

| Decision | Chosen | Rejected | Reason |
|----------|--------|----------|--------|
| Table Format | Apache Iceberg | Delta Lake | Iceberg is vendor-neutral; Nessie only supports Iceberg |
| Catalog | Project Nessie | Apache Atlas, OpenMetadata | Nessie implements the Iceberg REST spec natively; git-like branching |
| Orchestration | Apache Airflow | Prefect, Dagster | Widest adoption, best provider ecosystem, Databricks Workflows mental model |
| Feature Store | Feast | Hopsworks, custom Delta tables | Feast is purpose-built, has Trino offline store support, Redis online store |
| Model Serving | BentoML | Seldon, Ray Serve | BentoML has native MLflow Registry integration and the simplest Docker deployment |
| Vector DB | Qdrant | Weaviate, Milvus | Best performance/resource ratio for a single-node self-hosted deployment |
| Lineage | Marquez + OpenLineage | Apache Atlas | Marquez is the reference OpenLineage server; Airflow + Spark have first-class plugins |
| ML Monitoring | Evidently | Whylogs, Alibi Detect | Best UI, most active open-source community, Grafana-compatible metrics export |

End of SoloLakehouse v2 Build Specification