SoloLakehouse v2 - Complete Build Specification
Purpose: This document is a fully executable specification for a code agent to build SoloLakehouse v2 from scratch. Every file that must be created is listed with its exact path and complete contents. Follow the sections in order. Do not skip steps.
Base: The existing `docker-compose.yml` (v1) contains: Nginx Proxy Manager, MinIO, Postgres 16, Hive Metastore 3.1.3, Trino 455, CloudBeaver, MLflow (sqlite), Beszel, Prometheus, Grafana, Node Exporter, cAdvisor, MkDocs. This spec upgrades and extends that base into a complete Databricks-equivalent open-source AI Lakehouse.
Table of Contents
- Architecture Overview
- Repository Layout
- Environment Variables — .env
- Postgres Initialization
- Nessie Catalog Configuration
- Trino Configuration
- Spark Configuration
- Airflow DAGs and Configuration
- MLflow — Upgrade to Postgres Backend
- Feast Feature Store
- BentoML Model Serving
- Qdrant Vector Database
- Marquez — Data Lineage
- Evidently AI — ML Monitoring
- Jupyter Lab — Development Environment
- Prometheus Scrape Configuration
- Grafana Provisioning
- Complete docker-compose.yml v2
- MinIO Bucket Bootstrap Script
- Airflow Bootstrap DAG — Frankfurt Airport Taxi Pipeline
- Sample Notebook — End-to-End Platform Smoke Test
- Startup Sequence
- Verification Checklist
1. Architecture Overview
┌─────────────────────────────────────────────────────────────────────────┐
│ SoloLakehouse v2 │
│ │
│ INGESTION STORAGE CATALOG QUERY │
│ ───────── ─────── ─────── ───── │
│ Airflow ──────▶ MinIO (S3) ◀──── Nessie ◀──── Trino │
│ (ETL/ELT) (Iceberg (REST (SQL) │
│ Parquet) Catalog) │
│ │
│ COMPUTE ML PLATFORM AI LAYER GOVERNANCE │
│ ─────── ─────────── ──────── ────────── │
│ Spark ──────▶ MLflow ──▶ Qdrant ──▶ Marquez │
│ (Batch/Stream) (Tracking + (Vector (Lineage) │
│ Registry) Search) │
│ Feast BentoML Evidently │
│ (Feature Store) (Serving) (ML Monitor) │
│ │
│ OBSERVABILITY DEVELOPER UX │
│ ───────────── ──────────── │
│ Prometheus Jupyter Lab │
│ Grafana CloudBeaver │
│ Beszel MkDocs │
│ Node Exporter │
│ cAdvisor │
│ │
│ REVERSE PROXY: Nginx Proxy Manager (NPM) — all services via subdomain │
└─────────────────────────────────────────────────────────────────────────┘
Service Port Map
| Service | Internal Port | Localhost Binding | NPM Subdomain (example) |
|---|---|---|---|
| NPM Admin | 81 | 127.0.0.1:81 | — |
| MinIO API | 9000 | expose only | — |
| MinIO Console | 9001 | expose only | solo-minio.sololake.space |
| Trino | 8080 | 127.0.0.1:8080 | solo-trino.sololake.space |
| CloudBeaver | 8978 | expose only | solo-dbeaver.sololake.space |
| MLflow | 5000 | 127.0.0.1:5000 | solo-mlflow.sololake.space |
| Airflow | 8080 | 127.0.0.1:8083 | solo-airflow.sololake.space |
| Spark Master UI | 8080 | 127.0.0.1:4040 | solo-spark.sololake.space |
| Jupyter Lab | 8888 | 127.0.0.1:8888 | solo-jupyter.sololake.space |
| Feast | 6566 | 127.0.0.1:6566 | solo-feast.sololake.space |
| BentoML | 3000 | 127.0.0.1:3001 | solo-bentoml.sololake.space |
| Qdrant REST | 6333 | 127.0.0.1:6333 | solo-qdrant.sololake.space |
| Marquez API | 5000 | 127.0.0.1:3002 | — |
| Marquez Web UI | 3000 | 127.0.0.1:3003 | solo-marquez.sololake.space |
| Evidently | 8085 | 127.0.0.1:8085 | solo-evidently.sololake.space |
| Nessie | 19120 | expose only | solo-nessie.sololake.space |
| Grafana | 3000 | 127.0.0.1:3000 | solo-grafana.sololake.space |
| Prometheus | 9090 | 127.0.0.1:9090 | solo-prometheus.sololake.space |
| Beszel | 8090 | 127.0.0.1:8090 | solo-beszel.sololake.space |
| MkDocs | 8000 | 127.0.0.1:8000 | solo-docs.sololake.space |
2. Repository Layout
The agent must create the following directory and file structure before writing any file contents.
sololakehouse/
├── .env # All secrets and config vars
├── docker-compose.yml # Complete v2 compose file
├── scripts/
│ └── bootstrap.sh # One-shot init script
│
├── postgres/
│ ├── pg_hba.conf # KEEP existing file unchanged
│ └── init/
│ └── 01_init_databases.sql # CREATE all required databases
│
├── nessie/
│ └── application.properties # Nessie Quarkus config (optional override)
│
├── trino/
│ └── etc/
│ ├── config.properties
│ ├── jvm.config
│ ├── node.properties
│ └── catalog/
│ ├── iceberg.properties # NEW: Iceberg + Nessie REST catalog
│ └── tpch.properties # Keep for testing
│
├── spark/
│ └── conf/
│ ├── spark-defaults.conf # Iceberg + MinIO S3A config
│ └── log4j2.properties
│
├── airflow/
│ ├── dags/
│ │ ├── fra_taxi_bronze_ingestion.py # Demo pipeline DAG
│ │ └── fra_taxi_silver_transform.py
│ └── plugins/
│ └── openlineage_plugin.py # Auto-inject lineage to Marquez
│
├── feast/
│ ├── feature_store.yaml # Feast project config
│ └── features/
│ └── taxi_features.py # Feature definitions
│
├── bentoml/
│ └── services/
│ └── taxi_predictor.py # Example BentoML service
│
├── marquez/
│ └── marquez.yml # Marquez server config
│
├── evidently/
│ └── config.yml # Evidently service config
│
├── monitoring/
│ └── prometheus/
│ └── prometheus.yml # UPDATED: adds all new scrape targets
│
├── grafana/
│ └── provisioning/
│ ├── datasources/
│ │ └── prometheus.yml
│ └── dashboards/
│ ├── dashboard.yml
│ └── sololakehouse_overview.json
│
├── notebooks/
│   ├── 00_platform_smoke_test.ipynb  # End-to-end verification notebook
│   └── startup.sh                    # Extra pip installs for Jupyter (see Section 15)
│
├── data/ # Runtime data (gitignored)
└── logs/ # Runtime logs (gitignored)
3. Environment Variables — .env
File: .env
The agent must create this file. All `${VAR}` references in `docker-compose.yml` resolve here. Replace placeholder values with real secrets before running.
# ============================================================
# SoloLakehouse v2 — Environment Configuration
# ============================================================
# --- Object Storage (MinIO) ---
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=CHANGE_ME_minio_secret
# --- Postgres Master Password ---
# Used for ALL database users (nessie, mlflow, airflow, feast, marquez)
PG_PASSWORD=CHANGE_ME_postgres_secret
# --- Grafana ---
GRAFANA_ADMIN_PASSWORD=CHANGE_ME_grafana_secret
# --- Airflow ---
# Generate: python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
AIRFLOW_FERNET_KEY=CHANGE_ME_run_above_command
# Generate: openssl rand -hex 32
AIRFLOW_SECRET_KEY=CHANGE_ME_run_above_command
AIRFLOW_ADMIN_PASSWORD=CHANGE_ME_airflow_secret
# --- Jupyter ---
JUPYTER_TOKEN=CHANGE_ME_jupyter_token
# --- Beszel (keep existing values) ---
BESZEL_TOKEN=a69c9561-adc8-4513-a4fa-92b109a80b6c
BESZEL_KEY=ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJisOmj8sQ1+uOmGjfF2mV1VEI5NurR4DI1BNXU3by2N
# --- Domain (used in MLflow allowed-hosts and NPM) ---
BASE_DOMAIN=sololake.space
4. Postgres Initialization
File: postgres/init/01_init_databases.sql
Postgres runs this file automatically on first container start (mount as `/docker-entrypoint-initdb.d/`). Creates all required databases and users with the single `PG_PASSWORD`. The existing `metastore` database is created by the Postgres `POSTGRES_DB` env var — do NOT recreate it here.
-- ============================================================
-- SoloLakehouse v2 — Database Initialization
-- Runs once on first `docker compose up`
-- ============================================================
-- Nessie catalog metadata
CREATE DATABASE nessie;
CREATE USER nessie WITH PASSWORD 'CHANGE_ME_postgres_secret';
GRANT ALL PRIVILEGES ON DATABASE nessie TO nessie;
-- MLflow experiment tracking (replaces sqlite)
CREATE DATABASE mlflow;
CREATE USER mlflow WITH PASSWORD 'CHANGE_ME_postgres_secret';
GRANT ALL PRIVILEGES ON DATABASE mlflow TO mlflow;
-- Airflow metadata
CREATE DATABASE airflow;
CREATE USER airflow WITH PASSWORD 'CHANGE_ME_postgres_secret';
GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;
-- Feast feature registry
CREATE DATABASE feast;
CREATE USER feast WITH PASSWORD 'CHANGE_ME_postgres_secret';
GRANT ALL PRIVILEGES ON DATABASE feast TO feast;
-- Marquez data lineage
CREATE DATABASE marquez;
CREATE USER marquez WITH PASSWORD 'CHANGE_ME_postgres_secret';
GRANT ALL PRIVILEGES ON DATABASE marquez TO marquez;
IMPORTANT: The agent must also update `postgres/pg_hba.conf` to allow `md5` auth for all new users. Add the following lines to the existing `pg_hba.conf`, preserving the original content:
# SoloLakehouse v2 additional users
host nessie nessie 0.0.0.0/0 md5
host mlflow mlflow 0.0.0.0/0 md5
host airflow airflow 0.0.0.0/0 md5
host feast feast 0.0.0.0/0 md5
host marquez marquez 0.0.0.0/0 md5
5. Nessie Catalog Configuration
Project Nessie is the Git-like Iceberg REST catalog that replaces Hive Metastore. It stores metadata in Postgres and exposes an Iceberg REST endpoint consumed by Trino and Spark.
File: nessie/application.properties
# Nessie version store — backed by Postgres via JDBC
nessie.version.store.type=JDBC
quarkus.datasource.jdbc.url=jdbc:postgresql://postgres:5432/nessie
quarkus.datasource.username=nessie
quarkus.datasource.password=CHANGE_ME_postgres_secret
# HTTP
quarkus.http.port=19120
quarkus.http.host=0.0.0.0
# CORS — allow Trino, Spark, Jupyter from within docker network
quarkus.http.cors=true
quarkus.http.cors.origins=*
# Logging
quarkus.log.level=INFO
quarkus.log.category."org.projectnessie".level=INFO
Note: Nessie 0.95.0+ reads `application.properties` from the classpath or from a volume-mounted `/deployments/config/application.properties`. Mount path: `./nessie/application.properties:/deployments/config/application.properties`
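A quick way to confirm the catalog is up once the stack is running: the sketch below (run from the Jupyter container, which is on the same Docker network; it assumes the requests package is available there) calls Nessie's native config endpoint and prints the default branch.
# Hedged sketch — probe the Nessie REST API from inside the Docker network.
import requests

resp = requests.get("http://nessie:19120/api/v1/config", timeout=5)
resp.raise_for_status()
print("Nessie default branch:", resp.json().get("defaultBranch"))  # expected: "main"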
6. Trino Configuration
6.1 trino/etc/config.properties
Keep the existing file. No changes required unless you want to increase memory.
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=4GB
query.max-memory-per-node=2GB
discovery.uri=http://localhost:8080
6.2 trino/etc/jvm.config
-server
-Xmx6G
-XX:InitialRAMPercentage=80
-XX:MaxRAMPercentage=80
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:ReservedCodeCacheSize=512M
-Djdk.attach.allowAttachSelf=true
-Dfile.encoding=UTF-8
6.3 trino/etc/node.properties
node.environment=sololakehouse
node.id=trino-node-1
node.data-dir=/data/trino
6.4 trino/etc/catalog/iceberg.properties ← NEW FILE
# Iceberg connector backed by Nessie REST catalog
connector.name=iceberg
# Catalog type: REST (Nessie serves the Iceberg REST spec under /iceberg;
# the native Nessie API under /api/v1 and /api/v2 is a different endpoint)
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://nessie:19120/iceberg
iceberg.rest-catalog.warehouse=s3://lakehouse/
# S3-compatible storage via MinIO
fs.native-s3.enabled=true
s3.endpoint=http://minio:9000
s3.region=us-east-1
s3.path-style-access=true
s3.aws-access-key=${ENV:MINIO_ROOT_USER}
s3.aws-secret-key=${ENV:MINIO_ROOT_PASSWORD}
# Performance
iceberg.file-format=PARQUET
iceberg.compression-codec=ZSTD
6.5 trino/etc/catalog/tpch.properties — keep existing
connector.name=tpch
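To confirm that the iceberg catalog from 6.4 resolves through Nessie once Trino is up, the sketch below uses the trino Python client (installed in the Jupyter container by notebooks/startup.sh); the user name is arbitrary.
# Hedged sketch — list schemas in the Nessie-backed iceberg catalog via Trino.
import trino

conn = trino.dbapi.connect(host="trino", port=8080, user="setup-check", catalog="iceberg")
cur = conn.cursor()
cur.execute("SHOW SCHEMAS IN iceberg")
print([row[0] for row in cur.fetchall()])   # expect information_schema plus bronze/silver/gold after bootstrap
conn.close()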
7. Spark Configuration
7.1 spark/conf/spark-defaults.conf
# ── Iceberg Runtime ──────────────────────────────────────────────────────────
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.uri=http://nessie:19120/api/v1
spark.sql.catalog.nessie.ref=main
spark.sql.catalog.nessie.warehouse=s3://lakehouse/
spark.sql.catalog.nessie.authentication.type=NONE
# ── MinIO / S3A ──────────────────────────────────────────────────────────────
spark.hadoop.fs.s3a.endpoint=http://minio:9000
# NOTE: Spark does not expand ${VAR} in conf files — either substitute real values
# here at deploy time or drop these two lines so the AWS_ACCESS_KEY_ID /
# AWS_SECRET_ACCESS_KEY env vars set in docker-compose are used instead.
spark.hadoop.fs.s3a.access.key=${MINIO_ROOT_USER}
spark.hadoop.fs.s3a.secret.key=${MINIO_ROOT_PASSWORD}
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.connection.ssl.enabled=false
# ── MLflow Integration ───────────────────────────────────────────────────────
spark.mlflow.trackingUri=http://mlflow:5000
# ── OpenLineage (Marquez) ────────────────────────────────────────────────────
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type=http
spark.openlineage.transport.url=http://marquez:5000
spark.openlineage.namespace=sololakehouse
# ── JARs (must be present in spark image or mounted) ────────────────────────
# These are fetched by the custom Spark image defined in docker-compose
spark.jars.packages=org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,\
org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.95.0,\
org.apache.hadoop:hadoop-aws:3.3.4,\
com.amazonaws:aws-java-sdk-bundle:1.12.262,\
io.openlineage:openlineage-spark_2.12:1.22.0
7.2 spark/conf/log4j2.properties
rootLogger.level=WARN
rootLogger.appenderRef.console.ref=ConsoleAppender
appender.console.type=Console
appender.console.name=ConsoleAppender
appender.console.layout.type=PatternLayout
appender.console.layout.pattern=%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n
logger.iceberg.name=org.apache.iceberg
logger.iceberg.level=INFO
logger.spark.name=org.apache.spark
logger.spark.level=WARN
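The conf file above is mounted into the Spark and Jupyter containers. For sessions that do not pick it up (for example, an ad-hoc notebook session built without the mounted defaults), the same wiring can be passed programmatically. The sketch below mirrors the settings in 7.1; the package versions and the AWS_* environment variables are assumptions carried over from that file and from docker-compose.
# Hedged sketch — build a SparkSession wired to Nessie + Iceberg + MinIO explicitly.
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nessie-iceberg-session")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,"
            "org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.95.0,"
            "org.apache.hadoop:hadoop-aws:3.3.4,"
            "com.amazonaws:aws-java-sdk-bundle:1.12.262")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "s3://lakehouse/")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)
spark.sql("SHOW NAMESPACES IN nessie").show()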
8. Airflow DAGs and Configuration
8.1 Airflow Environment Notes
Airflow runs with LocalExecutor backed by Postgres. The webserver and scheduler are separate containers sharing the same DAG volume mount. Both containers run from the same apache/airflow:2.10.0 image with additional pip packages installed via environment variable.
Required additional pip packages (set via _PIP_ADDITIONAL_REQUIREMENTS env var):
apache-airflow-providers-apache-spark==4.9.0
apache-airflow-providers-trino==5.7.0
openlineage-airflow==1.22.0
boto3==1.34.0
pyiceberg[glue,s3]==0.7.1
8.2 airflow/dags/fra_taxi_bronze_ingestion.py
This DAG simulates the Bronze ingestion layer for the Frankfurt Airport taxi time prediction project. In production, replace the synthetic data generator with real ADS-B feed or CSV uploads to MinIO.
"""
DAG: fra_taxi_bronze_ingestion
Layer: Bronze (raw ingestion)
Schedule: Daily at 02:00 UTC
Purpose: Ingest raw Frankfurt Airport taxi records into Iceberg Bronze table.
Lineage: Emits OpenLineage events to Marquez.
"""
from __future__ import annotations
import json
import random
from datetime import datetime, timedelta
import boto3
from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator
# ── Constants ────────────────────────────────────────────────────────────────
MINIO_ENDPOINT = "http://minio:9000"
# Jinja ("{{ var.value.* }}") is not rendered in plain module constants, so read
# the Airflow Variables directly (seeded by scripts/bootstrap.sh or the Variables UI).
MINIO_ACCESS_KEY = Variable.get("minio_access_key", default_var="")
MINIO_SECRET_KEY = Variable.get("minio_secret_key", default_var="")
BRONZE_BUCKET = "lakehouse"
BRONZE_PREFIX = "bronze/fra_taxi_raw/"
DEFAULT_ARGS = {
"owner": "sololakehouse",
"retries": 2,
"retry_delay": timedelta(minutes=5),
"email_on_failure": False,
}
def generate_synthetic_taxi_records(**context) -> None:
"""
Generate synthetic FRA taxi time records and upload as newline-delimited JSON to MinIO.
Each record represents one aircraft taxi event from gate to runway or reverse.
"""
execution_date = context["ds"]
n_records = 200 # ~200 movements per day at FRA
gates = ["A01","A10","A22","B12","B20","C04","Z12","Z14","Z60"]
runways = ["07C","07L","07R","25C","25L","25R"]
records = []
for i in range(n_records):
record = {
"event_id": f"FRA-{execution_date}-{i:04d}",
"execution_date": execution_date,
"aircraft_type": random.choice(["A320","B737","A380","B777","E190","A350"]),
"gate": random.choice(gates),
"runway": random.choice(runways),
"taxi_direction": random.choice(["outbound", "inbound"]),
"taxi_time_sec": random.randint(120, 1800),
"wind_speed_kt": round(random.uniform(0, 40), 1),
"wind_dir_deg": random.randint(0, 359),
"temperature_c": round(random.uniform(-15, 38), 1),
"visibility_m": random.choice([800, 1500, 3000, 5000, 9999]),
"hour_utc": random.randint(5, 23),
"ingested_at": datetime.utcnow().isoformat(),
}
records.append(json.dumps(record))
payload = "\n".join(records)
key = f"{BRONZE_PREFIX}{execution_date}/records.jsonl"
s3 = boto3.client(
"s3",
endpoint_url=MINIO_ENDPOINT,
aws_access_key_id=MINIO_ACCESS_KEY,
aws_secret_access_key=MINIO_SECRET_KEY,
)
s3.put_object(Bucket=BRONZE_BUCKET, Key=key, Body=payload.encode("utf-8"))
print(f"[bronze] Uploaded {n_records} records to s3://{BRONZE_BUCKET}/{key}")
def register_iceberg_table_if_missing(**context) -> None:
"""
Create Iceberg Bronze table in Nessie catalog via Trino if it does not already exist.
This is idempotent — safe to run on every DAG execution.
"""
import trino
conn = trino.dbapi.connect(
host="trino",
port=8080,
user="airflow",
catalog="iceberg",
schema="bronze",
)
cur = conn.cursor()
cur.execute("CREATE SCHEMA IF NOT EXISTS iceberg.bronze WITH (location = 's3://lakehouse/bronze/')")
cur.execute("""
CREATE TABLE IF NOT EXISTS iceberg.bronze.fra_taxi_raw (
event_id VARCHAR,
execution_date VARCHAR,
aircraft_type VARCHAR,
gate VARCHAR,
runway VARCHAR,
taxi_direction VARCHAR,
taxi_time_sec INTEGER,
wind_speed_kt DOUBLE,
wind_dir_deg INTEGER,
temperature_c DOUBLE,
visibility_m INTEGER,
hour_utc INTEGER,
ingested_at VARCHAR
)
WITH (
format = 'PARQUET',
partitioning = ARRAY['execution_date'],
location = 's3://lakehouse/bronze/fra_taxi_raw/'
)
""")
print("[bronze] Iceberg table iceberg.bronze.fra_taxi_raw is ready.")
conn.close()
def load_jsonl_to_iceberg(**context) -> None:
"""
Read the JSONL file from MinIO and INSERT rows into the Iceberg Bronze table via Trino.
"""
import trino
import boto3
execution_date = context["ds"]
key = f"{BRONZE_PREFIX}{execution_date}/records.jsonl"
s3 = boto3.client(
"s3",
endpoint_url=MINIO_ENDPOINT,
aws_access_key_id=MINIO_ACCESS_KEY,
aws_secret_access_key=MINIO_SECRET_KEY,
)
body = s3.get_object(Bucket=BRONZE_BUCKET, Key=key)["Body"].read().decode()
records = [json.loads(line) for line in body.splitlines() if line.strip()]
conn = trino.dbapi.connect(host="trino", port=8080, user="airflow",
catalog="iceberg", schema="bronze")
cur = conn.cursor()
batch = []
for r in records:
batch.append(
f"('{r['event_id']}','{r['execution_date']}','{r['aircraft_type']}',"
f"'{r['gate']}','{r['runway']}','{r['taxi_direction']}',"
f"{r['taxi_time_sec']},{r['wind_speed_kt']},{r['wind_dir_deg']},"
f"{r['temperature_c']},{r['visibility_m']},{r['hour_utc']},'{r['ingested_at']}')"
)
values_sql = ",\n".join(batch)
cur.execute(f"""
INSERT INTO iceberg.bronze.fra_taxi_raw VALUES
{values_sql}
""")
print(f"[bronze] Inserted {len(records)} rows into iceberg.bronze.fra_taxi_raw "
f"partition execution_date={execution_date}")
conn.close()
with DAG(
dag_id="fra_taxi_bronze_ingestion",
default_args=DEFAULT_ARGS,
description="Bronze layer: ingest FRA taxi raw records into Iceberg",
schedule_interval="0 2 * * *",
start_date=datetime(2024, 1, 1),
catchup=False,
tags=["bronze", "fra", "taxi", "sololakehouse"],
) as dag:
t1 = PythonOperator(task_id="generate_records", python_callable=generate_synthetic_taxi_records)
t2 = PythonOperator(task_id="ensure_iceberg_table", python_callable=register_iceberg_table_if_missing)
t3 = PythonOperator(task_id="load_to_iceberg", python_callable=load_jsonl_to_iceberg)
t1 >> t2 >> t3
8.3 airflow/dags/fra_taxi_silver_transform.py
"""
DAG: fra_taxi_silver_transform
Layer: Silver (cleaned, feature-enriched)
Schedule: Daily at 04:00 UTC (after bronze)
Purpose: Clean bronze data, add derived features, write to Silver Iceberg table.
"""
from __future__ import annotations
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
DEFAULT_ARGS = {
"owner": "sololakehouse",
"retries": 2,
"retry_delay": timedelta(minutes=5),
"email_on_failure": False,
}
SILVER_DDL = """
CREATE TABLE IF NOT EXISTS iceberg.silver.fra_taxi_features (
event_id VARCHAR,
execution_date VARCHAR,
aircraft_type VARCHAR,
gate VARCHAR,
runway VARCHAR,
taxi_direction VARCHAR,
taxi_time_sec INTEGER,
taxi_time_min DOUBLE,
wind_speed_kt DOUBLE,
wind_dir_deg INTEGER,
temperature_c DOUBLE,
visibility_m INTEGER,
hour_utc INTEGER,
is_peak_hour BOOLEAN,
is_low_visibility BOOLEAN,
crosswind_component DOUBLE,
ingested_at VARCHAR,
transformed_at VARCHAR
)
WITH (
format = 'PARQUET',
partitioning = ARRAY['execution_date'],
location = 's3://lakehouse/silver/fra_taxi_features/'
)
"""
TRANSFORM_SQL = """
INSERT INTO iceberg.silver.fra_taxi_features
SELECT
event_id,
execution_date,
aircraft_type,
gate,
runway,
taxi_direction,
taxi_time_sec,
CAST(taxi_time_sec AS DOUBLE) / 60.0 AS taxi_time_min,
wind_speed_kt,
wind_dir_deg,
temperature_c,
visibility_m,
hour_utc,
    (hour_utc BETWEEN 6 AND 10 OR hour_utc BETWEEN 16 AND 20) AS is_peak_hour,
visibility_m < 1500 AS is_low_visibility,
-- Crosswind component approximation (simplified)
ABS(wind_speed_kt * SIN(RADIANS(wind_dir_deg - 70))) AS crosswind_component,
ingested_at,
CAST(NOW() AS VARCHAR) AS transformed_at
FROM iceberg.bronze.fra_taxi_raw
WHERE execution_date = '{execution_date}'
AND taxi_time_sec > 0
AND taxi_time_sec < 7200
"""
def ensure_silver_table(**context) -> None:
import trino
conn = trino.dbapi.connect(host="trino", port=8080, user="airflow",
catalog="iceberg", schema="silver")
cur = conn.cursor()
cur.execute("CREATE SCHEMA IF NOT EXISTS iceberg.silver WITH (location = 's3://lakehouse/silver/')")
cur.execute(SILVER_DDL)
conn.close()
def transform_to_silver(**context) -> None:
import trino
execution_date = context["ds"]
conn = trino.dbapi.connect(host="trino", port=8080, user="airflow",
catalog="iceberg", schema="silver")
cur = conn.cursor()
cur.execute(TRANSFORM_SQL.format(execution_date=execution_date))
conn.close()
print(f"[silver] Transformed records for {execution_date}")
with DAG(
dag_id="fra_taxi_silver_transform",
default_args=DEFAULT_ARGS,
description="Silver layer: clean + feature engineer FRA taxi data",
schedule_interval="0 4 * * *",
start_date=datetime(2024, 1, 1),
catchup=False,
tags=["silver", "fra", "taxi", "sololakehouse"],
) as dag:
t1 = PythonOperator(task_id="ensure_silver_table", python_callable=ensure_silver_table)
t2 = PythonOperator(task_id="transform_to_silver", python_callable=transform_to_silver)
t1 >> t2
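Once both DAGs have completed at least one run, a quick row count against the Silver table confirms the transform landed data. A minimal sketch using the trino Python client; the date and user name below are illustrative.
# Hedged sketch — verify the Silver table has rows for a given partition.
import trino

conn = trino.dbapi.connect(host="trino", port=8080, user="verify", catalog="iceberg", schema="silver")
cur = conn.cursor()
cur.execute("SELECT count(*) FROM iceberg.silver.fra_taxi_features WHERE execution_date = '2024-01-01'")
print("silver rows:", cur.fetchone()[0])
conn.close()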
9. MLflow — Upgrade to Postgres Backend
No separate config file is needed. The upgrade is done entirely via the docker-compose service definition (see Section 18). The key changes are:
- `--backend-store-uri` switches from `sqlite:////mlflow/mlflow.db` to `postgresql://mlflow:${PG_PASSWORD}@postgres:5432/mlflow`.
- The existing `./data/mlflow` volume is kept, but the sqlite file is no longer used.
- `pip install psycopg2-binary boto3` is added to the startup command.
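For reference, the sketch below logs a toy model against the Postgres-backed server and registers it under the name the BentoML service in Section 11 expects. The features, model choice, and experiment name are placeholders, not part of the real training pipeline.
# Hedged sketch — log and register a placeholder model in the MLflow registry.
import mlflow
import numpy as np
from sklearn.ensemble import RandomForestRegressor

mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("fra-taxi-prediction")

X = np.random.rand(200, 10)                 # placeholder feature matrix
y = np.random.randint(120, 1800, size=200)  # placeholder taxi times (sec)

with mlflow.start_run(run_name="baseline-rf"):
    model = RandomForestRegressor(n_estimators=50).fit(X, y)
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="fra_taxi_predictor",  # name referenced in Section 11
    )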
10. Feast Feature Store
10.1 feast/feature_store.yaml
project: sololakehouse
registry:
registry_type: sql
path: postgresql+psycopg2://feast:CHANGE_ME_postgres_secret@postgres:5432/feast
cache_ttl_seconds: 60
provider: local
online_store:
type: redis
connection_string: "redis:6379,db=0"
offline_store:
type: trino
host: trino
port: 8080
catalog: iceberg
connector:
type: iceberg
entity_key_serialization_version: 2
10.2 feast/features/taxi_features.py
"""
Feast Feature Definitions for FRA Taxi Time Prediction.
Run `feast apply` from the /feast directory to register these features.
"""
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float64, Int64, Bool, String
# ── Entity ───────────────────────────────────────────────────────────────────
taxi_event = Entity(
name="event_id",
description="Unique identifier for a single aircraft taxi event at FRA",
)
# ── Source (Iceberg table via Trino offline store) ────────────────────────────
fra_silver_source = FileSource(
# In production: use TrinoSource pointing to iceberg.silver.fra_taxi_features
# For local development, this points to a parquet export
path="s3://lakehouse/silver/fra_taxi_features/",
timestamp_field="ingested_at",
s3_endpoint_override="http://minio:9000",
)
# ── Feature View ──────────────────────────────────────────────────────────────
fra_taxi_feature_view = FeatureView(
name="fra_taxi_features",
entities=[taxi_event],
ttl=timedelta(days=90),
schema=[
Field(name="taxi_time_sec", dtype=Int64),
Field(name="taxi_time_min", dtype=Float64),
Field(name="wind_speed_kt", dtype=Float64),
Field(name="wind_dir_deg", dtype=Int64),
Field(name="temperature_c", dtype=Float64),
Field(name="visibility_m", dtype=Int64),
Field(name="hour_utc", dtype=Int64),
Field(name="is_peak_hour", dtype=Bool),
Field(name="is_low_visibility", dtype=Bool),
Field(name="crosswind_component", dtype=Float64),
Field(name="aircraft_type", dtype=String),
Field(name="taxi_direction", dtype=String),
],
source=fra_silver_source,
tags={"team": "ml-platform", "project": "fra-taxi-prediction"},
)
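After `feast apply` and a `feast materialize-incremental` run have pushed data into Redis, features can be read back through the Python SDK. A minimal sketch, assuming it is run from the /feast directory (where feature_store.yaml lives); the entity value is illustrative.
# Hedged sketch — fetch online features from the Redis store via the Feast SDK.
from feast import FeatureStore

store = FeatureStore(repo_path=".")          # reads feature_store.yaml in the cwd
online = store.get_online_features(
    features=[
        "fra_taxi_features:wind_speed_kt",
        "fra_taxi_features:is_peak_hour",
    ],
    entity_rows=[{"event_id": "FRA-2024-01-01-0001"}],
).to_dict()
print(online)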
11. BentoML Model Serving
11.1 bentoml/services/taxi_predictor.py
"""
BentoML Service: FRA Taxi Time Predictor
Loads the latest production model from MLflow Model Registry and serves predictions.
Deploy with:
bentoml serve taxi_predictor:FRATaxiPredictor --port 3000
"""
from __future__ import annotations
import numpy as np
import pandas as pd
import bentoml
from bentoml.io import JSON
from pydantic import BaseModel
class TaxiPredictInput(BaseModel):
wind_speed_kt: float
wind_dir_deg: int
temperature_c: float
visibility_m: int
hour_utc: int
is_peak_hour: bool
is_low_visibility: bool
crosswind_component: float
aircraft_type: str # encoded as int in model; mapping below
taxi_direction: str # "inbound" -> 0, "outbound" -> 1
class TaxiPredictOutput(BaseModel):
predicted_taxi_time_sec: float
predicted_taxi_time_min: float
model_version: str
AIRCRAFT_TYPE_MAP = {"A320": 0, "B737": 1, "A380": 2, "B777": 3, "E190": 4, "A350": 5}
TAXI_DIRECTION_MAP = {"inbound": 0, "outbound": 1}
# ── Load model from MLflow Model Registry ────────────────────────────────────
# The model must first be registered via MLflow UI: Models > fra_taxi_predictor > Production
runner = bentoml.mlflow.get("fra_taxi_predictor:latest").to_runner()
svc = bentoml.Service("FRATaxiPredictor", runners=[runner])
@svc.api(input=JSON(pydantic_model=TaxiPredictInput),
output=JSON(pydantic_model=TaxiPredictOutput))
async def predict(data: TaxiPredictInput) -> TaxiPredictOutput:
features = pd.DataFrame([{
"wind_speed_kt": data.wind_speed_kt,
"wind_dir_deg": data.wind_dir_deg,
"temperature_c": data.temperature_c,
"visibility_m": data.visibility_m,
"hour_utc": data.hour_utc,
"is_peak_hour": int(data.is_peak_hour),
"is_low_visibility": int(data.is_low_visibility),
"crosswind_component": data.crosswind_component,
"aircraft_type_enc": AIRCRAFT_TYPE_MAP.get(data.aircraft_type, -1),
"taxi_direction_enc": TAXI_DIRECTION_MAP.get(data.taxi_direction, 0),
}])
result = await runner.async_run(features)
pred_sec = float(np.array(result).flatten()[0])
return TaxiPredictOutput(
predicted_taxi_time_sec=pred_sec,
predicted_taxi_time_min=pred_sec / 60.0,
model_version="latest",
)
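Note that bentoml.mlflow.get() reads from the local BentoML model store, so the MLflow-registered model has to be imported into that store once before the service can start. A hedged sketch — the model URI assumes version 1 of the registry entry from Section 9; adjust it to whatever version or stage you actually registered.
# Hedged sketch — copy the registered MLflow model into the BentoML model store.
import bentoml

bentoml.mlflow.import_model(
    "fra_taxi_predictor",                      # name used by bentoml.mlflow.get() above
    model_uri="models:/fra_taxi_predictor/1",  # assumed registry version — adjust as needed
)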
12. Qdrant Vector Database
No custom config file is needed for basic Qdrant deployment. All configuration is done via environment variables in docker-compose. The data volume is ./data/qdrant.
Qdrant REST API will be available at http://localhost:6333. Qdrant gRPC API will be available at http://localhost:6334.
To create a collection for SoloLakehouse RAG (run after startup):
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
client = QdrantClient(host="localhost", port=6333)
client.recreate_collection(
collection_name="fra_sop_documents",
vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
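To exercise the collection, the sketch below upserts two placeholder vectors and runs a similarity search. IDs, payloads, and the 1024-dimensional vectors are illustrative only.
# Hedged sketch — upsert placeholder points and query the collection.
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(host="localhost", port=6333)
client.upsert(
    collection_name="fra_sop_documents",
    points=[
        PointStruct(id=1, vector=[0.01] * 1024, payload={"doc": "deicing_sop.pdf"}),
        PointStruct(id=2, vector=[0.02] * 1024, payload={"doc": "low_visibility_ops.pdf"}),
    ],
)
hits = client.search(collection_name="fra_sop_documents", query_vector=[0.015] * 1024, limit=1)
print(hits[0].payload)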
13. Marquez — Data Lineage
13.1 marquez/marquez.yml
server:
applicationConnectors:
- type: http
port: 5000
adminConnectors:
- type: http
port: 5001
db:
driverClass: org.postgresql.Driver
url: jdbc:postgresql://postgres:5432/marquez
user: marquez
password: CHANGE_ME_postgres_secret
migrateOnStartup: true
tags: []
13.2 Airflow OpenLineage Integration
Add the following to the Airflow airflow-webserver and airflow-scheduler environment in docker-compose:
OPENLINEAGE_URL: http://marquez:5000
OPENLINEAGE_NAMESPACE: sololakehouse
AIRFLOW__LINEAGE__BACKEND: openlineage.lineage_backend.OpenLineageBackend
No plugin file is needed — the openlineage-airflow pip package handles integration automatically.
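Once a DAG has run with these settings, the lineage graph can be inspected through the Marquez REST API. A minimal sketch from inside the Docker network (use http://localhost:3002 from the host), assuming the requests package is available.
# Hedged sketch — list Marquez namespaces; "sololakehouse" should appear after the first DAG run.
import requests

payload = requests.get("http://marquez:5000/api/v1/namespaces", timeout=5).json()
print([ns["name"] for ns in payload["namespaces"]])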
14. Evidently AI — ML Monitoring
14.1 evidently/config.yml
# Evidently UI Server configuration
service:
host: 0.0.0.0
port: 8085
# Workspace directory (mapped to ./data/evidently inside container)
workspace_path: /app/workspace
# Enable all report types
ui:
show_all_reports: true
14.2 How to Add a Monitor (run after first model deployment)
# Run this script once to set up the FRA Taxi monitoring project
# Execute from within the jupyter container or locally with proper env vars
from evidently.ui.remote import RemoteWorkspace
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, RegressionPreset
# Workspace() expects a local directory; RemoteWorkspace connects to the running UI service.
ws = RemoteWorkspace("http://evidently:8085")
project = ws.create_project("FRA Taxi Time Predictor")
project.description = "Monitor data drift and regression quality for taxi time model"
project.save()
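Continuing the script above, a drift report can be computed and attached to the project so it appears in the Evidently UI. The two dataframes below are placeholders (use reference and current slices of the Silver table in practice), and the snippet assumes the evidently 0.4 workspace API.
# Hedged sketch — attach a placeholder data-drift report to the project.
import pandas as pd

reference = pd.DataFrame({"taxi_time_sec": [300, 420, 510], "wind_speed_kt": [5.0, 12.0, 8.5]})
current = pd.DataFrame({"taxi_time_sec": [900, 880, 960], "wind_speed_kt": [25.0, 30.0, 28.0]})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
ws.add_report(project.id, report)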
15. Jupyter Lab — Development Environment
15.1 Custom Startup Script
File: notebooks/startup.sh (mounted into Jupyter container)
#!/bin/bash
# Install additional packages not in the base pyspark-notebook image
pip install --quiet \
mlflow==2.16.0 \
feast==0.40.0 \
pyiceberg[s3]==0.7.1 \
trino[sqlalchemy]==0.329.0 \
qdrant-client==1.11.0 \
evidently==0.4.33 \
openlineage-python==1.22.0 \
bentoml==1.3.0 \
boto3==1.34.0 \
pandas==2.2.0 \
scikit-learn==1.5.0 \
xgboost==2.1.0
echo "SoloLakehouse v2 packages installed."
16. Prometheus Scrape Configuration
File: monitoring/prometheus/prometheus.yml
This replaces the existing file entirely. All previous targets are preserved and new ones added.
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: sololakehouse
scrape_configs:
# ── Infrastructure ──────────────────────────────────────────────────────────
- job_name: node-exporter
static_configs:
- targets: ['node-exporter:9100']
labels:
instance: sololakehouse-host
- job_name: cadvisor
static_configs:
- targets: ['cadvisor:8080']
labels:
instance: sololakehouse-containers
# ── Query Engine ─────────────────────────────────────────────────────────────
- job_name: trino
metrics_path: /metrics
static_configs:
- targets: ['trino:8080']
labels:
component: trino
# ── Spark ────────────────────────────────────────────────────────────────────
- job_name: spark-master
metrics_path: /metrics/prometheus
static_configs:
- targets: ['spark-master:8080']
labels:
component: spark-master
- job_name: spark-worker
metrics_path: /metrics/prometheus
static_configs:
- targets: ['spark-worker:8081']
labels:
component: spark-worker
# ── ML Platform ──────────────────────────────────────────────────────────────
- job_name: mlflow
metrics_path: /metrics
static_configs:
- targets: ['mlflow:5000']
labels:
component: mlflow
# ── Vector DB ────────────────────────────────────────────────────────────────
- job_name: qdrant
metrics_path: /metrics
static_configs:
- targets: ['qdrant:6333']
labels:
component: qdrant
# ── Nessie ───────────────────────────────────────────────────────────────────
- job_name: nessie
metrics_path: /q/metrics
static_configs:
- targets: ['nessie:19120']
labels:
component: nessie
17. Grafana Provisioning
17.1 grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
17.2 grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1
providers:
- name: SoloLakehouse
folder: SoloLakehouse
type: file
options:
path: /etc/grafana/provisioning/dashboards
17.3 grafana/provisioning/dashboards/sololakehouse_overview.json
{
"__inputs": [],
"__requires": [
{ "type": "datasource", "id": "prometheus", "version": "1.0.0" }
],
"annotations": { "list": [] },
"description": "SoloLakehouse v2 — Platform Overview",
"editable": true,
"graphTooltip": 1,
"id": null,
"panels": [
{
"datasource": "Prometheus",
"fieldConfig": { "defaults": { "unit": "percent" } },
"gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
"id": 1,
"title": "Host CPU Usage",
"type": "stat",
"targets": [{
"expr": "100 - (avg(rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)",
"legendFormat": "CPU %"
}]
},
{
"datasource": "Prometheus",
"fieldConfig": { "defaults": { "unit": "bytes" } },
"gridPos": { "h": 4, "w": 6, "x": 6, "y": 0 },
"id": 2,
"title": "Host Memory Used",
"type": "stat",
"targets": [{
"expr": "node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes",
"legendFormat": "Memory Used"
}]
},
{
"datasource": "Prometheus",
"gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
"id": 3,
"title": "Running Containers",
"type": "stat",
"targets": [{
"expr": "count(container_last_seen{container!='',image!=''})",
"legendFormat": "Containers"
}]
}
],
"schemaVersion": 38,
"title": "SoloLakehouse v2 Overview",
"uid": "sololakehouse-overview",
"version": 1
}
18. Complete docker-compose.yml v2
This is the complete replacement for the existing `docker-compose.yml`. Every service from v1 is preserved. New services are added. Changed services are noted inline.
# ============================================================
# SoloLakehouse v2 — Complete Docker Compose
# ============================================================
x-airflow-env: &airflow-env
AIRFLOW__CORE__EXECUTOR: LocalExecutor
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${PG_PASSWORD}@postgres:5432/airflow
AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY}
AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW_SECRET_KEY}
AIRFLOW__CORE__DAGS_FOLDER: /opt/airflow/dags
AIRFLOW__CORE__LOAD_EXAMPLES: "false"
  AIRFLOW__LINEAGE__BACKEND: openlineage.lineage_backend.OpenLineageBackend
OPENLINEAGE_URL: http://marquez:5000
OPENLINEAGE_NAMESPACE: sololakehouse
MLFLOW_TRACKING_URI: http://mlflow:5000
MLFLOW_S3_ENDPOINT_URL: http://minio:9000
AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER}
AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD}
_PIP_ADDITIONAL_REQUIREMENTS: >-
apache-airflow-providers-apache-spark==4.9.0
apache-airflow-providers-trino==5.7.0
openlineage-airflow==1.22.0
boto3==1.34.0
trino==0.329.0
services:
# =========================
# Nginx Proxy Manager
# (UNCHANGED from v1)
# =========================
npm:
image: jc21/nginx-proxy-manager:latest
container_name: sololakehouse-npm
restart: unless-stopped
ports:
- "80:80"
- "443:443"
- "127.0.0.1:81:81"
volumes:
- ./data/npm/data:/data
- ./data/npm/letsencrypt:/etc/letsencrypt
- ./logs/npm:/var/log/nginx
networks:
- lake_net
# =========================
# MinIO Object Storage
# (UNCHANGED from v1)
# =========================
minio:
image: minio/minio:RELEASE.2024-11-07T00-52-20Z
container_name: sololakehouse-minio
restart: unless-stopped
expose:
- "9000"
- "9001"
environment:
MINIO_ROOT_USER: ${MINIO_ROOT_USER}
MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
command: server /data --console-address ":9001"
volumes:
- ./data/minio:/data
networks:
- lake_net
# =========================
# MinIO Init (one-shot bucket creation)
# NEW in v2
# =========================
minio-init:
image: minio/mc:latest
container_name: sololakehouse-minio-init
restart: "no"
depends_on:
- minio
entrypoint: >
/bin/sh -c "
sleep 5;
mc alias set local http://minio:9000 ${MINIO_ROOT_USER} ${MINIO_ROOT_PASSWORD};
mc mb -p local/lakehouse;
mc mb -p local/mlflow;
mc mb -p local/airflow-logs;
mc mb -p local/feast-offline;
echo 'MinIO buckets created.';
"
networks:
- lake_net
# =========================
# Postgres
# CHANGED in v2: added init SQL directory
# =========================
postgres:
image: postgres:16-alpine
container_name: sololakehouse-postgres
restart: unless-stopped
command:
[
"postgres",
"-c", "password_encryption=md5",
"-c", "hba_file=/etc/postgres/pg_hba.conf"
]
environment:
POSTGRES_DB: metastore
POSTGRES_USER: metastore
POSTGRES_PASSWORD: ${PG_PASSWORD}
volumes:
- ./data/postgres:/var/lib/postgresql/data
- ./postgres/pg_hba.conf:/etc/postgres/pg_hba.conf:ro
- ./postgres/init:/docker-entrypoint-initdb.d:ro
expose:
- "5432"
networks:
- lake_net
# =========================
# Hive Metastore
# KEPT for backward compatibility (Trino hive catalog still works)
# Primary catalog is now Nessie — see iceberg.properties
# =========================
hive-metastore:
image: apache/hive:3.1.3
container_name: sololakehouse-hive-metastore
restart: unless-stopped
depends_on:
- postgres
environment:
- HADOOP_CLIENT_OPTS=-Xmx1G
volumes:
- ./hive/conf:/opt/hive/conf
- ./hive/lib:/opt/hive/lib/custom
- ./data/hive-metastore/warehouse:/warehouse
expose:
- "9083"
networks:
- lake_net
entrypoint: /bin/bash
command:
- -c
- >
export HADOOP_CLASSPATH=/opt/hive/lib/custom/*:$HADOOP_CLASSPATH;
/opt/hive/bin/schematool -dbType postgres -initSchema --verbose || true;
/opt/hive/bin/hive --service metastore
# =========================
# Project Nessie (Git-like Iceberg REST Catalog)
# NEW in v2 — replaces Hive as primary catalog
# =========================
nessie:
image: projectnessie/nessie:0.95.0
container_name: sololakehouse-nessie
restart: unless-stopped
depends_on:
- postgres
environment:
nessie.version.store.type: JDBC
quarkus.datasource.jdbc.url: jdbc:postgresql://postgres:5432/nessie
quarkus.datasource.username: nessie
quarkus.datasource.password: ${PG_PASSWORD}
quarkus.http.port: "19120"
quarkus.http.host: "0.0.0.0"
quarkus.http.cors: "true"
quarkus.http.cors.origins: "*"
expose:
- "19120"
networks:
- lake_net
# =========================
# Trino (SQL Query Engine)
# CHANGED in v2: now uses Iceberg + Nessie catalog
# =========================
trino:
image: trinodb/trino:455
container_name: sololakehouse-trino
restart: unless-stopped
depends_on:
- nessie
- minio
- hive-metastore
ports:
- "127.0.0.1:8080:8080"
volumes:
- ./trino/etc:/etc/trino
environment:
MINIO_ROOT_USER: ${MINIO_ROOT_USER}
MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
networks:
- lake_net
# =========================
# Apache Spark (Standalone Cluster)
# NEW in v2 — Databricks compute layer equivalent
# =========================
spark-master:
image: bitnami/spark:3.5.3
container_name: sololakehouse-spark-master
restart: unless-stopped
environment:
SPARK_MODE: master
SPARK_MASTER_HOST: spark-master
SPARK_RPC_AUTHENTICATION_ENABLED: "no"
SPARK_RPC_ENCRYPTION_ENABLED: "no"
AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER}
AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD}
expose:
- "7077"
ports:
- "127.0.0.1:4040:8080"
volumes:
- ./spark/conf/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf:ro
- ./spark/conf/log4j2.properties:/opt/bitnami/spark/conf/log4j2.properties:ro
- ./data/spark-events:/tmp/spark-events
networks:
- lake_net
spark-worker:
image: bitnami/spark:3.5.3
container_name: sololakehouse-spark-worker
restart: unless-stopped
depends_on:
- spark-master
environment:
SPARK_MODE: worker
SPARK_MASTER_URL: spark://spark-master:7077
SPARK_WORKER_CORES: "4"
SPARK_WORKER_MEMORY: 8G
AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER}
AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD}
expose:
- "8081"
volumes:
- ./spark/conf/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf:ro
- ./data/spark-events:/tmp/spark-events
networks:
- lake_net
# =========================
# CloudBeaver (Web SQL IDE)
# (UNCHANGED from v1)
# =========================
cloudbeaver:
image: dbeaver/cloudbeaver:25.3.3
container_name: sololakehouse-cloudbeaver
restart: unless-stopped
depends_on:
- trino
expose:
- "8978"
volumes:
- ./data/cloudbeaver:/opt/cloudbeaver/workspace
networks:
- lake_net
# =========================
# MLflow
# CHANGED in v2: Postgres backend instead of SQLite
# =========================
mlflow:
image: ghcr.io/mlflow/mlflow:v2.16.0
container_name: sololakehouse-mlflow
restart: unless-stopped
depends_on:
- postgres
- minio
ports:
- "127.0.0.1:5000:5000"
environment:
MLFLOW_S3_ENDPOINT_URL: http://minio:9000
AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER}
AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD}
AWS_DEFAULT_REGION: us-east-1
MLFLOW_S3_IGNORE_TLS: "true"
volumes:
- ./data/mlflow:/mlflow
command:
- /bin/sh
- -lc
- >
pip install --quiet psycopg2-binary boto3 &&
mlflow server
--host 0.0.0.0
--port 5000
--backend-store-uri postgresql://mlflow:${PG_PASSWORD}@postgres:5432/mlflow
--default-artifact-root s3://mlflow
--allowed-hosts solo-mlflow.sololake.space,localhost,127.0.0.1
networks:
- lake_net
# =========================
# Airflow Webserver
# NEW in v2 — Databricks Workflows equivalent
# =========================
airflow-webserver:
image: apache/airflow:2.10.0
container_name: sololakehouse-airflow-web
restart: unless-stopped
depends_on:
- postgres
- minio
environment:
<<: *airflow-env
_AIRFLOW_WWW_USER_CREATE: "true"
_AIRFLOW_WWW_USER_USERNAME: admin
_AIRFLOW_WWW_USER_PASSWORD: ${AIRFLOW_ADMIN_PASSWORD}
ports:
- "127.0.0.1:8083:8080"
volumes:
- ./airflow/dags:/opt/airflow/dags
- ./airflow/plugins:/opt/airflow/plugins
- ./data/airflow:/opt/airflow/logs
command: webserver
networks:
- lake_net
airflow-scheduler:
image: apache/airflow:2.10.0
container_name: sololakehouse-airflow-scheduler
restart: unless-stopped
depends_on:
- airflow-webserver
environment:
<<: *airflow-env
volumes:
- ./airflow/dags:/opt/airflow/dags
- ./airflow/plugins:/opt/airflow/plugins
- ./data/airflow:/opt/airflow/logs
command: scheduler
networks:
- lake_net
airflow-init:
image: apache/airflow:2.10.0
container_name: sololakehouse-airflow-init
restart: "no"
depends_on:
- postgres
environment:
<<: *airflow-env
command: db migrate
networks:
- lake_net
# =========================
# Redis (Online store for Feast)
# NEW in v2
# =========================
redis:
image: redis:7.4-alpine
container_name: sololakehouse-redis
restart: unless-stopped
expose:
- "6379"
volumes:
- ./data/redis:/data
command: redis-server --appendonly yes
networks:
- lake_net
# =========================
# Feast Feature Server
# NEW in v2 — Databricks Feature Engineering equivalent
# =========================
feast:
image: feastdev/feature-server:0.40.0
container_name: sololakehouse-feast
restart: unless-stopped
depends_on:
- postgres
- redis
- trino
environment:
FEAST_USAGE: "False"
volumes:
- ./feast/feature_store.yaml:/app/feature_store.yaml:ro
- ./feast/features:/app/features:ro
ports:
- "127.0.0.1:6566:6566"
command: feast serve --host 0.0.0.0 --port 6566
networks:
- lake_net
# =========================
# BentoML Model Serving
# NEW in v2 — Databricks Model Serving equivalent
# =========================
bentoml:
image: python:3.11-slim
container_name: sololakehouse-bentoml
restart: unless-stopped
depends_on:
- mlflow
environment:
MLFLOW_TRACKING_URI: http://mlflow:5000
MLFLOW_S3_ENDPOINT_URL: http://minio:9000
AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER}
AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD}
BENTOML_HOME: /bentoml
volumes:
- ./data/bentoml:/bentoml
- ./bentoml/services:/services
ports:
- "127.0.0.1:3001:3000"
command: >
/bin/sh -c "
pip install --quiet bentoml==1.3.0 mlflow==2.16.0 boto3 scikit-learn pandas numpy &&
cd /services &&
bentoml serve taxi_predictor:FRATaxiPredictor --host 0.0.0.0 --port 3000 --reload
"
networks:
- lake_net
# =========================
# Qdrant Vector Database
# NEW in v2 — Databricks Vector Search equivalent
# =========================
qdrant:
image: qdrant/qdrant:v1.11.0
container_name: sololakehouse-qdrant
restart: unless-stopped
expose:
- "6333"
- "6334"
ports:
- "127.0.0.1:6333:6333"
volumes:
- ./data/qdrant:/qdrant/storage
networks:
- lake_net
# =========================
# Marquez (Data Lineage)
# NEW in v2 — Unity Catalog Lineage equivalent
# =========================
marquez:
image: marquezproject/marquez:0.50.0
container_name: sololakehouse-marquez
restart: unless-stopped
depends_on:
- postgres
volumes:
- ./marquez/marquez.yml:/opt/marquez/marquez.yml:ro
expose:
- "5000"
- "5001"
ports:
- "127.0.0.1:3002:5000"
command: ["--config", "/opt/marquez/marquez.yml"]
networks:
- lake_net
marquez-web:
image: marquezproject/marquez-web:0.50.0
container_name: sololakehouse-marquez-web
restart: unless-stopped
depends_on:
- marquez
environment:
MARQUEZ_HOST: marquez
MARQUEZ_PORT: "5000"
expose:
- "3000"
ports:
- "127.0.0.1:3003:3000"
networks:
- lake_net
# =========================
# Evidently AI (ML Monitoring)
# NEW in v2 — Lakehouse Monitoring equivalent
# =========================
evidently:
image: evidently/evidently-service:latest
container_name: sololakehouse-evidently
restart: unless-stopped
volumes:
- ./data/evidently:/app/workspace
- ./evidently/config.yml:/app/config.yml:ro
ports:
- "127.0.0.1:8085:8085"
networks:
- lake_net
# =========================
# Jupyter Lab (Development Environment)
# NEW in v2
# =========================
jupyter:
image: jupyter/pyspark-notebook:spark-3.5.0
container_name: sololakehouse-jupyter
restart: unless-stopped
depends_on:
- spark-master
- mlflow
- minio
- trino
environment:
JUPYTER_ENABLE_LAB: "yes"
JUPYTER_TOKEN: ${JUPYTER_TOKEN}
MLFLOW_TRACKING_URI: http://mlflow:5000
MLFLOW_S3_ENDPOINT_URL: http://minio:9000
AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER}
AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD}
SPARK_MASTER: spark://spark-master:7077
OPENLINEAGE_URL: http://marquez:5000
volumes:
- ./notebooks:/home/jovyan/work
- ./spark/conf/spark-defaults.conf:/usr/local/spark/conf/spark-defaults.conf:ro
- ./notebooks/startup.sh:/usr/local/bin/before-notebook.d/startup.sh:ro
ports:
- "127.0.0.1:8888:8888"
networks:
- lake_net
# =========================
# Beszel Hub
# (UNCHANGED from v1)
# =========================
beszel:
image: henrygd/beszel:latest
container_name: sololakehouse-beszel
restart: unless-stopped
ports:
- "127.0.0.1:8090:8090"
environment:
APP_URL: http://localhost:8090
volumes:
- ./data/beszel/hub:/beszel_data
- ./data/beszel/socket:/beszel_socket
networks:
- lake_net
beszel-agent:
image: henrygd/beszel-agent:latest
container_name: sololakehouse-beszel-agent
restart: unless-stopped
network_mode: host
volumes:
- ./data/beszel/agent:/var/lib/beszel-agent
- ./data/beszel/socket:/beszel_socket
- /var/run/docker.sock:/var/run/docker.sock:ro
environment:
LISTEN: /beszel_socket/beszel.sock
HUB_URL: http://localhost:8090
TOKEN: ${BESZEL_TOKEN}
KEY: ${BESZEL_KEY}
# =========================
# Prometheus
# CHANGED in v2: new prometheus.yml with more scrape targets
# =========================
prometheus:
image: prom/prometheus:latest
container_name: sololakehouse-prometheus
restart: unless-stopped
ports:
- "127.0.0.1:9090:9090"
volumes:
- ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./data/prometheus:/prometheus
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=30d"
networks:
- lake_net
# =========================
# Grafana
# CHANGED in v2: provisioning mounts added
# =========================
grafana:
image: grafana/grafana:latest
container_name: sololakehouse-grafana
restart: unless-stopped
depends_on:
- prometheus
ports:
- "127.0.0.1:3000:3000"
environment:
GF_SECURITY_ADMIN_USER: admin
GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
GF_SERVER_SERVE_FROM_SUB_PATH: "false"
volumes:
- ./data/grafana:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning:ro
networks:
- lake_net
# =========================
# Node Exporter
# (UNCHANGED from v1)
# =========================
node-exporter:
image: prom/node-exporter:latest
container_name: sololakehouse-node-exporter
restart: unless-stopped
expose:
- "9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- "--path.procfs=/host/proc"
- "--path.sysfs=/host/sys"
- "--path.rootfs=/rootfs"
networks:
- lake_net
# =========================
# cAdvisor
# (UNCHANGED from v1)
# =========================
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: sololakehouse-cadvisor
restart: unless-stopped
expose:
- "8080"
volumes:
- /:/rootfs:ro
- /var/run:/var/run:rw
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
networks:
- lake_net
# =========================
# MkDocs Material
# (UNCHANGED from v1)
# =========================
mkdocs:
image: squidfunk/mkdocs-material:latest
container_name: sololakehouse-mkdocs
restart: unless-stopped
ports:
- "127.0.0.1:8000:8000"
volumes:
- ./docs:/docs
- ./docs/data/mkdocs-cache:/root/.cache
environment:
- PYTHONUNBUFFERED=1
command:
- serve
- --dev-addr=0.0.0.0:8000
- --livereload
- --dirty
networks:
- lake_net
networks:
lake_net:
name: sololakehouse_lake_net
driver: bridge
19. MinIO Bucket Bootstrap Script
File: scripts/bootstrap.sh
This script is idempotent. Run it once after `docker compose up -d` to verify the platform is healthy. It does NOT replace the `minio-init` container (which runs automatically); it adds post-startup verifications and Airflow variable seeding.
#!/usr/bin/env bash
set -euo pipefail
# Load .env
set -a; source .env; set +a
echo "======================================================"
echo " SoloLakehouse v2 — Bootstrap & Verification Script"
echo "======================================================"
# ── 1. Wait for Postgres ──────────────────────────────────────────────────────
echo "[1/7] Waiting for Postgres..."
until docker exec sololakehouse-postgres pg_isready -U metastore > /dev/null 2>&1; do
sleep 2
done
echo " Postgres: OK"
# ── 2. Wait for Nessie ───────────────────────────────────────────────────────
echo "[2/7] Waiting for Nessie REST catalog..."
# Nessie is only exposed on the internal Docker network, so this host probe is
# best-effort and bounded; the catalog is verified again via Trino in step 4.
for i in $(seq 1 10); do
  curl -sf http://localhost:19120/api/v1/config > /dev/null 2>&1 && { echo "  Nessie: OK"; break; }
  sleep 3
done
# ── 3. Wait for Trino ────────────────────────────────────────────────────────
echo "[3/7] Waiting for Trino..."
until curl -sf http://localhost:8080/v1/info > /dev/null 2>&1; do
sleep 3
done
echo " Trino: OK"
# ── 4. Verify Iceberg catalog in Trino ───────────────────────────────────────
echo "[4/7] Verifying Iceberg catalog via Trino..."
docker exec sololakehouse-trino \
trino --execute "SHOW CATALOGS" 2>/dev/null | grep -q "iceberg" \
&& echo " iceberg catalog: OK" \
|| echo " WARNING: iceberg catalog not yet available — check nessie logs"
# ── 5. Wait for MLflow ───────────────────────────────────────────────────────
echo "[5/7] Waiting for MLflow..."
until curl -sf http://localhost:5000/health > /dev/null 2>&1; do
sleep 3
done
echo " MLflow: OK"
# ── 6. Wait for Airflow ──────────────────────────────────────────────────────
echo "[6/7] Waiting for Airflow Webserver..."
until curl -sf http://localhost:8083/health > /dev/null 2>&1; do
sleep 5
done
echo " Airflow: OK"
# Seed Airflow Variables for DAGs
echo " Seeding Airflow Variables..."
docker exec sololakehouse-airflow-web \
airflow variables set minio_access_key "${MINIO_ROOT_USER}" || true
docker exec sololakehouse-airflow-web \
airflow variables set minio_secret_key "${MINIO_ROOT_PASSWORD}" || true
# ── 7. Create Iceberg schemas ────────────────────────────────────────────────
echo "[7/7] Creating Iceberg Bronze/Silver/Gold schemas..."
docker exec sololakehouse-trino trino --execute \
"CREATE SCHEMA IF NOT EXISTS iceberg.bronze WITH (location = 's3://lakehouse/bronze/')" \
2>/dev/null && echo " bronze: OK" || true
docker exec sololakehouse-trino trino --execute \
"CREATE SCHEMA IF NOT EXISTS iceberg.silver WITH (location = 's3://lakehouse/silver/')" \
2>/dev/null && echo " silver: OK" || true
docker exec sololakehouse-trino trino --execute \
"CREATE SCHEMA IF NOT EXISTS iceberg.gold WITH (location = 's3://lakehouse/gold/')" \
2>/dev/null && echo " gold: OK" || true
echo ""
echo "======================================================"
echo " Bootstrap complete. SoloLakehouse v2 is ready."
echo "======================================================"
echo ""
echo " MLflow: http://localhost:5000"
echo " Airflow: http://localhost:8083"
echo " Trino: http://localhost:8080"
echo " Jupyter: http://localhost:8888"
echo " Grafana: http://localhost:3000"
echo " Qdrant: http://localhost:6333"
echo " Marquez: http://localhost:3002"
echo " Evidently: http://localhost:8085"
echo ""
20. Airflow Bootstrap DAG — Frankfurt Airport Taxi Pipeline
The two DAG files are specified in Section 8. Place both files in ./airflow/dags/. They will be auto-discovered by the scheduler.
21. Sample Notebook — End-to-End Platform Smoke Test
File: notebooks/00_platform_smoke_test.ipynb
Create this as a standard Jupyter notebook with the following cells. Each cell corresponds to one platform component.
# Cell 1 — Title
# SoloLakehouse v2 — End-to-End Platform Smoke Test
# Run this notebook to verify all platform components are working together.
# Cell 2 — MinIO connectivity
import os
import boto3
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],         # injected by docker-compose (= MINIO_ROOT_USER)
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],  # injected by docker-compose (= MINIO_ROOT_PASSWORD)
)
buckets = [b["Name"] for b in s3.list_buckets()["Buckets"]]
print("MinIO buckets:", buckets)
assert "lakehouse" in buckets, "lakehouse bucket missing"
assert "mlflow" in buckets, "mlflow bucket missing"
print("✅ MinIO: OK")
# Cell 3 — Trino + Iceberg
import trino
conn = trino.dbapi.connect(host="trino", port=8080, user="smoke-test")
cur = conn.cursor()
cur.execute("SHOW CATALOGS")
catalogs = [row[0] for row in cur.fetchall()]
print("Trino catalogs:", catalogs)
assert "iceberg" in catalogs, "iceberg catalog not found in Trino"
print("✅ Trino + Iceberg catalog: OK")
# Cell 4 — Write an Iceberg table via Trino
cur.execute("""
CREATE TABLE IF NOT EXISTS iceberg.bronze.smoke_test (
id INTEGER, value VARCHAR
) WITH (format = 'PARQUET', location = 's3://lakehouse/bronze/smoke_test/')
""")
cur.execute("INSERT INTO iceberg.bronze.smoke_test VALUES (1, 'hello_sololakehouse')")
cur.execute("SELECT * FROM iceberg.bronze.smoke_test")
rows = cur.fetchall()
print("Iceberg write/read:", rows)
assert rows[0][1] == "hello_sololakehouse"
print("✅ Iceberg ACID write via Trino: OK")
# Cell 5 — PySpark with Iceberg + Nessie
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.appName("smoke-test")
.master("spark://spark-master:7077")
.getOrCreate())
df = spark.createDataFrame([(1, "spark-test")], ["id", "name"])
df.writeTo("nessie.bronze.spark_smoke_test").createOrReplace()
result = spark.table("nessie.bronze.spark_smoke_test").collect()
print("Spark + Nessie write:", result)
print("✅ PySpark + Nessie Iceberg: OK")
spark.stop()
# Cell 6 — MLflow
import mlflow
mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("smoke-test")
with mlflow.start_run(run_name="platform-verify"):
mlflow.log_param("component", "sololakehouse-v2")
mlflow.log_metric("smoke_test_passed", 1.0)
print("✅ MLflow experiment tracking: OK")
# Cell 7 — Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
qc = QdrantClient(host="qdrant", port=6333)
qc.recreate_collection(
collection_name="smoke_test",
vectors_config=VectorParams(size=4, distance=Distance.COSINE)
)
collections = [c.name for c in qc.get_collections().collections]
print("Qdrant collections:", collections)
assert "smoke_test" in collections
print("✅ Qdrant: OK")
# Cell 8 — Summary
print("")
print("══════════════════════════════════")
print(" SoloLakehouse v2 Smoke Test")
print(" ALL COMPONENTS: PASSED ✅")
print("══════════════════════════════════")
22. Startup Sequence
The agent must execute these commands in order on the target host.
# Step 1: Clone / copy the repository
cd /opt/sololakehouse # or wherever you keep the project
# Step 2: Create .env with the contents from Section 3 and fill in all CHANGE_ME values
# Edit .env manually or via sed for CI environments
# Step 3: Create all required data directories
mkdir -p data/{npm/{data,letsencrypt},minio,postgres,hive-metastore/warehouse}
mkdir -p data/{mlflow,cloudbeaver,grafana,prometheus,beszel/{hub,agent,socket}}
mkdir -p data/{redis,qdrant,evidently,bentoml,airflow}
mkdir -p data/{spark-events,nessie}
mkdir -p logs/npm
# Step 4: Pull all images in parallel (saves startup time)
docker compose pull
# Step 5: Start infrastructure tier first (postgres, minio, nessie)
docker compose up -d postgres minio
sleep 10
# Step 6: Run minio-init and airflow-init
docker compose up -d minio-init airflow-init
sleep 15
# Step 7: Start catalog and compute tier
docker compose up -d nessie hive-metastore
sleep 15
# Step 8: Start all remaining services
docker compose up -d
# Step 9: Run bootstrap script
chmod +x scripts/bootstrap.sh
./scripts/bootstrap.sh
# Step 10: Trigger the first Airflow DAG manually to verify E2E pipeline
docker exec sololakehouse-airflow-web \
airflow dags trigger fra_taxi_bronze_ingestion
23. Verification Checklist
After running the startup sequence, verify each item manually:
| # | Check | Command / URL | Expected Result |
|---|---|---|---|
| 1 | Postgres databases exist | `docker exec sololakehouse-postgres psql -U metastore -c "\l"` | Lists: metastore, nessie, mlflow, airflow, feast, marquez |
| 2 | MinIO buckets | http://localhost:9001 (Console) | Buckets: lakehouse, mlflow, airflow-logs, feast-offline |
| 3 | Nessie REST | `curl http://nessie:19120/api/v1/config` from a container on lake_net (not published to the host) | JSON response with defaultBranch: main |
| 4 | Trino Iceberg catalog | `docker exec sololakehouse-trino trino --execute "SHOW CATALOGS"` | Includes iceberg |
| 5 | Iceberg schemas | `docker exec sololakehouse-trino trino --execute "SHOW SCHEMAS IN iceberg"` | Includes bronze, silver, gold |
| 6 | Spark cluster | http://localhost:4040 | 1 worker registered, status ALIVE |
| 7 | MLflow Postgres | http://localhost:5000 | UI loads; no sqlite errors in logs |
| 8 | Airflow | http://localhost:8083 | DAGs fra_taxi_bronze_ingestion and fra_taxi_silver_transform visible |
| 9 | Feast | `curl http://localhost:6566/health` | {"status": "up"} |
| 10 | Qdrant | http://localhost:6333/dashboard | Dashboard loads |
| 11 | Marquez | http://localhost:3002/api/v1/namespaces | JSON with namespaces list |
| 12 | Evidently | http://localhost:8085 | UI loads |
| 13 | Jupyter | http://localhost:8888 | Lab interface loads with token |
| 14 | Grafana | http://localhost:3000 | Login works; SoloLakehouse dashboard visible |
| 15 | E2E Pipeline | Run smoke test notebook | All cells pass with ✅ |
Appendix A — Key Architecture Decisions (ADR Summary)
| Decision | Chosen | Rejected | Reason |
|---|---|---|---|
| Table Format | Apache Iceberg | Delta Lake | Iceberg is vendor-neutral; Nessie only supports Iceberg |
| Catalog | Project Nessie | Apache Atlas, OpenMetadata | Nessie implements Iceberg REST spec natively; git-like branching |
| Orchestration | Apache Airflow | Prefect, Dagster | Widest adoption, best provider ecosystem, Databricks Workflows mental model |
| Feature Store | Feast | Hopsworks, custom Delta tables | Feast is purpose-built, has Trino offline store support, Redis online store |
| Model Serving | BentoML | Seldon, Ray Serve | BentoML has native MLflow Registry integration and simplest Docker deployment |
| Vector DB | Qdrant | Weaviate, Milvus | Best performance/resource ratio for single-node self-hosted deployment |
| Lineage | Marquez + OpenLineage | Apache Atlas | Marquez is the reference OpenLineage server; Airflow + Spark have first-class plugins |
| ML Monitoring | Evidently | Whylogs, Alibi Detect | Best UI, most active open-source community, Grafana-compatible metrics export |
End of SoloLakehouse v2 Build Specification