SoloLakehouse -- Packaging an Open-Source Portfolio Project

1. Positioning: This Is Not Just a Side Project, It's Your Platform Engineering Calling Card

The hiring manager's perspective

The capabilities a 100k+ Platform / Data Engineer candidate needs to demonstrate:

| Capability | How this project demonstrates it |
| --- | --- |
| System design | Architecture diagram + component selection rationale + trade-off analysis |
| IaC / reproducible deployment | One-command deploy scripts, Terraform, Helm |
| Observability | Full-stack monitoring with Prometheus + Grafana |
| Security awareness | Secrets management, TLS, network isolation |
| Documentation | Clear README, ADRs, operations runbook |
| End-to-end thinking | Complete pipeline from ingestion → cleansing → ML → inference |

Project naming

SoloLakehouse already works well as a name. Suggested tagline:

SoloLakehouse — A production-ready, open-source Lakehouse platform for small teams.
One command to deploy your own Databricks-like environment.

2. Project Structure (Open-Source Standard)

sololakehouse/
│
├── README.md                      ← the most important file (see below)
├── LICENSE                        ← Apache 2.0 recommended
├── ARCHITECTURE.md                ← in-depth architecture doc
├── CHANGELOG.md
├── Makefile                       ← one-command entry point
├── .env.example                   ← env var template (no real secrets)
│
├── docker-compose.yml             ← the core: full platform definition
├── docker-compose.override.yml    ← dev-environment overrides (optional)
├── docker-compose.monitoring.yml  ← monitoring stack, can run separately
│
├── scripts/
│   ├── setup.sh                   ← first-time setup wizard
│   ├── health-check.sh            ← full-stack health check
│   ├── generate-secrets.sh        ← auto-generates .env
│   └── demo-data.sh               ← loads the demo dataset
│
├── spark/
│   ├── jars/                      ← .gitignore'd; README explains how to download
│   ├── download-jars.sh           ← auto-downloads dependency JARs
│   └── conf/
│       └── spark-defaults.conf
│
├── trino/
│   └── etc/
│       ├── config.properties
│       └── catalog/
│           ├── iceberg.properties
│           └── hive.properties
│
├── airflow/
│   ├── dags/
│   │   ├── bronze_ingestion.py    ← example DAG
│   │   ├── silver_transform.py
│   │   └── gold_aggregate.py
│   └── plugins/
│
├── notebooks/
│   ├── 01_quickstart.ipynb        ← interactive quickstart
│   ├── 02_medallion_etl.ipynb     ← end-to-end ETL demo
│   ├── 03_ml_training.ipynb       ← ML + MLflow demo
│   └── templates/
│       └── spark_session.py
│
├── hive/
│   ├── conf/
│   └── lib/
│
├── monitoring/
│   ├── prometheus/
│   │   └── prometheus.yml
│   └── grafana/
│       └── dashboards/
│           ├── platform-overview.json    ← pre-configured dashboard
│           └── spark-metrics.json
│
├── postgres/
│   ├── pg_hba.conf
│   └── init/
│       └── 01-create-databases.sql      ← auto-creates the airflow/feast databases
│
├── docs/                          ← MkDocs site source
│   ├── mkdocs.yml
│   └── docs/
│       ├── index.md
│       ├── getting-started.md
│       ├── architecture.md
│       ├── components/
│       │   ├── spark.md
│       │   ├── iceberg.md
│       │   ├── trino.md
│       │   └── airflow.md
│       └── adr/                   ← Architecture Decision Records
│           ├── 001-iceberg-over-delta.md
│           ├── 002-trino-over-presto.md
│           └── 003-redpanda-over-kafka.md
│
├── terraform/                     ← Phase 2: cloud deployment
│   ├── aws/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── modules/
│   │       ├── vpc/
│   │       ├── ec2/
│   │       └── s3/
│   ├── hetzner/                   ← cheap European VPS (a good fit since you're in Germany)
│   │   ├── main.tf
│   │   └── cloud-init.yaml
│   └── README.md
│
├── k8s/                           ← Phase 3: Kubernetes deployment (optional)
│   ├── helm/
│   │   └── sololakehouse/
│   │       ├── Chart.yaml
│   │       ├── values.yaml
│   │       └── templates/
│   └── README.md
│
└── tests/
    ├── test_connectivity.py       ← service connectivity tests (sketched below)
    ├── test_iceberg_rw.py         ← Iceberg read/write tests
    ├── test_etl_pipeline.py       ← end-to-end DAG tests
    └── conftest.py
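
To make the `tests/` directory concrete, here is a minimal sketch of what `test_connectivity.py` could look like. It only assumes `pytest`, `requests`, and the default localhost ports used elsewhere in this plan; the exact health endpoints may need adjusting to the real services.

```python
# tests/test_connectivity.py — minimal smoke test (sketch; ports and paths are assumptions)
import pytest
import requests

# Service -> health/info endpoint, matching the default ports used in this stack.
SERVICES = {
    "minio": "http://localhost:9000/minio/health/live",
    "spark-master": "http://localhost:8081",
    "trino": "http://localhost:8080/v1/info",
    "airflow": "http://localhost:8085/health",
    "mlflow": "http://localhost:5000/health",
    "grafana": "http://localhost:3000/api/health",
    "prometheus": "http://localhost:9090/-/healthy",
}


@pytest.mark.parametrize("name,url", sorted(SERVICES.items()))
def test_service_reachable(name, url):
    """Every core service should answer HTTP within a short timeout."""
    resp = requests.get(url, timeout=5)
    assert resp.status_code < 500, f"{name} returned HTTP {resp.status_code}"
```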

3. Makefile — One-Command Entry Point

This is the first impression your open-source project makes; people must be able to get it running in three steps:

.PHONY: help setup up down status logs test demo clean

# Recipes below rely on bash features (brace expansion in `make setup`)
SHELL := /bin/bash
COMPOSE = docker compose

help: ## Show this help
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | \
		awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'

# ============================================================
# 🚀 Quick Start
# ============================================================

setup: ## First-time setup: generate secrets, download JARs, create dirs
	@echo "🔧 Generating environment variables..."
	@bash scripts/generate-secrets.sh
	@echo "📦 Downloading Spark JARs..."
	@bash spark/download-jars.sh
	@echo "📁 Creating data directories..."
	@mkdir -p data/{minio,postgres,spark/events,airflow/logs,jupyter,redpanda,grafana,prometheus,mlflow,hive-metastore/warehouse,cloudbeaver,beszel/{hub,socket,agent},npm/{data,letsencrypt}}
	@mkdir -p logs/npm notebooks airflow/{dags,plugins}
	@echo "✅ Setup complete! Run 'make up' to start."

up: ## Start all services
	$(COMPOSE) up -d
	@echo ""
	@echo "🏠 SoloLakehouse is starting..."
	@echo "   Spark Master UI:  http://localhost:8081"
	@echo "   JupyterLab:       http://localhost:8888"
	@echo "   Airflow:          http://localhost:8085"
	@echo "   Trino:            http://localhost:8080"
	@echo "   CloudBeaver:      http://localhost:8978"
	@echo "   MLflow:           http://localhost:5000"
	@echo "   Grafana:          http://localhost:3000"
	@echo "   MinIO Console:    http://localhost:9001"
	@echo ""
	@echo "⏳ Services may take 30-60s to fully initialize."

down: ## Stop all services
	$(COMPOSE) down

restart: ## Restart all services
	$(COMPOSE) down && $(COMPOSE) up -d

# ============================================================
# 📊 Operations
# ============================================================

status: ## Show service status
	$(COMPOSE) ps --format "table {{.Name}}\t{{.Status}}\t{{.Ports}}"

logs: ## Tail logs (usage: make logs s=spark-master)
	$(COMPOSE) logs -f $(s)

health: ## Run health check on all services
	@bash scripts/health-check.sh

# ============================================================
# 🎯 Demo & Testing
# ============================================================

demo: ## Load demo dataset and run sample pipeline
	@echo "📊 Loading NYC Taxi demo dataset..."
	@bash scripts/demo-data.sh
	@echo "✅ Demo data loaded! Open JupyterLab to explore."

test: ## Run integration tests
	@echo "🧪 Running connectivity tests..."
	python -m pytest tests/ -v

# ============================================================
# 🧹 Maintenance
# ============================================================

clean: ## Remove all data volumes (DESTRUCTIVE!)
	@echo "⚠️  This will delete ALL data. Press Ctrl+C to cancel."
	@sleep 5
	$(COMPOSE) down -v
	rm -rf data/

spark-sql: ## Open Spark SQL shell
	$(COMPOSE) exec spark-master spark-sql

trino-cli: ## Open Trino CLI
	$(COMPOSE) exec trino trino
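
As a companion to the `demo` target above, the opening cell of `01_quickstart.ipynb` might look roughly like this. It is only a sketch: the `get_spark()` helper, the `lakehouse` catalog name, and the bucket/table paths are all assumptions (the session template itself is sketched after the ADR in section 7).

```python
# notebooks/01_quickstart.ipynb — opening cell (sketch; catalog, bucket and table names are assumptions)
from templates.spark_session import get_spark  # hypothetical helper, sketched in section 7

spark = get_spark("quickstart")

# Read the raw demo data that `make demo` dropped into MinIO (path is a placeholder).
taxi = spark.read.parquet("s3a://raw/nyc_taxi/")

# Write it as an Iceberg table to get ACID transactions and time travel.
taxi.writeTo("lakehouse.bronze.nyc_taxi").createOrReplace()

# Query it back through the shared catalog (Trino sees the same table).
spark.sql("SELECT COUNT(*) AS trips FROM lakehouse.bronze.nyc_taxi").show()

# Time travel: inspect the table's snapshots.
spark.sql("SELECT snapshot_id, committed_at FROM lakehouse.bronze.nyc_taxi.snapshots").show()
```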

4. scripts/generate-secrets.sh

#!/bin/bash
set -euo pipefail

ENV_FILE=".env"
EXAMPLE_FILE=".env.example"

if [ -f "$ENV_FILE" ]; then
    echo "⚠️  .env already exists. Skipping generation."
    echo "   Delete .env and re-run to regenerate."
    exit 0
fi

echo "🔐 Generating secrets..."

# Helpers for generating random secrets
gen_password() { openssl rand -base64 24 | tr -d '/+=' | head -c 24; }
gen_hex() { openssl rand -hex 32; }
gen_fernet() { python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())" 2>/dev/null || echo "REPLACE_ME_WITH_FERNET_KEY"; }

cat > "$ENV_FILE" << EOF
# ====================================
# SoloLakehouse Environment Variables
# Generated on $(date -u +"%Y-%m-%dT%H:%M:%SZ")
# ====================================

# ---- MinIO ----
MINIO_ROOT_USER=admin
MINIO_ROOT_PASSWORD=$(gen_password)

# ---- PostgreSQL ----
PG_PASSWORD=$(gen_password)

# ---- Grafana ----
GRAFANA_ADMIN_PASSWORD=$(gen_password)

# ---- Jupyter ----
JUPYTER_TOKEN=$(gen_hex)

# ---- Airflow ----
AIRFLOW_FERNET_KEY=$(gen_fernet)
AIRFLOW_SECRET_KEY=$(gen_hex)
AIRFLOW_ADMIN_PASSWORD=$(gen_password)
EOF

echo "✅ .env generated. Review it before starting: cat .env"

5. scripts/health-check.sh

#!/bin/bash
set -uo pipefail

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

check() {
    local name=$1 url=$2
    if curl -sf --connect-timeout 5 "$url" > /dev/null 2>&1; then
        echo -e "  ${GREEN}✓${NC} $name"
    else
        echo -e "  ${RED}✗${NC} $name ($url)"
    fi
}

echo ""
echo "🏥 SoloLakehouse Health Check"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

echo ""
echo "Storage:"
check "MinIO"           "http://localhost:9000/minio/health/live"

echo ""
echo "Compute:"
check "Spark Master"    "http://localhost:8081"
check "Trino"           "http://localhost:8080/v1/info"

echo ""
echo "Notebook & IDE:"
check "JupyterLab"      "http://localhost:8888/api/status"
check "CloudBeaver"     "http://localhost:8978"

echo ""
echo "Orchestration:"
check "Airflow"         "http://localhost:8085/health"

echo ""
echo "ML Platform:"
check "MLflow"          "http://localhost:5000/health"

echo ""
echo "Monitoring:"
check "Prometheus"      "http://localhost:9090/-/healthy"
check "Grafana"         "http://localhost:3000/api/health"

echo ""

6. README.md — The Most Important File

# 🏠 SoloLakehouse

**A production-ready, open-source Lakehouse platform you can deploy in minutes.**

One-command deployment of a complete Databricks-like environment built on
Apache Spark, Iceberg, Trino, Airflow, MLflow, and more.

![Architecture](docs/assets/architecture.png)

## Why SoloLakehouse?

Databricks is powerful but expensive. SoloLakehouse gives small teams and
individual engineers a self-hosted alternative with:

- 🔥 **Apache Spark** cluster for distributed computing
- 🧊 **Apache Iceberg** tables with ACID transactions and time travel
- ⚡ **Trino** for interactive SQL analytics
- 📓 **JupyterLab** with PySpark pre-configured
- 🔄 **Airflow** for workflow orchestration (Medallion ETL)
- 🧪 **MLflow** for experiment tracking and model registry
- 📊 **Grafana + Prometheus** for full-stack monitoring
- 🔐 **Nginx Proxy Manager** with auto TLS

## Quick Start

```bash
git clone https://github.com/yourusername/sololakehouse.git
cd sololakehouse
make setup   # Generate secrets, download JARs
make up      # Start all 15+ services
make demo    # Load sample dataset
```

Open http://localhost:8888 for JupyterLab.

## Architecture

┌─────────────────────────────────────────────────────┐
│                  Nginx Proxy Manager                │
│              (TLS termination + routing)             │
└────────┬──────────┬──────────┬──────────┬───────────┘
         │          │          │          │
    JupyterLab  CloudBeaver  Airflow   MLflow    ...
         │          │          │          │
┌────────┴──────────┴──────────┴──────────┴───────────┐
│  Spark Master + Workers    │    Trino (SQL Engine)   │
├────────────────────────────┼─────────────────────────┤
│    Apache Iceberg (Table Format, ACID, Time Travel)  │
├────────────────────────────┼─────────────────────────┤
│   Hive Metastore (Catalog) │   PostgreSQL (Backend)  │
├──────────────────────────────────────────────────────┤
│              MinIO (S3-compatible Object Storage)     │
│     s3://warehouse/  s3://raw/  s3://mlflow/         │
└──────────────────────────────────────────────────────┘

## Components

| Layer | Component | Purpose |
| --- | --- | --- |
| Storage | MinIO | S3-compatible object storage |
| Table Format | Apache Iceberg | ACID, time travel, schema evolution |
| Catalog | Hive Metastore | Table metadata management |
| Compute | Apache Spark 3.5 | Distributed data processing |
| SQL | Trino 455 | Interactive analytics |
| Notebook | JupyterLab | Interactive development |
| Orchestration | Apache Airflow | Workflow scheduling |
| ML | MLflow | Experiment tracking + model registry |
| Streaming | Redpanda | Kafka-compatible event streaming |
| Monitoring | Prometheus + Grafana | Metrics and dashboards |

## Documentation

Full docs: https://docs.sololake.space

## Roadmap

  • [x] Core Lakehouse (Spark + Iceberg + Trino + MinIO)
  • [x] Notebook environment (JupyterLab)
  • [x] Workflow orchestration (Airflow + Medallion ETL)
  • [x] ML platform (MLflow)
  • [x] Full-stack monitoring (Prometheus + Grafana)
  • [ ] Terraform deployment (Hetzner / AWS)
  • [ ] Kubernetes Helm chart
  • [ ] Apache Gravitino (Unity Catalog alternative)
  • [ ] Feature Store (Feast)
  • [ ] Model serving endpoint

## License

Apache License 2.0


---

## 7. Architecture Decision Records (ADR) — interview bonus points

This is a key differentiator between senior and junior engineers. Write an ADR for every major technology choice:

### `docs/docs/adr/001-iceberg-over-delta.md`

```markdown
# ADR 001: Apache Iceberg over Delta Lake

## Status: Accepted

## Context
We need an open table format for our Lakehouse. The main candidates
are Apache Iceberg, Delta Lake, and Apache Hudi.

## Decision
We chose Apache Iceberg.

## Rationale
1. **Multi-engine support**: Our architecture uses both Spark and
   Trino. Iceberg has first-class support in Trino, while Delta Lake
   requires a compatibility layer.
2. **Vendor neutrality**: Delta Lake is governed by Databricks.
   Iceberg is an Apache project with broader community governance.
3. **Hidden partitioning & partition evolution**: the partition spec can
   evolve without rewriting data — critical for layout changes in production.
4. **Catalog pluggability**: Iceberg supports Hive, REST, JDBC, and
   Gravitino catalogs, making future migration easier.

## Trade-offs
- Delta Lake has better Spark-native integration and simpler setup
- Delta Lake's OPTIMIZE and Z-ORDER are more mature
- Iceberg's REST catalog ecosystem is still evolving

## Consequences
- Trino can read/write the same tables as Spark without extra config
- We need Iceberg JARs in both Spark and Trino classpaths
- Future migration to Gravitino catalog will be straightforward
```
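
To make the multi-engine consequence concrete, here is a rough sketch of what `notebooks/templates/spark_session.py` could configure on the Spark side. The hostnames (`hive-metastore`, `minio`), the catalog name `lakehouse`, and the warehouse bucket are assumptions; Trino's `iceberg.properties` would simply point at the same Hive Metastore URI.

```python
# notebooks/templates/spark_session.py — sketch; hostnames, catalog and bucket names are assumptions
import os

from pyspark.sql import SparkSession


def get_spark(app_name: str = "sololakehouse") -> SparkSession:
    """Spark session wired to the Iceberg catalog (Hive Metastore) and MinIO via S3A."""
    return (
        SparkSession.builder.appName(app_name)
        # Iceberg catalog named `lakehouse`, backed by the shared Hive Metastore.
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lakehouse.type", "hive")
        .config("spark.sql.catalog.lakehouse.uri", "thrift://hive-metastore:9083")
        .config("spark.sql.catalog.lakehouse.warehouse", "s3a://warehouse/")
        # MinIO as the S3-compatible object store.
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
        .config("spark.hadoop.fs.s3a.access.key", os.environ["MINIO_ROOT_USER"])
        .config("spark.hadoop.fs.s3a.secret.key", os.environ["MINIO_ROOT_PASSWORD"])
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )
```

The Iceberg runtime and S3A JARs still have to be on the Spark classpath (that is what `spark/download-jars.sh` is for), which is exactly the classpath consequence noted in the ADR.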

8. Terraform Deployment (Phase 2)

Hetzner Cloud (recommended: cheap European VPS)

terraform/hetzner/main.tf

terraform {
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = "~> 1.45"
    }
  }
}

variable "hcloud_token" {
  sensitive = true
}

variable "ssh_public_key" {
  description = "SSH public key for server access"
}

provider "hcloud" {
  token = var.hcloud_token
}

# SSH Key
resource "hcloud_ssh_key" "default" {
  name       = "sololakehouse"
  public_key = var.ssh_public_key
}

# Firewall
resource "hcloud_firewall" "lakehouse" {
  name = "sololakehouse-fw"

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"
    source_ips = ["0.0.0.0/0", "::/0"]
  }
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "80"
    source_ips = ["0.0.0.0/0", "::/0"]
  }
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "443"
    source_ips = ["0.0.0.0/0", "::/0"]
  }
}

# Server
resource "hcloud_server" "lakehouse" {
  name        = "sololakehouse"
  server_type = "cpx41"          # 8 vCPU, 16 GB RAM, €28/mo
  image       = "ubuntu-24.04"
  location    = "fsn1"           # Falkenstein, Germany
  ssh_keys    = [hcloud_ssh_key.default.id]
  firewall_ids = [hcloud_firewall.lakehouse.id]

  user_data = file("cloud-init.yaml")
}

# Volume for persistent data
resource "hcloud_volume" "data" {
  name     = "sololakehouse-data"
  size     = 100                 # GB
  server_id = hcloud_server.lakehouse.id
  automount = true
  format    = "ext4"
}

output "server_ip" {
  value = hcloud_server.lakehouse.ipv4_address
}

terraform/hetzner/cloud-init.yaml

#cloud-config
package_update: true
packages:
  - docker.io
  - docker-compose-v2
  - git
  - make

runcmd:
  - systemctl enable --now docker
  - usermod -aG docker ubuntu
  - cd /opt && git clone https://github.com/yourusername/sololakehouse.git
  - cd /opt/sololakehouse && make setup && make up

9. Portfolio Presentation Strategy

The GitHub repository must include:

| Element | Why it matters |
| --- | --- |
| README with an architecture diagram | The hiring manager decides within 30 seconds whether to keep reading |
| One-command deployment that actually works | Proves what you built is real |
| ADR documents | Proves you can make technical decisions, not just glue things together |
| Demo notebooks | Visually demonstrate end-to-end capability |
| CI/CD (GitHub Actions) | Automatically tests that docker-compose can start (see the sketch after this table) |
| Stars / contributors | Social proof (you can promote the project on forums) |
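
For the CI/CD row, the cheapest useful check is one that validates the compose file on every push before attempting a full startup. A minimal sketch, assuming CI runs it from the repo root with a `.env` copied from `.env.example`; the GitHub Actions wiring around it is left out:

```python
# tests/test_compose_config.py — CI-friendly validation (sketch; assumes a .env is present)
import subprocess


def test_compose_config_is_valid():
    """`docker compose config --quiet` parses docker-compose.yml, interpolates
    the .env file, and exits non-zero on any error — cheap enough for every push."""
    result = subprocess.run(
        ["docker", "compose", "config", "--quiet"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, result.stderr
```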

How to write it on your résumé / LinkedIn

SoloLakehouse — Open-Source Lakehouse Platform
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Designed and built a production-ready, self-hosted alternative to
Databricks, orchestrating 15+ services (Spark, Iceberg, Trino,
Airflow, MLflow) on Docker with one-command deployment.

• Implemented Medallion architecture (Bronze → Silver → Gold)
  with Apache Iceberg ACID tables on MinIO object storage
• Built end-to-end ML pipeline: data ingestion → feature
  engineering → training → experiment tracking → model registry
• Provisioned cloud infrastructure via Terraform (Hetzner Cloud)
  with automated TLS, monitoring (Prometheus/Grafana), and
  health checks
• Published as open-source project with comprehensive ADRs,
  documentation site (MkDocs), and integration test suite

Tech: Spark, Iceberg, Trino, Airflow, MLflow, Docker, Terraform,
     PostgreSQL, MinIO, Prometheus, Grafana, Python

How to talk about it in interviews

Be ready to answer these questions:

  1. “Why Iceberg over Delta Lake?” → point to your ADR
  2. “How would you scale this?” → the Docker Swarm → K8s evolution path
  3. “What was the hardest part?” → cross-engine catalog consistency
  4. “How do you monitor it?” → show screenshots of your Grafana dashboards
  5. “What would you change?” → show that you understand the trade-offs

10. Timeline and Priorities

| Phase | Scope | Time | Impact on the 100k+ goal |
| --- | --- | --- | --- |
| Phase 1 | Complete Spark + Iceberg + Jupyter + Airflow | 2 weeks | ⭐⭐⭐⭐ |
| Phase 2 | Open-source packaging: Makefile + scripts + README + ADRs | 1 week | ⭐⭐⭐⭐⭐ |
| Phase 3 | Terraform deployment on Hetzner | 1 week | ⭐⭐⭐⭐ |
| Phase 4 | Demo notebooks + MkDocs docs site | 1 week | ⭐⭐⭐ |
| Phase 5 | K8s Helm chart (optional) | 2 weeks | ⭐⭐⭐ |
| Phase 6 | Promote on Reddit/HN/LinkedIn to earn stars | ongoing | ⭐⭐⭐⭐⭐ |

Phase 2 has the highest ROI. A well-packaged, well-documented open-source project is worth 10x more than one with more features but a three-line README.