SoloLakehouse: From Training Models to Building a Real Data Platform
For a long time, my work in AI looked like this:
Open a notebook.
Load some data.
Train a model.
Tune hyperparameters.
Repeat.
And honestly - it worked.
Models improved.
Accuracy went up.
Papers progressed.
But something always felt… incomplete.
Because every time I looked at real production systems in industry, I noticed one thing:
👉 The model is only a tiny part of the system.
The hard parts are everything around it.
- Where does the data live?
- Who manages schemas?
- How do you query it at scale?
- How do you track experiments?
- How do you reproduce results?
- How do you move from research to production?
That’s not “just ML”.
That’s platform engineering.
And I realized:
If I only train models, I’m only learning half the game.
The decision: build it myself
So I gave myself a small challenge:
Instead of using only managed cloud tools,
what if I built the whole stack from scratch?
Not just a pipeline.
Not just a demo.
But a real, end-to-end lakehouse platform - like a mini Databricks - running on my own infrastructure.
That’s how SoloLakehouse (SLH) started.
“Solo” because I’m building it alone.
“Lakehouse” because it combines storage, compute, and ML into one system.
What SoloLakehouse looks like today
Nothing fancy.
Just open-source pieces, carefully assembled like LEGO bricks:
- MinIO → object storage
- Hive Metastore + PostgreSQL → metadata catalog
- Trino → distributed SQL engine
- Spark → ETL + ML training
- MLflow → experiment tracking
- Nginx → secure access gateway
Together, they form a complete workflow:
raw data → ETL → tables → SQL analytics → model training → experiment tracking → artifacts
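To make that workflow concrete, here's roughly what the "raw data → ETL → tables" leg looks like in PySpark. Treat it as a minimal sketch, not SLH's actual code: the endpoint, credentials, bucket, database, and column names are placeholders, and the exact config keys can vary with Spark and Hadoop versions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("slh-etl-sketch")
    # Point Spark's S3A connector at MinIO instead of AWS S3 (placeholder endpoint/keys)
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "MINIO_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "MINIO_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    # Share the Hive Metastore, so anything written here is visible to Trino too
    .config("hive.metastore.uris", "thrift://hive-metastore:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# raw data -> ETL: read raw CSVs from a MinIO bucket, fix types, deduplicate
events = (
    spark.read.option("header", "true").csv("s3a://raw/events/")
    .withColumn("event_ts", F.to_timestamp("event_ts"))
    .dropDuplicates(["event_id"])
)

# ETL -> tables: register the cleaned data in the catalog
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
events.write.mode("overwrite").saveAsTable("analytics.events_clean")
```

The useful part isn't the transformation itself. It's that Spark and Trino point at the same metastore and the same MinIO buckets, so a table written here is immediately queryable from SQL.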
For the first time, my projects feel less like “notebooks”
and more like actual systems.
The mindset shift
This project quietly changed how I think about AI engineering.
Before, my questions were mostly:
- Which model performs better?
- Should I use a CNN or an LSTM?
- How can I squeeze out 2% more accuracy?
Now my questions sound different:
- How is the data versioned?
- Is this pipeline reproducible?
- Can others query this easily?
- What happens when the dataset grows 10×?
- How would this run in production?
In other words:
👉 I stopped thinking only like a researcher
👉 and started thinking like a platform engineer.
And that shift feels huge.
Because in real life, models fail less often than systems do.
Bad storage, messy schemas, missing tracking, fragile pipelines —
those are what really break projects.
Why this mattered more than I expected
Ironically, building SoloLakehouse taught me more than many of my pure ML experiments did.
Not because it’s complicated.
But because it forced me to understand:
- how Spark actually reads from S3
- how catalogs manage table metadata
- how SQL engines query data lakes
- how experiments stay reproducible
- how ML fits into a bigger architecture
Things you don’t learn from notebooks alone.
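For instance, the "SQL engines query data lakes" and "experiments stay reproducible" bullets boil down to a handful of lines once the stack is wired up. This is only a sketch: hostnames, ports, and the experiment name are made up for illustration.

```python
import mlflow
import trino

# How a SQL engine sees the lake: Trino resolves hive.analytics.events_clean
# through the shared Hive Metastore and reads the files straight from MinIO.
conn = trino.dbapi.connect(
    host="trino", port=8080, user="slh", catalog="hive", schema="analytics"
)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM events_clean")
row_count = cur.fetchone()[0]

# How experiments stay reproducible: every run records its inputs and results
# against the MLflow tracking server instead of living in a throwaway notebook.
mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("slh-demo")

with mlflow.start_run():
    mlflow.log_param("source_table", "analytics.events_clean")
    mlflow.log_param("row_count", row_count)
    mlflow.log_metric("accuracy", 0.0)  # stand-in for a real training metric
```

The point is that Trino never talks to Spark at all. Both only talk to the metastore and to MinIO, which is exactly what makes the pieces swappable.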
It feels closer to how companies like Databricks and Snowflake, and modern data teams in general, actually operate.
And that’s exactly the kind of engineer I want to become.
What’s next
The foundation is stable now.
Next steps:
- Delta tables
- Medallion architecture (Bronze / Silver / Gold)
- Unity Catalog–style governance
- More automation & MLOps
Slowly turning SoloLakehouse into a small, self-hosted production-grade platform.
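As a preview of the Delta / medallion step, here's the kind of bronze-to-silver job I have in mind. It assumes delta-spark gets added to the Spark image; the paths and column names are invented for the sketch, and the S3A/MinIO settings from the earlier example are omitted for brevity.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

builder = (
    SparkSession.builder
    .appName("slh-medallion-sketch")
    # Standard Delta Lake session settings
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Bronze: land raw events as-is, append-only
bronze = spark.read.json("s3a://raw/events/")
bronze.write.format("delta").mode("append").save("s3a://lake/bronze/events")

# Silver: deduplicated and typed, ready for analytics and training
silver = (
    spark.read.format("delta").load("s3a://lake/bronze/events")
    .dropDuplicates(["event_id"])
    .withColumn("event_ts", F.to_timestamp("event_ts"))
)
silver.write.format("delta").mode("overwrite").save("s3a://lake/silver/events")
```

Nothing exotic, but Delta's transaction log is what would finally give the lake versioned, time-travelable tables instead of bare Parquet files.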
Final thoughts
If there’s one thing I learned, it’s this:
Training models is fun.
Building systems is transformative.
SoloLakehouse started as a side project.
But it ended up changing how I see AI engineering completely.
And honestly…
designing the platform has been just as satisfying as building the models.
More updates soon 🚀