What Is a Lakehouse? Definition, Architecture & Open Table Formats Explained

A data team stores years of clickstream logs in cheap object storage, runs nightly ETL into a cloud warehouse for BI, and maintains a separate streaming pipeline for real-time dashboards. Three systems, three copies of “customer,” three definitions of “active user,” and a growing bill for moving the same bytes twice. A lakehouse is the architecture pattern that tries to collapse this sprawl: keep data in open, low-cost storage, but add warehouse-grade transactions, SQL performance, and governance on top through open table formats. This glossary entry defines what a lakehouse is, how it differs from a data lake and a data warehouse, which technologies make it work, and why AI agents querying your stack need to understand it.

Disclosure: Datus is a data engineering agent platform. This article explains lakehouse architecture as a general concept, referencing Datus alongside other tools and architectures in the category. See the end for more detail.

TL;DR

A lakehouse combines data lake economics and flexibility with warehouse-style ACID transactions, schema enforcement, and SQL analytics — typically on object storage plus an open table format layer.
It is not a vendor product name, not a data lake with a SQL engine bolted on, and not a replacement for every warehouse — but it reduces duplicate pipelines and copy-heavy “lake + warehouse” splits.
Open table formats — Apache Iceberg, Delta Lake, Apache Hudi — are the technical foundation that makes lakehouse semantics (upserts, time travel, schema evolution) possible on files.
Medallion architecture (Bronze / Silver / Gold) is a common organizational pattern on lakehouses, not a requirement of the definition.
AI agents and text-to-SQL systems fail on lakehouse estates when they see file paths instead of table semantics, miss partition keys, or treat Silver and Gold layers interchangeably — context must be lakehouse-aware.

1. Lakehouse: a working definition

Historically, organizations split analytics into two worlds. Data lakes stored raw files cheaply in S3, ADLS, or GCS — flexible, schema-on-read, weak on concurrent SQL and transactional updates. Data warehouses offered fast SQL, strong governance, and ACID guarantees — but often at higher cost and with less openness for ML and custom file processing.

A lakehouse merges the storage economics of the lake with warehouse-like capabilities by inserting a table abstraction layer on object storage. You still store Parquet (or similar) files in a bucket. But an open table format tracks which files belong to which table version, enables concurrent writers, supports UPDATE/MERGE, and exposes the table through SQL engines (Spark, Trino, Snowflake external tables, Databricks SQL, etc.).

Consider a concrete example. A growth team asks for “weekly active users by country from our product events table.” In a pure data lake, an analyst might grep paths like s3://events/year=2026/month=06/day=15/*.parquet, hope partition columns are consistent, and write fragile SQL that breaks when someone adds a new folder layout. In a lakehouse, they query analytics.events, a governed table with documented columns, a declared partition spec (for example, on country), and ACID commits so yesterday’s backfill does not race with today’s stream ingest. The SQL looks like a warehouse. The storage underneath is still open files.
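As a rough illustration of what "the SQL looks like a warehouse" means, the sketch below runs the weekly-active-users question from the example above against an in-memory SQLite table that stands in for analytics.events. The column names (user_id, country, event_ts) and the sample rows are assumptions made for illustration; on a real lakehouse a similar query would go through Spark, Trino, or another engine against the registered table, and the date functions would differ by dialect.

```python
import sqlite3

# In-memory SQLite stands in for a lakehouse SQL engine; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, country TEXT, event_ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("u1", "DE", "2026-06-15 09:00:00"),
        ("u1", "DE", "2026-06-16 10:30:00"),  # same user, same week: counted once
        ("u2", "DE", "2026-06-17 11:00:00"),
        ("u3", "US", "2026-06-15 12:00:00"),
        ("u3", "US", "2026-06-23 08:00:00"),  # the following week
    ],
)

# The analyst addresses a table and its columns, never a file path.
WAU_SQL = """
SELECT strftime('%Y-%W', event_ts) AS event_week,
       country,
       COUNT(DISTINCT user_id)     AS weekly_active_users
FROM events
GROUP BY event_week, country
ORDER BY event_week, country
"""

rows = conn.execute(WAU_SQL).fetchall()
for event_week, country, wau in rows:
    print(event_week, country, wau)
```

The query never mentions folders or file names, so a change in physical layout does not break it as long as the table contract holds.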

A useful working definition:

Lakehouse is an architecture that stores analytical data primarily on open, low-cost object storage, managed through open table formats that provide ACID transactions, schema evolution, and SQL-accessible tables, blending data lake flexibility with warehouse-grade reliability.

That definition excludes “we put Parquet in S3 and query it with Athena sometimes” without transactional table semantics — that is a data lake with ad hoc SQL, not a full lakehouse practice. It also excludes proprietary walled gardens with no open format exit — those may behave like a lakehouse operationally but fail the openness part of the industry definition.

2. Lakehouse vs data lake vs data warehouse

These three terms overlap in marketing but differ in default guarantees:

Dimension	Data lake	Lakehouse	Data warehouse
Primary storage	Object storage, raw files	Object storage + table format layer	Proprietary columnar store (often)
Schema	Schema-on-read; weak enforcement	Schema evolution with enforced table metadata	Schema-on-write; strong typing
Transactions	Usually none at file set level	ACID at table level (format-dependent)	Full ACID
Typical workloads	Ingest everything; ML feature stores	Unified analytics + ML on same tables	BI SQL, curated marts
Cost profile	Lowest storage $	Low storage + compute separation	Higher platform $
Openness	Open files	Open files + open formats	Often proprietary

Lakehouse vs data lake: A data lake optimizes for landing raw data cheaply. A lakehouse adds table contracts — you agree on what events means, which files are visible, and which commits are atomic. Without that layer, agents and analysts inherit path archaeology.

Lakehouse vs data warehouse: Modern cloud warehouses increasingly support external tables and Iceberg/Delta reads, so the boundary blurs. The architectural intent still differs: a lakehouse centers on open storage that multiple engines can read; a traditional warehouse centers on a single optimized SQL runtime. Many teams run both (lakehouse for raw and ML zones, warehouse for certified marts), which is why data catalog and semantic layer work spans both.

3. Open table formats: the lakehouse engine room

Lakehouse is not one product — it is a pattern built on open table formats (sometimes called table formats or lakehouse formats):

Format	Origin / ecosystem	Strengths	Common engines
Apache Iceberg	Netflix → Apache; vendor-neutral	Hidden partitioning, time travel, broad engine support	Spark, Trino, Flink, Snowflake, BigQuery
Delta Lake	Databricks → Linux Foundation	Unified batch/streaming, tight Spark integration	Spark, Databricks SQL, some external readers
Apache Hudi	Uber → Apache	Incremental upserts, streaming ingest	Spark, Flink, Hive

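All three solve a similar problem: object storage is not a database. An object store offers no atomic commit across multiple files, so a reader that lists a prefix while a writer is mid-job can see a partial or inconsistent set of files. Table formats maintain a metadata layer (snapshots and manifests in Iceberg, the transaction log in Delta Lake, the timeline in Hudi) so queries see consistent snapshots and writers can commit atomically.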
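A toy model can make the metadata layer concrete. The sketch below is not how Iceberg, Delta Lake, or Hudi are implemented; it is a minimal, single-process illustration with hypothetical names (ToyTable, Snapshot, commit) of two ideas from the paragraph above: readers resolve a table to the file list of one snapshot, and a writer's commit succeeds only if the table has not changed since the writer started.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: FrozenSet[str]  # data files visible in this table version


class CommitConflict(Exception):
    """Raised when another writer committed first."""


@dataclass
class ToyTable:
    # Hypothetical stand-in for a table format's metadata log.
    snapshots: List[Snapshot] = field(
        default_factory=lambda: [Snapshot(0, frozenset())]
    )

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def files_as_of(self, snapshot_id: int) -> FrozenSet[str]:
        # "Time travel": read the file list of an older version.
        return self.snapshots[snapshot_id].files

    def commit(self, base_id: int, added: Iterable[str],
               removed: Iterable[str] = ()) -> Snapshot:
        # Optimistic concurrency: succeed only if the writer started
        # from the snapshot that is still current.
        if base_id != self.current().snapshot_id:
            raise CommitConflict(f"base snapshot {base_id} is stale")
        files = (self.current().files - frozenset(removed)) | frozenset(added)
        snapshot = Snapshot(base_id + 1, files)
        self.snapshots.append(snapshot)
        return snapshot


table = ToyTable()
reader_view = table.current()                    # a reader pins snapshot 0
table.commit(0, ["part-0001.parquet"])           # stream ingest commits first
try:
    table.commit(0, ["backfill-0001.parquet"])   # backfill used a stale base
    conflict = False
except CommitConflict:
    conflict = True
    # Retry on top of the latest snapshot.
    table.commit(table.current().snapshot_id, ["backfill-0001.parquet"])
```

The pinned reader keeps seeing the empty file list of snapshot 0, while the backfill is forced to retry rather than silently overwrite the ingest commit. Real formats get the same effect from an atomic operation in a catalog or in the storage layer, which this sketch does not model.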

For data engineering agents, this matters practically. An agent generating SQL against iceberg.analytics.orders must understand snapshot isolation, partition transforms (day vs hour buckets), and whether MERGE is supported — not just column names. Schema evolution — adding a column without rewriting all history — is a lakehouse benefit that also breaks naive “dump entire DDL into the prompt” context strategies.
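To show why the partition transform matters for generated SQL, here is a small sketch of day and hour bucketing with file pruning. All names (bucket_start, DataFile, prune) and the file names are hypothetical; actual formats store this information in their own metadata structures. The point is only that the same timestamp filter touches a different number of files depending on the transform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Width of one partition bucket per transform (illustrative subset).
BUCKET_WIDTH = {"day": timedelta(days=1), "hour": timedelta(hours=1)}


def bucket_start(ts: datetime, transform: str) -> datetime:
    """Map a raw timestamp to the start of its partition bucket."""
    if transform == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if transform == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown transform: {transform}")


@dataclass(frozen=True)
class DataFile:
    path: str
    partition_start: datetime  # bucket that this file's rows fall into


def prune(files: List[DataFile], transform: str,
          start: datetime, end: datetime) -> List[DataFile]:
    """Keep only files whose bucket overlaps the half-open range [start, end)."""
    width = BUCKET_WIDTH[transform]
    return [
        f for f in files
        if f.partition_start < end and f.partition_start + width > start
    ]


base = datetime(2026, 6, 15)
# Two days of data, written once with hourly buckets and once with daily buckets.
hourly_files = [
    DataFile(f"hour-{i:02d}.parquet",
             bucket_start(base + timedelta(hours=i, minutes=5), "hour"))
    for i in range(48)
]
daily_files = [
    DataFile(f"day-{d}.parquet",
             bucket_start(base + timedelta(days=d, hours=3), "day"))
    for d in range(2)
]

query_start = datetime(2026, 6, 15, 10, 30)
query_end = datetime(2026, 6, 15, 12, 15)
hourly_hits = prune(hourly_files, "hour", query_start, query_end)
daily_hits = prune(daily_files, "day", query_start, query_end)
```

Here the filter selects 3 of 48 hourly files or 1 of 2 daily files. A query with no timestamp filter would select every file in both layouts, which is the full-scan case an agent needs to avoid.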

4. Why organizations adopt lakehouse architecture

Common triggers:

Copy fatigue — ETL copies raw lake → warehouse daily; storage and sync cost dominate.
ML + SQL convergence — Data scientists want Python/Spark on the same tables analysts query in SQL.
Governance on raw zones — Bronze layers need auditability, not just a dumping ground.
Vendor exit strategy — Open formats reduce lock-in vs proprietary storage only.

Failure modes when the pattern is adopted without discipline:

“Lakehouse” label on a raw bucket — no table format and no governance, which is worse than an honest data lake.
Gold-layer sprawl — dozens of “curated” tables with overlapping grains and no semantic layer → metric drift returns.
Engine fragmentation — Iceberg table readable in Spark but not tested in the BI tool users actually run.

Adoption signals that indicate maturity: documented medallion or domain layers, catalog integration, format choice written down, and SLAs on Silver/Gold freshness — not just a POC notebook.

5. Medallion architecture on a lakehouse

Medallion architecture (Bronze / Silver / Gold) is a layering convention popularized in Databricks documentation and widely reused:

Layer	Typical contents	Quality bar
Bronze	Raw ingest, minimal transform	Append-only, schema as landed
Silver	Cleaned, conformed, deduplicated	Typed columns, standard keys
Gold	Business aggregates, KPI-ready	Certified metrics, star/summary tables

Medallion is not synonymous with lakehouse — you can run a lakehouse without medallion labels, or label medallion layers inside a warehouse. But on lakehouses, medallion gives agents and humans a navigable hierarchy: an agent answering executive KPI questions should prefer Gold definitions; an engineer debugging ingest should inspect Bronze paths and Silver rejects.

Agents that treat all layers as interchangeable produce plausible wrong numbers — e.g., counting raw click events in Bronze when Gold already deduplicates sessions.
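A tiny example shows how the same question yields different numbers per layer. The rows and field names below are invented for illustration, and the deduplication rules (drop repeated event_id values for Silver, count distinct session_id values for Gold) are assumed stand-ins for whatever a real pipeline defines.

```python
# Bronze: raw click events as landed, including a duplicate delivery.
bronze_clicks = [
    {"session_id": "s1", "event_id": "e1"},
    {"session_id": "s1", "event_id": "e1"},  # duplicate of the row above
    {"session_id": "s1", "event_id": "e2"},
    {"session_id": "s2", "event_id": "e3"},
]

# Silver: one row per event_id (assumed deduplication rule).
silver_clicks = list({row["event_id"]: row for row in bronze_clicks}.values())

# Gold: one entry per session (assumed business grain).
gold_sessions = {row["session_id"] for row in silver_clicks}

bronze_count = len(bronze_clicks)
silver_count = len(silver_clicks)
gold_count = len(gold_sessions)
print(bronze_count, silver_count, gold_count)  # 4 3 2
```

All three numbers are correct answers to some question, but only one of them matches "how many sessions did we have", so the layer an agent picks determines whether the result is right.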

6. Lakehouse and AI agents — context requirements

Text-to-SQL and data engineering agents need more than column lists on lakehouse estates:

Context gap	Symptom	What to inject
Table vs path confusion	SQL targets `s3://...` instead of registered table	Catalog entries with physical + logical names
Layer ambiguity	Revenue from Bronze vs Gold	Layer tags, certified table list
Format capabilities	`MERGE` against read-only snapshot	Engine + format feature matrix
Partition pruning	Full table scan, timeout	Partition spec, required filters
Time travel	Mixing snapshots in one report	Snapshot ID or “as of” policy

This is why schema linking on lakehouse data is harder than on a 50-table Postgres instance: more tables, more layers, more engines — and more ways to be syntactically correct but semantically wrong.

Tools in the ecosystem (Spark, Trino, Databricks, Snowflake Iceberg tables) each expose a slightly different SQL dialect and different transaction boundaries. An agent without engine-scoped context may generate SQL that is valid in Spark and send it to a Trino cluster, where it fails or behaves differently.
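One way to picture "lakehouse-aware context" is as a small record per table plus a review step that runs before generated SQL is executed. The sketch below is a deliberately naive illustration: TableContext, review_sql, the table names, and the field values are all hypothetical, and the checks use plain string matching where a real system would parse the SQL.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TableContext:
    logical_name: str                  # name registered in the catalog
    physical_path: str                 # where the files live
    layer: str                         # "bronze", "silver", or "gold"
    certified: bool                    # approved for executive metrics
    required_filters: Tuple[str, ...]  # partition columns queries must filter on
    engine: str                        # engine this table is served through


def review_sql(sql: str, ctx: TableContext,
               generated_dialect: str, for_kpi: bool) -> List[str]:
    """Return human-readable issues; an empty list means no issue was found."""
    normalized = " ".join(sql.lower().split())
    where_clause = normalized.split(" where ", 1)[1] if " where " in normalized else ""
    issues = []
    if "s3://" in normalized:
        issues.append("targets a storage path instead of a registered table")
    if for_kpi and not (ctx.layer == "gold" and ctx.certified):
        issues.append(f"{ctx.logical_name} is not a certified gold table")
    for column in ctx.required_filters:
        if column.lower() not in where_clause:
            issues.append(f"missing partition filter on {column}")
    if generated_dialect != ctx.engine:
        issues.append(f"SQL written for {generated_dialect}, table served by {ctx.engine}")
    return issues


bronze_ctx = TableContext("analytics.events_raw", "s3://events/", "bronze",
                          False, ("event_date",), "trino")
gold_ctx = TableContext("analytics.weekly_active_users", "s3://marts/wau/", "gold",
                        True, ("week_start",), "trino")

risky = review_sql("SELECT COUNT(*) FROM analytics.events_raw",
                   bronze_ctx, generated_dialect="spark", for_kpi=True)
clean = review_sql(
    """SELECT country, wau
       FROM analytics.weekly_active_users
       WHERE week_start = DATE '2026-06-15'""",
    gold_ctx, generated_dialect="trino", for_kpi=True)
```

The first query is syntactically fine yet collects three issues (wrong layer, no partition filter, wrong dialect); the second passes. None of these checks is possible from column lists alone, which is the gap the table above describes.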

7. Evaluation checklist: is your lakehouse ready for agents?

Before pointing an agent at lakehouse tables, teams can sanity-check:

Catalog coverage — Are Bronze/Silver/Gold tables registered with owners and descriptions?
Certified Gold list — Which tables are approved for executive metrics?
Semantic definitions — Do top KPIs link to Gold tables with documented grain?
Reference SQL library — Are vetted queries tagged by layer and engine?
Format documentation — Iceberg vs Delta vs Hudi per domain — and which engines are supported?
Freshness SLAs — When did Silver last succeed? Is Gold stale?

Gaps in the first three items (catalog coverage, certified Gold list, semantic definitions) predict the same class of failures as text-to-SQL on large warehouses, with layer confusion added on top.
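The checklist can be turned into a mechanical pass over catalog metadata. The sketch below assumes a hypothetical TableRecord shape and made-up sample tables; a real catalog will expose different fields. It covers five of the six checks and leaves out the reference SQL library, which needs its own inventory.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TableRecord:
    name: str
    layer: str                                # "bronze", "silver", or "gold"
    owner: Optional[str]
    description: Optional[str]
    certified: bool = False
    grain: Optional[str] = None               # e.g. "one row per country per week"
    table_format: Optional[str] = None        # e.g. "iceberg"
    hours_since_refresh: Optional[float] = None
    freshness_sla_hours: Optional[float] = None


def is_fresh(t: TableRecord) -> bool:
    return (t.hours_since_refresh is not None
            and t.freshness_sla_hours is not None
            and t.hours_since_refresh <= t.freshness_sla_hours)


def readiness_report(tables: List[TableRecord]) -> Dict[str, bool]:
    certified_gold = [t for t in tables if t.layer == "gold" and t.certified]
    curated = [t for t in tables if t.layer in ("silver", "gold")]
    return {
        "catalog_coverage": all(bool(t.owner and t.description) for t in tables),
        "certified_gold_list": bool(certified_gold),
        "semantic_definitions": bool(certified_gold)
                                and all(bool(t.grain) for t in certified_gold),
        "format_documentation": all(bool(t.table_format) for t in tables),
        "freshness_slas": all(is_fresh(t) for t in curated),
    }


tables = [
    TableRecord("bronze.events_raw", "bronze", "ingest-team", "Raw events",
                table_format="iceberg"),
    TableRecord("silver.events", "silver", "data-eng", "Cleaned events",
                table_format="iceberg", hours_since_refresh=2, freshness_sla_hours=6),
    TableRecord("gold.weekly_active_users", "gold", None, "WAU by country",
                certified=True, table_format="iceberg",
                hours_since_refresh=30, freshness_sla_hours=24),
]
report = readiness_report(tables)
```

In this sample the Gold table has no owner, no documented grain, and is past its freshness SLA, so three checks fail even though the table is marked certified.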

8. When a lakehouse is enough — and when to keep a warehouse

Lakehouse-first fits when:

Open storage + multi-engine access is a strategic requirement
ML and SQL share large raw/intermediate datasets
Team can invest in table format operations (compaction, retention, ACLs on storage)

Keep a dedicated warehouse (or warehouse marts) when:

Sub-second BI on heavily modeled dimensions is the primary SLA
Organization standardizes on one vendor SQL runtime with mature workload management
Legal or compliance requirements mandate proprietary storage controls that are hard to map to object-store ACLs alone

Hybrid architectures are normal — lakehouse for ingest and ML features, warehouse for curated BI — bridged by semantic layers and catalogs.

Open-source data engineering agents that emphasize cross-stack context aim to span both sides without forcing a rip-and-replace narrative — connecting to warehouse adapters and lakehouse tables through a unified catalog view rather than betting on one storage religion.

Conclusion

A lakehouse is best understood as governed tables on open storage — not a marketing sticker on a raw bucket. Open table formats supply the ACID and schema machinery; medallion and domain layering supply organizational clarity. For AI agents, lakehouse complexity moves failures from syntax to layer, format, and engine context — the same shift that separates demo-grade text-to-SQL from production-grade data engineering agents.

Frequently asked questions

Is Databricks a lakehouse?

Databricks popularized the term and ships a tightly integrated lakehouse stack (Delta Lake + Spark + SQL warehouse). Databricks is a vendor implementation, not the definition of lakehouse. Teams run lakehouse patterns on AWS, GCP, Azure, and OSS stacks without Databricks — using Iceberg or Hudi with Trino, Flink, or Snowflake external tables.

Lakehouse vs data mesh — are they the same?

No. Lakehouse describes how data is stored and queried (storage + table format + SQL). Data mesh describes how ownership and delivery are organized (domain-oriented data products). A mesh domain can expose a lakehouse table as its product API — complementary ideas.

Do I need all three formats (Iceberg, Delta, Hudi)?

No. Most organizations standardize on one primary format per lake (often Iceberg or Delta) to reduce operational sprawl. Multi-format estates usually reflect mergers or team autonomy — agents and catalogs should document which format applies where.

Can text-to-SQL work on a lakehouse without a catalog?

For small POC schemas, yes — paste DDL and pray. For production lakehouses with hundreds of tables across layers, without a catalog and semantic definitions, accuracy collapses. Invest in catalog + certified Gold + reference SQL before blaming the model.

What changed in 2024–2026 for lakehouses?

Engine support for Iceberg converged across major warehouses and query engines; Iceberg became the default “neutral” format in many greenfield architectures. Agent and copilot vendors began advertising lakehouse awareness — but awareness in marketing rarely equals layer-aware context in the product. Verify with your own tables, not slide decks.

What is a data engineering agent? — why persistent context matters on complex estates
What is a data catalog? — discovery layer lakehouses depend on
What is a semantic layer? — governed metrics atop Gold tables

Disclosure: Datus is a data engineering agent platform. This glossary entry explains lakehouse architecture as a general concept and how cross-stack agents approach lakehouse context — alongside other tools and architectures in the category.