What Is a Data Lakehouse?
A data lakehouse is an architecture that stores data in open file formats on cheap object storage (Amazon S3, Google Cloud Storage, Azure Data Lake Storage), while adding a structured table layer on top that gives you transactions, schema enforcement, and fast query performance. You get the cost profile of a data lake and the reliability of a data warehouse in a single system.
The term was first formalized in a 2020 paper from UC Berkeley and Databricks, which argued that the traditional two-tier approach (raw lake for ML, separate warehouse for BI) was causing real problems: data duplication, staleness, high cost, and brittle pipelines. The lakehouse collapses those tiers into one.
The Three Layers
Every lakehouse has three functional layers that work together. Understanding what each one does makes the rest of the architecture click.
(Databases, APIs, Streams)"] --> B["Storage Layer
Object Storage: S3 / GCS / ADLS"] B --> C["Table Format Layer
Apache Iceberg — Metadata + Manifests + Data Files"] C --> D["Catalog Layer
Apache Polaris · AWS Glue · Project Nessie"] D --> E["Query Engines
Dremio · Spark · Trino · Athena · Snowflake"] E --> F["Consumers
BI Tools · AI Agents · Data Science · Applications"]
Storage Layer
Object storage holds the actual data in columnar file formats, primarily Apache Parquet. It is cheap, durable, and serverless: you pay for what you store, not for idle compute. Most organizations already have petabytes of raw data sitting here.
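To make the layer concrete, here is a minimal sketch of what actually lands in object storage: a columnar Parquet file written to S3 with PyArrow. The bucket name, key path, and region are placeholders, not anything prescribed by the architecture.

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.fs as fs

# A small batch of records in columnar form.
orders = pa.table({
    "order_id": [1001, 1002, 1003],
    "amount": [19.99, 5.49, 102.00],
})

# Object storage just sees an opaque file; nothing at this layer knows
# about tables, schemas, or transactions yet.
s3 = fs.S3FileSystem(region="us-east-1")
pq.write_table(orders, "my-bucket/raw/orders/part-00000.parquet", filesystem=s3)
```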
Table Format Layer
This is where the lakehouse diverges from a plain data lake. A table format sits between the raw files and the query engines, tracking exactly which files belong to which table, what the schema is, and what changed in each transaction. Apache Iceberg is the dominant open table format for this layer. It provides ACID transactions, schema evolution, time travel, and a consistent view of data across every engine that reads from the table.
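As a hedged sketch of what that buys you in practice, here are those three capabilities in Spark SQL, assuming a Spark session already configured with an Iceberg catalog named demo; the catalog and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

# ACID write: the INSERT commits as a complete new snapshot or not at all.
spark.sql("""CREATE TABLE IF NOT EXISTS demo.analytics.orders
             (order_id BIGINT, amount DOUBLE) USING iceberg""")
spark.sql("INSERT INTO demo.analytics.orders VALUES (1001, 19.99)")

# Schema evolution: a metadata-only change, no data files are rewritten.
spark.sql("ALTER TABLE demo.analytics.orders ADD COLUMN currency STRING")

# Time travel: inspect the snapshot history, then query as of a point in time.
spark.sql("SELECT snapshot_id, committed_at FROM demo.analytics.orders.snapshots").show()
spark.sql("SELECT * FROM demo.analytics.orders TIMESTAMP AS OF '2025-01-01 00:00:00'").show()
```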
Catalog Layer
The catalog is the registry that maps table names to their metadata locations. When a query engine wants to read analytics.orders, it asks the catalog where the current metadata file lives. Catalogs like Apache Polaris, AWS Glue, and Project Nessie expose the Iceberg REST Catalog API, which means any compatible engine can connect to any catalog using the same standard interface.
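Here is a sketch of that handshake using PyIceberg against a generic Iceberg REST catalog; the catalog name, URI, and warehouse values are placeholders that would come from your own deployment.

```python
from pyiceberg.catalog import load_catalog

# Any REST-compatible catalog (Polaris, Nessie, a cloud vendor's endpoint)
# can sit behind this same configuration.
catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api/catalog",
        "warehouse": "analytics",
    },
)

# The catalog resolves the table name to its current metadata file; from
# there, planning a read requires no vendor-specific API.
table = catalog.load_table("analytics.orders")
print(table.metadata_location)
```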
How a Data Lakehouse Works in Practice
When a pipeline writes new data, it creates Parquet files in object storage and records those files in Iceberg metadata. That metadata update is atomic: either the entire commit succeeds or none of it does. Readers always see a consistent snapshot of the table, never a half-written state. This is what makes the lakehouse reliable enough for production BI workloads, which was not true of raw data lakes.
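A minimal sketch of that write path with PyIceberg, reusing the hypothetical catalog and table from the sketch above and assuming the appended batch matches the table schema:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

table = load_catalog("lakehouse").load_table("analytics.orders")

new_orders = pa.table({"order_id": [2001, 2002], "amount": [42.00, 7.25]})

# append() writes new Parquet data files, then registers them in a new
# snapshot via one atomic metadata swap: readers see the old snapshot or
# the new one, never a half-written state.
table.append(new_orders)
```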
When a query engine (Spark, Trino, Dremio, Athena) reads a table, it fetches the current metadata from the catalog, identifies which files match the query filters, and reads only those files. The table format's per-file statistics let the engine skip large portions of the dataset without reading them, which is how lakehouses compete on query speed with traditional warehouses.
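Again as a sketch under the same hypothetical catalog, this is what metadata pruning looks like from PyIceberg; the filter column is illustrative.

```python
from pyiceberg.catalog import load_catalog

table = load_catalog("lakehouse").load_table("analytics.orders")

# The scan is planned from metadata alone: per-file min/max statistics let
# Iceberg discard data files that cannot possibly match the filter.
scan = table.scan(row_filter="amount >= 100.0")
for task in scan.plan_files():
    print(task.file.file_path)  # only the files that survived pruning

# Only those surviving files are actually fetched and decoded.
df = scan.to_arrow()
```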
Data Lakehouse vs Data Lake vs Data Warehouse
The three architectures solve different versions of the same problem. Here is how they compare across the dimensions that matter most in practice.
| Dimension | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage format | Open (raw files) | Proprietary | Open (Parquet + table format) |
| ACID transactions | No | Yes | Yes (via Iceberg) |
| Schema enforcement | Read-time only | Write-time | Write-time + evolution |
| Time travel | No | Limited | Yes (snapshot history) |
| Multi-engine access | Yes (raw files) | No (proprietary API) | Yes (REST Catalog standard) |
| ML / AI workloads | Yes | Difficult | Yes |
| Storage cost | Low | High | Low |
| BI / SQL query speed | Slow | Fast | Fast (with optimization) |
| Vendor lock-in | Low | High | Low (open standards) |
For a deeper look at this comparison, see the full comparison guide.
The Role of Open Table Formats
Open table formats are what make the lakehouse architecture real. Without them, you have a raw data lake with all its consistency problems. With them, you have a governed, queryable, multi-engine table layer. Three formats dominate today:
- Apache Iceberg — the broadest multi-engine support, the most active cloud vendor adoption, and the most governance-focused design. This is the right default for new projects.
- Delta Lake — tightly integrated with Databricks and the Spark ecosystem. Strong choice if Databricks is your primary compute.
- Apache Hudi — designed for high-frequency key-based upserts and native incremental processing. Used heavily in Spark-centric streaming pipelines.
For a complete side-by-side breakdown, see the open table format comparison.
When a Data Lakehouse Is the Right Choice
A lakehouse makes sense when you need more than one of the following from the same data: SQL analytics, machine learning training data, real-time streaming, and AI agent access. If your workloads are entirely SQL-based and your data volume is moderate, a managed data warehouse may be simpler. If you only do ML on raw files with no SQL requirements, a plain data lake works. The lakehouse is the right call when you need all of those things without paying for two separate systems or copying data between them.
It also makes sense when vendor independence matters. Because the data lives in open formats (Parquet) in your own object storage and is governed by an open catalog API, you are not locked into any single vendor's proprietary format or query engine.
The Agentic Lakehouse
The latest evolution of the lakehouse architecture adds a governed AI access layer on top. AI agents can query your Iceberg tables through a semantic layer that translates business questions into SQL, executes them against governed data, and returns results the agent can reason over. This pattern is called the Agentic Lakehouse, and it is where lakehouse architecture is heading in 2025 and beyond.
Frequently Asked Questions
Is a data lakehouse the same as a data lake?
No. A data lake is raw file storage with no table semantics, no transactions, and no consistent query interface. A data lakehouse adds a table format layer (Apache Iceberg, Delta Lake, or Apache Hudi) that gives you ACID guarantees, schema enforcement, and fast query planning on top of the same low-cost object storage.
Do I need Apache Iceberg to build a data lakehouse?
You need a table format. Apache Iceberg is the most widely supported and governance-friendly choice, with native support from every major cloud provider and query engine. Delta Lake and Apache Hudi are alternatives, but Iceberg has the broadest multi-engine write support and an open catalog standard (the Iceberg REST Catalog).
How does a data lakehouse handle concurrent writes?
Apache Iceberg uses optimistic concurrency control. Each writer reads the current table state, makes its changes, and tries to commit a new snapshot. If another writer committed first and the changes conflict, the commit fails and the writer retries. Non-conflicting concurrent writes (to different partitions, for example) succeed without coordination.
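To make the retry loop concrete, here is a purely conceptual sketch of that pattern, not Iceberg's or PyIceberg's actual internals; catalog.current_metadata and catalog.swap_metadata are hypothetical stand-ins for the catalog's atomic metadata-pointer swap.

```python
import random
import time

def commit_with_retries(catalog, table_name, make_changes, max_retries=5):
    """Conceptual optimistic-concurrency commit loop (hypothetical catalog API)."""
    for attempt in range(max_retries):
        base = catalog.current_metadata(table_name)  # read the current table state
        proposed = make_changes(base)                # build a new snapshot from it
        # Atomic compare-and-swap: succeeds only if no other writer committed
        # since we read `base`; otherwise back off and retry.
        if catalog.swap_metadata(table_name, expected=base, new=proposed):
            return proposed
        time.sleep(random.uniform(0, 0.1 * 2 ** attempt))
    raise RuntimeError("gave up after repeated commit conflicts")
```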
What query engines work with a data lakehouse?
Any engine that implements the Iceberg table format or REST Catalog API: Apache Spark, Apache Flink, Trino, Dremio, AWS Athena, Google BigQuery, Snowflake, DuckDB, StarRocks, Apache Doris, and more. The open standard is what makes multi-engine access practical rather than theoretical.
Go Deeper
- Apache Iceberg Explained — how the table format works under the hood
- Apache Iceberg Architecture — snapshots, manifests, and the metadata tree
- What Is an Agentic Lakehouse — AI agents on governed lakehouse data
- Apache Iceberg Knowledge Base — 115 technical reference pages