What Is a Data Lakehouse?

A data lakehouse is an architecture that stores data in open file formats on cheap object storage (Amazon S3, Google Cloud Storage, Azure Data Lake Storage), while adding a structured table layer on top that gives you transactions, schema enforcement, and fast query performance. You get the cost profile of a data lake and the reliability of a data warehouse in a single system.

The term was first formalized in a 2020 paper from UC Berkeley and Databricks, which argued that the traditional two-tier approach (raw lake for ML, separate warehouse for BI) was causing real problems: data duplication, staleness, high cost, and brittle pipelines. The lakehouse collapses those tiers into one.

The Three Layers

Every lakehouse has three functional layers that work together. Understanding what each one does makes the rest of the architecture click.

The flow of data through the layers looks like this:

Source Systems (databases, APIs, streams)
  → Storage Layer: object storage (S3 / GCS / ADLS)
  → Table Format Layer: Apache Iceberg (metadata + manifests + data files)
  → Catalog Layer: Apache Polaris · AWS Glue · Project Nessie
  → Query Engines: Dremio · Spark · Trino · Athena · Snowflake
  → Consumers: BI tools · AI agents · data science · applications

Storage Layer

Object storage holds the actual data in columnar file formats, primarily Apache Parquet. It is cheap, durable, and serverless. You pay for what you store, not for idle compute. Most organizations already have petabytes here sitting in raw form.

Table Format Layer

This is where the lakehouse diverges from a plain data lake. A table format sits between the raw files and the query engines, tracking exactly which files belong to which table, what the schema is, and what changed in each transaction. Apache Iceberg is the dominant open table format for this layer. It provides ACID transactions, schema evolution, time travel, and a consistent view of data across every engine that reads from the table.
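To make the idea concrete, here is a deliberately simplified sketch (not Iceberg's actual implementation) of what a table format tracks: each commit produces an immutable snapshot recording the table's file list and schema, which is what enables consistent reads, schema evolution, and time travel. All names here are illustrative.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple   # immutable set of data file paths in this table version
    schema: tuple       # column names at the time of the snapshot

@dataclass
class Table:
    snapshots: list = field(default_factory=list)

    def commit(self, data_files, schema):
        # Each commit appends a new immutable snapshot; old ones are retained.
        snap = Snapshot(len(self.snapshots) + 1, tuple(data_files), tuple(schema))
        self.snapshots.append(snap)
        return snap

    def current(self):
        return self.snapshots[-1]

    def time_travel(self, snapshot_id):
        # Reading an old version = reading the file list that snapshot recorded.
        return next(s for s in self.snapshots if s.snapshot_id == snapshot_id)

orders = Table()
orders.commit(["s3://lake/orders/f1.parquet"], ["id", "amount"])
orders.commit(["s3://lake/orders/f1.parquet", "s3://lake/orders/f2.parquet"],
              ["id", "amount", "currency"])   # schema evolution: column added

print(orders.current().data_files)    # both files
print(orders.time_travel(1).schema)   # ('id', 'amount')
```

Real Iceberg stores this information as metadata files and manifests in object storage rather than in memory, but the shape of the idea is the same.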

Catalog Layer

The catalog is the registry that maps table names to their metadata locations. When a query engine wants to read analytics.orders, it asks the catalog where the current metadata file lives. Catalogs like Apache Polaris, AWS Glue, and Project Nessie expose the Iceberg REST Catalog API, which means any compatible engine can connect to any catalog using the same standard interface.
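As one illustration, a client library such as PyIceberg can point at any REST-compatible catalog through a small config file. The endpoint, warehouse bucket, and credential below are placeholders, not real values:

```yaml
# Hypothetical .pyiceberg.yaml — values are placeholders
catalog:
  lakehouse:
    uri: http://localhost:8181        # REST Catalog endpoint (Polaris, Nessie, ...)
    warehouse: s3://my-lake/warehouse
    credential: <client-id>:<client-secret>
```

With a config like this, loading `analytics.orders` is a catalog lookup: the client asks the REST endpoint for the table's current metadata location, then reads the data files that metadata points to.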

How a Data Lakehouse Works in Practice

When a pipeline writes new data, it creates Parquet files in object storage and records those files in Iceberg metadata. That metadata update is atomic: either the entire commit succeeds or none of it does. Readers always see a consistent snapshot of the table, never a half-written state. This is what makes the lakehouse reliable enough for production BI workloads, which was not true of raw data lakes.

When a query engine (Spark, Trino, Dremio, Athena) reads a table, it fetches the current metadata from the catalog, identifies which files match the query filters, and reads only those files. The table format's per-file statistics let the engine skip large portions of the dataset without reading them, which is how lakehouses compete on query speed with traditional warehouses.
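The pruning step can be sketched in a few lines. This is a toy model of the min/max statistics a table format keeps per file (real manifests store much richer stats); the file names and column are made up for illustration:

```python
# Per-file column statistics, like those a table format records in its manifests.
file_stats = [
    {"path": "f1.parquet", "order_date_min": "2024-01-01", "order_date_max": "2024-03-31"},
    {"path": "f2.parquet", "order_date_min": "2024-04-01", "order_date_max": "2024-06-30"},
    {"path": "f3.parquet", "order_date_min": "2024-07-01", "order_date_max": "2024-09-30"},
]

def prune(stats, lo, hi):
    # Keep only files whose [min, max] range overlaps the query's filter range.
    return [f["path"] for f in stats
            if f["order_date_max"] >= lo and f["order_date_min"] <= hi]

# WHERE order_date BETWEEN '2024-05-01' AND '2024-05-31' touches one file:
print(prune(file_stats, "2024-05-01", "2024-05-31"))  # ['f2.parquet']
```

Because the statistics live in metadata, the engine decides which files to skip before touching object storage at all, which is where most of the speedup comes from.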

Data Lakehouse vs Data Lake vs Data Warehouse

The three architectures solve different versions of the same problem. Here is how they compare across the dimensions that matter most in practice.

At a glance:

Data Warehouse — proprietary storage · schema-on-write · fast SQL queries · high cost at scale · hard to use with ML
Data Lake — open object storage · schema-on-read · slow / inconsistent queries · low storage cost · good for raw ML data
Data Lakehouse — open object storage · schema-on-write via table format · fast SQL + ML in one place · low cost with warehouse reliability · unified BI + ML + AI agents
| Dimension | Data Lake | Data Warehouse | Data Lakehouse |
| --- | --- | --- | --- |
| Storage format | Open (raw files) | Proprietary | Open (Parquet + table format) |
| ACID transactions | No | Yes | Yes (via Iceberg) |
| Schema enforcement | Read-time only | Write-time | Write-time + evolution |
| Time travel | No | Limited | Yes (snapshot history) |
| Multi-engine access | Yes (raw files) | No (proprietary API) | Yes (REST Catalog standard) |
| ML / AI workloads | Yes | Difficult | Yes |
| Storage cost | Low | High | Low |
| BI / SQL query speed | Slow | Fast | Fast (with optimization) |
| Vendor lock-in | Low | High | Low (open standards) |

For a deeper look at this comparison, see the full comparison guide.

The Role of Open Table Formats

Open table formats are what make the lakehouse architecture real. Without them, you have a raw data lake with all its consistency problems. With them, you have a governed, queryable, multi-engine table layer. Three formats dominate today: Apache Iceberg, Delta Lake, and Apache Hudi.

For a complete side-by-side breakdown, see the open table format comparison.

When a Data Lakehouse Is the Right Choice

A lakehouse makes sense when you need more than one of the following from the same data: SQL analytics, machine learning training data, real-time streaming, and AI agent access. If your workloads are entirely SQL-based and your data volume is moderate, a managed data warehouse may be simpler. If you only do ML on raw files with no SQL requirements, a plain data lake works. The lakehouse is the right call when you need all of those things without paying for two separate systems or copying data between them.

It also makes sense when vendor independence matters. Because the data lives in open formats (Parquet) in your own object storage and is governed by an open catalog API, you are not locked into any single vendor's proprietary format or query engine.

The Agentic Lakehouse

The latest evolution of the lakehouse architecture adds a governed AI access layer on top. AI agents can query your Iceberg tables through a semantic layer that translates business questions into SQL, executes them against governed data, and returns results the agent can reason over. This pattern is called the Agentic Lakehouse, and it is where lakehouse architecture is heading in 2025 and beyond.

Frequently Asked Questions

Is a data lakehouse the same as a data lake?

No. A data lake is raw file storage with no table semantics, no transactions, and no consistent query interface. A data lakehouse adds a table format layer (Apache Iceberg, Delta Lake, or Apache Hudi) that gives you ACID guarantees, schema enforcement, and fast query planning on top of the same low-cost object storage.

Do I need Apache Iceberg to build a data lakehouse?

You need a table format. Apache Iceberg is the most widely supported and governance-friendly choice, with native support from every major cloud provider and query engine. Delta Lake and Apache Hudi are alternatives, but Iceberg has the broadest multi-engine write support and an open catalog standard (the Iceberg REST Catalog).

How does a data lakehouse handle concurrent writes?

Apache Iceberg uses optimistic concurrency control. Each writer reads the current table state, makes its changes, and tries to commit a new snapshot. If another writer committed first and the changes conflict, the commit fails and the writer retries. Non-conflicting concurrent writes (to different partitions, for example) succeed without coordination.
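The retry loop can be sketched as a compare-and-swap on the table's snapshot pointer. This is a toy model, not Iceberg's code: in a real lakehouse the atomic swap happens in the catalog, and a conflicting writer re-reads the new metadata before retrying. All names are illustrative.

```python
import threading

class TableCommitter:
    """Toy optimistic-concurrency commit: compare-and-swap on the snapshot id."""

    def __init__(self):
        self.snapshot_id = 0
        self._lock = threading.Lock()  # stands in for the catalog's atomic swap

    def try_commit(self, expected_snapshot_id):
        # Succeeds only if no other writer committed since we read the table.
        with self._lock:
            if self.snapshot_id != expected_snapshot_id:
                return False           # conflict: caller must re-read and retry
            self.snapshot_id += 1
            return True

    def commit_with_retry(self, max_retries=3):
        for _ in range(max_retries):
            base = self.snapshot_id    # read current table state
            # ... a real writer would re-plan its changes against `base` here ...
            if self.try_commit(base):
                return self.snapshot_id
        raise RuntimeError("too many conflicting commits")

table = TableCommitter()
table.commit_with_retry()
table.commit_with_retry()
print(table.snapshot_id)  # 2
```

The key property is that a conflicting commit fails cleanly rather than corrupting the table, so writers can always retry from the latest snapshot.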

What query engines work with a data lakehouse?

Any engine that implements the Iceberg table format or REST Catalog API: Apache Spark, Apache Flink, Trino, Dremio, AWS Athena, Google BigQuery, Snowflake, DuckDB, StarRocks, Apache Doris, and more. The open standard is what makes multi-engine access practical rather than theoretical.

Go Deeper

📚 Go Deeper on Apache Iceberg

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.