Apache Iceberg vs Delta Lake vs Apache Hudi

Apache Iceberg, Delta Lake, and Apache Hudi are the three formats that have become the standard choices for mutable analytical tables in object storage. Each one solves the same core problem (consistent, updatable tables on cheap storage) but with different design priorities that make them better fits for different workloads.

This comparison is vendor-neutral. The goal is to help you pick the right format based on what your workload actually needs, not based on which vendor talks loudest.

Origins and Governance

| Format | Created by | Year open-sourced | Governance | Primary design goal |
|---|---|---|---|---|
| Apache Iceberg | Netflix | 2018 | Apache Software Foundation | Multi-engine interoperability and open standards |
| Delta Lake | Databricks | 2019 | Linux Foundation | Reliable data lake on top of Spark |
| Apache Hudi | Uber | 2017 (entered ASF incubator 2019) | Apache Software Foundation | High-frequency upserts and incremental processing |

How Each Format Tracks Table State

The transaction log or metadata model is the most fundamental architectural difference between the three formats.

```mermaid
graph TD
    subgraph ICE["Apache Iceberg"]
        direction TB
        I1["metadata.json (current snapshot pointer)"]
        I2["Manifest List (snapshot state)"]
        I3["Manifest Files (per-file stats)"]
        I4["Parquet Data Files"]
        I1 --> I2 --> I3 --> I4
    end
    subgraph DL["Delta Lake"]
        direction TB
        D1["_delta_log/ (JSON commit files + checkpoints)"]
        D2["Parquet Checkpoint (periodic state snapshot)"]
        D3["Parquet Data Files"]
        D1 --> D2
        D1 --> D3
    end
    subgraph HUDI["Apache Hudi"]
        direction TB
        H1[".hoodie/ timeline (commit files, clean files, compaction)"]
        H2["Base Files (Parquet)"]
        H3["Delta Log Files (MoR only)"]
        H1 --> H2
        H1 --> H3
    end
```

Iceberg builds an immutable tree per snapshot. Each snapshot points to a manifest list that summarizes all the manifests, which in turn list the data files with per-column statistics. Readers always start from a complete, self-describing snapshot without replaying a log.
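
To make this concrete, here is a minimal PyIceberg sketch that walks the snapshot structure. It assumes a catalog named demo is already configured (for example in ~/.pyiceberg.yaml) and that a table db.events exists; both names are hypothetical.

```python
# Minimal sketch, assuming a configured catalog "demo" and an existing
# table "db.events" (both hypothetical).
from pyiceberg.catalog import load_catalog

catalog = load_catalog("demo")
table = catalog.load_table("db.events")

# The current snapshot is a complete, self-describing view of the table;
# its manifest list summarizes every manifest in that snapshot's tree.
snap = table.current_snapshot()
print(snap.snapshot_id, snap.manifest_list)

# Every historical snapshot is reachable directly, without replaying a log.
for s in table.snapshots():
    print(s.snapshot_id, s.timestamp_ms)
```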

Delta Lake stores a sequential log of JSON commit files. To find the current table state, a reader either replays the entire log or starts from the latest Parquet checkpoint and replays only the commits since then. This is simpler to implement than Iceberg's snapshot tree, but it adds I/O overhead at high commit rates, since every read must replay whatever commits have accumulated since the last checkpoint.
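
An illustrative (not production-grade) sketch of that replay model: reconstruct the live file set from the JSON commits alone. A real client would start from the latest checkpoint rather than commit zero, and the table path here is hypothetical.

```python
# Illustrative sketch of Delta's log replay; a real reader starts from the
# latest Parquet checkpoint rather than commit zero. Path is hypothetical.
import json
from pathlib import Path

log_dir = Path("/data/my_table/_delta_log")
live_files = set()

# Commit files are zero-padded, so lexicographic order == commit order.
for commit in sorted(log_dir.glob("*.json")):
    for line in commit.read_text().splitlines():
        action = json.loads(line)  # one action per line (add, remove, ...)
        if "add" in action:
            live_files.add(action["add"]["path"])
        elif "remove" in action:
            live_files.discard(action["remove"]["path"])

print(f"{len(live_files)} live data files after replay")
```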

Hudi maintains a timeline in a hidden .hoodie/ directory that records every commit, clean, compaction, and rollback as timeline actions. Hudi also stores per-record index metadata that Delta Lake and Iceberg do not, which is what enables its efficient key-based upsert capability.
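
A hedged PySpark sketch of the key-based upsert this enables (option names per Hudi's Spark datasource; the table path, name, and schema are hypothetical):

```python
# Hedged sketch of a Hudi key-based upsert via PySpark; table path,
# name, and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()
updates = spark.createDataFrame([("u1", "2025-01-02", 42)],
                                ["user_id", "ts", "score"])

(updates.write.format("hudi")
    .option("hoodie.table.name", "user_scores")
    .option("hoodie.datasource.write.recordkey.field", "user_id")  # index key
    .option("hoodie.datasource.write.precombine.field", "ts")      # latest ts wins
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://bucket/tables/user_scores"))
```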

Feature Comparison

| Feature | Apache Iceberg | Delta Lake | Apache Hudi |
|---|---|---|---|
| Time travel | Yes (snapshot ID or timestamp) | Yes (version number or timestamp) | Yes (timeline-based) |
| Schema evolution | Full (column IDs, no rewrites) | Full | Full |
| Partition evolution | Yes (no rewrites) | Partial (rewrites needed for some changes) | Limited |
| Hidden partitioning | Yes | No | No |
| Row-level deletes | Yes (CoW + MoR, positional + equality) | Yes (deletion vectors in Delta 2.0+) | Yes (native, multiple strategies) |
| Branching and tagging | Yes (table-level branches and tags) | No (catalog-level via Unity only) | No |
| Record-level indexing | Bloom filters (Puffin) | Bloom filters, Z-order stats | Bloom filter, HBase, bucket, simple index |
| Open catalog standard | REST Catalog spec (open) | Unity Catalog API (proprietary) | HMS / REST (no open spec) |
| Credential vending | Yes (via Polaris, Nessie, Glue) | Via Unity Catalog (Databricks) | No standard mechanism |
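
The time-travel row above maps to different read options in each format. A hedged Spark sketch (snapshot IDs, versions, instants, and paths are hypothetical; spark is an active SparkSession):

```python
# Hedged sketch: time travel reads in Spark for each format.
# Snapshot IDs, versions, instants, and paths are hypothetical.

# Iceberg: address a snapshot directly (or use "as-of-timestamp").
ice = (spark.read.format("iceberg")
       .option("snapshot-id", 1234567890)
       .load("db.events"))

# Delta: address a version number (or use "timestampAsOf").
dl = (spark.read.format("delta")
      .option("versionAsOf", 5)
      .load("/data/my_table"))

# Hudi: address an instant on the timeline.
hudi = (spark.read.format("hudi")
        .option("as.of.instant", "20250101123000")
        .load("s3://bucket/tables/user_scores"))
```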

Multi-Engine Support

Multi-engine support is where the formats diverge most clearly.

| Engine | Iceberg (read/write) | Delta Lake (read/write) | Hudi (read/write) |
|---|---|---|---|
| Apache Spark | Full | Full (best-in-class) | Full |
| Apache Flink | Full | Read + limited write | Full |
| Trino | Full | Read + write (connector) | Read (connector) |
| Dremio | Full (native) | Read (external table) | Limited |
| AWS Athena | Full | Full | Read |
| Google BigQuery | Full (BigLake) | No | No |
| Snowflake | Full (Iceberg tables + Open Catalog) | No | No |
| DuckDB | Read + partial write | No | No |
| PyIceberg | Full Python client | No equivalent | No equivalent |
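
The PyIceberg row deserves a concrete example: it can scan Iceberg tables without any JVM engine at all. A minimal sketch, assuming the same hypothetical catalog and table as earlier:

```python
# Minimal sketch: engine-free Iceberg read with PyIceberg.
# Catalog name, table, and columns are hypothetical.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("demo")
table = catalog.load_table("db.events")

# Manifest-level stats drive file pruning before any data is fetched.
arrow_table = table.scan(
    row_filter="event_date >= '2025-01-01'",
    selected_fields=("user_id", "event_date"),
).to_arrow()
print(arrow_table.num_rows)
```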

Delta Lake's UniForm feature (available since Delta 3.x) auto-generates Iceberg metadata alongside Delta metadata, allowing external Iceberg readers to access Delta tables in read-only mode. This is Databricks acknowledging that Iceberg's ecosystem reach is broader.
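
Enabling UniForm is a matter of table properties. A hedged Spark SQL sketch (property names per the Delta 3.x documentation; the table name is hypothetical):

```python
# Hedged sketch: turn on UniForm's Iceberg metadata generation for an
# existing Delta table. Table name is hypothetical.
spark.sql("""
    ALTER TABLE main.db.events SET TBLPROPERTIES (
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```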

Streaming and Incremental Processing

```mermaid
graph LR
    subgraph STREAM["Streaming Suitability"]
        I["Apache Iceberg<br/>Flink sink (exactly-once)<br/>Snapshot-diff incremental reads<br/>Good for streaming + batch hybrid"]
        D["Delta Lake<br/>Spark Structured Streaming<br/>Delta Change Data Feed (CDF)<br/>Best with Databricks DLT"]
        H["Apache Hudi<br/>Native incremental query by key<br/>Flink + Spark streaming<br/>Best for key-based upsert pipelines"]
    end
```

Hudi's native incremental query is more precise than Iceberg's snapshot-diff approach when you need to know exactly which record keys changed between two points in time. Iceberg's snapshot diff tells you which files changed, which is sufficient for most use cases but less granular than Hudi's per-record change tracking.
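
Side by side, the two styles look like this in Spark (instant times, snapshot IDs, and paths are hypothetical):

```python
# Hedged sketch contrasting the two incremental styles in Spark.
# Instant times, snapshot IDs, and paths are hypothetical.

# Hudi: per-record incremental pull -- records changed after a given instant.
changed_rows = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20250101000000")
    .load("s3://bucket/tables/user_scores"))

# Iceberg: snapshot-diff scan -- rows from files added between two snapshots.
appended_rows = (spark.read.format("iceberg")
    .option("start-snapshot-id", 1111111111)
    .option("end-snapshot-id", 2222222222)
    .load("db.events"))
```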

Governance and Ecosystem

Apache Iceberg has the most open governance ecosystem. The Iceberg REST Catalog specification is a published standard that any catalog can implement. Apache Polaris (co-created by Dremio and Snowflake), Project Nessie, AWS Glue, and Snowflake Open Catalog all implement this standard. Any engine that supports the REST spec connects to any of these catalogs without vendor-specific code.
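
Because the REST spec is engine- and vendor-neutral, the same client code works against any compliant catalog. A minimal PyIceberg sketch (the endpoint, credential, and warehouse values are placeholders):

```python
# Minimal sketch: one client, any REST-spec catalog (Polaris, Nessie,
# Glue, Snowflake Open Catalog). Endpoint and credentials are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "prod",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api/catalog",
        "credential": "client-id:client-secret",
        "warehouse": "analytics",
    },
)
print(catalog.list_namespaces())
```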

Delta Lake has Databricks Unity Catalog, which provides strong governance within the Databricks ecosystem. Unity is proprietary, so its governance capabilities are not available to other engines without going through Databricks.

Apache Hudi relies primarily on the Hive Metastore for catalog services and does not have an open catalog API equivalent to the Iceberg REST spec.
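
In practice that means enabling hive-sync on write so engines can discover the table through the metastore. A hedged sketch (option names per Hudi's hive-sync configuration; database, table, and path are hypothetical, and updates is the DataFrame from the earlier upsert example):

```python
# Hedged sketch: register a Hudi table in the Hive Metastore at write time.
# Database, table, and path are hypothetical; `updates` as defined earlier.
(updates.write.format("hudi")
    .option("hoodie.table.name", "user_scores")
    .option("hoodie.datasource.write.recordkey.field", "user_id")
    .option("hoodie.datasource.hive_sync.enable", "true")
    .option("hoodie.datasource.hive_sync.mode", "hms")  # talk to HMS directly
    .option("hoodie.datasource.hive_sync.database", "analytics")
    .option("hoodie.datasource.hive_sync.table", "user_scores")
    .mode("append")
    .save("s3://bucket/tables/user_scores"))
```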

Decision Framework

```mermaid
flowchart TD
    A["What is your primary requirement?"]
    A -->|"Multi-engine reads and writes<br/>OR open catalog governance<br/>OR cloud-native (AWS/GCP/Azure)"| B["Apache Iceberg"]
    A -->|"All-in on Databricks + Spark<br/>Unity Catalog for governance"| C["Delta Lake"]
    A -->|"High-frequency key-based upserts<br/>Spark-primary streaming pipelines"| D["Apache Hudi"]
    B --> E["Check: does your engine support Iceberg natively?<br/>(yes for Spark, Flink, Trino, Dremio, Athena, BigQuery, Snowflake, DuckDB)"]
    C --> F["Check: are you comfortable with Databricks as primary compute?"]
    D --> G["Check: do you need per-record incremental semantics<br/>or record-key-based indexing?"]
```

| Your situation | Best format |
|---|---|
| New project, no existing vendor commitment | Apache Iceberg |
| All-in Databricks, using Unity Catalog | Delta Lake |
| Spark-based CDC pipeline with frequent key-based updates | Apache Hudi |
| AI agent analytics on enterprise data | Apache Iceberg (Dremio + Polaris) |
| Multi-cloud or multi-engine architecture | Apache Iceberg |
| AWS S3-native managed table service | Apache Iceberg (S3 Tables) |
| Google Cloud-native managed table service | Apache Iceberg (BigLake) |
| Existing Databricks + Delta tables, need Iceberg access | Delta Lake with UniForm (read-only external Iceberg) |

The Industry Direction in 2025

The clearest signal of where the market is going is that multiple companies have invested in Iceberg compatibility even when their primary format is different. Databricks shipped UniForm specifically because external engines need Iceberg access. Snowflake co-created Apache Polaris with Dremio. AWS launched S3 Tables as a native managed Iceberg service. Google launched BigLake Managed Tables on Iceberg. Every major cloud provider has bet on Iceberg as the interoperability layer.

That does not mean Delta Lake and Hudi are going away. Both have strong user bases and well-defined use cases. But for new projects where multi-engine access and open governance matter, Iceberg is the safer long-term choice.

Go Deeper

📚 Go Deeper on Apache Iceberg

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.