Apache Iceberg vs Delta Lake vs Apache Hudi

Apache Iceberg, Delta Lake, and Apache Hudi are the three formats that have become the standard choices for mutable analytical tables in object storage. Each one solves the same core problem (consistent, updatable tables on cheap storage) but with different design priorities that make them better fits for different workloads.

This comparison is vendor-neutral. The goal is to help you pick the right format based on what your workload actually needs, not based on which vendor talks loudest.

Origins and Governance

| Format | Created by | Year open-sourced | Governance | Primary design goal |
|---|---|---|---|---|
| Apache Iceberg | Netflix | 2018 | Apache Software Foundation | Multi-engine interoperability and open standards |
| Delta Lake | Databricks | 2019 | Linux Foundation | Reliable data lake on top of Spark |
| Apache Hudi | Uber | 2017 (entered ASF incubator 2019) | Apache Software Foundation | High-frequency upserts and incremental processing |

How Each Format Tracks Table State

The transaction log or metadata model is the most fundamental architectural difference between the three formats.

```mermaid
graph TD
    subgraph ICE["Apache Iceberg"]
        direction TB
        I1["metadata.json (current snapshot pointer)"]
        I2["Manifest List (snapshot state)"]
        I3["Manifest Files (per-file stats)"]
        I4["Parquet Data Files"]
        I1 --> I2 --> I3 --> I4
    end
    subgraph DL["Delta Lake"]
        direction TB
        D1["_delta_log/ (JSON commit files + checkpoints)"]
        D2["Parquet Checkpoint (periodic state snapshot)"]
        D3["Parquet Data Files"]
        D1 --> D2
        D1 --> D3
    end
    subgraph HUDI["Apache Hudi"]
        direction TB
        H1[".hoodie/ timeline (commit files, clean files, compaction)"]
        H2["Base Files (Parquet)"]
        H3["Delta Log Files (MoR only)"]
        H1 --> H2
        H1 --> H3
    end
```

Iceberg builds an immutable tree per snapshot. Each snapshot points to a manifest list that summarizes all the manifests, which in turn list the data files with per-column statistics. Readers always start from a complete, self-describing snapshot without replaying a log.
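
To make this concrete, here is a minimal PyIceberg sketch that walks the snapshot structure. It assumes a catalog named demo is already configured (for example in ~/.pyiceberg.yaml) and that a table db.events exists; both names are hypothetical.

```python
# Minimal sketch, assuming a configured catalog "demo" and an existing
# table "db.events" (both hypothetical).
from pyiceberg.catalog import load_catalog

catalog = load_catalog("demo")
table = catalog.load_table("db.events")

# The current snapshot is a complete, self-describing view of the table;
# its manifest list summarizes every manifest in that snapshot's tree.
snap = table.current_snapshot()
print(snap.snapshot_id, snap.manifest_list)

# Every historical snapshot is reachable directly, without replaying a log.
for s in table.snapshots():
    print(s.snapshot_id, s.timestamp_ms)
```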

Delta Lake stores a sequential log of JSON commit files. To find the current table state, a reader either replays the entire log or starts from the latest Parquet checkpoint and replays only the commits since then. This is simpler to implement than Iceberg's snapshot tree, but it adds I/O overhead at high commit rates, since every read must replay whatever commits have accumulated since the last checkpoint.
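
An illustrative (not production-grade) sketch of that replay model: reconstruct the live file set from the JSON commits alone. A real client would start from the latest checkpoint rather than commit zero, and the table path here is hypothetical.

```python
# Illustrative sketch of Delta's log replay; a real reader starts from the
# latest Parquet checkpoint rather than commit zero. Path is hypothetical.
import json
from pathlib import Path

log_dir = Path("/data/my_table/_delta_log")
live_files = set()

# Commit files are zero-padded, so lexicographic order == commit order.
for commit in sorted(log_dir.glob("*.json")):
    for line in commit.read_text().splitlines():
        action = json.loads(line)  # one action per line (add, remove, ...)
        if "add" in action:
            live_files.add(action["add"]["path"])
        elif "remove" in action:
            live_files.discard(action["remove"]["path"])

print(f"{len(live_files)} live data files after replay")
```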

Hudi maintains a timeline in a hidden .hoodie/ directory that records every commit, clean, compaction, and rollback as timeline actions. Hudi also stores per-record index metadata that Delta Lake and Iceberg do not, which is what enables its efficient key-based upsert capability.
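
A hedged PySpark sketch of the key-based upsert this enables (option names per Hudi's Spark datasource; the table path, name, and schema are hypothetical):

```python
# Hedged sketch of a Hudi key-based upsert via PySpark; table path,
# name, and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()
updates = spark.createDataFrame([("u1", "2025-01-02", 42)],
                                ["user_id", "ts", "score"])

(updates.write.format("hudi")
    .option("hoodie.table.name", "user_scores")
    .option("hoodie.datasource.write.recordkey.field", "user_id")  # index key
    .option("hoodie.datasource.write.precombine.field", "ts")      # latest ts wins
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://bucket/tables/user_scores"))
```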

Feature Comparison

| Feature | Apache Iceberg | Delta Lake | Apache Hudi |
|---|---|---|---|
| Time travel | Yes (snapshot ID or timestamp) | Yes (version number or timestamp) | Yes (timeline-based) |
| Schema evolution | Full (column IDs, no rewrites) | Full | Full |
| Partition evolution | Yes (no rewrites) | Partial (rewrites needed for some changes) | Limited |
| Hidden partitioning | Yes | No | No |
| Row-level deletes | Yes (CoW + MoR, positional + equality) | Yes (deletion vectors in Delta 2.0+) | Yes (native, multiple strategies) |
| Branching and tagging | Yes (table-level branches and tags) | No (catalog-level via Unity only) | No |
| Record-level indexing | Bloom filters (Puffin) | Bloom filters, Z-order stats | Bloom filter, HBase, bucket, simple index |
| Open catalog standard | REST Catalog spec (open) | Unity Catalog API (proprietary) | HMS / REST (no open spec) |
| Credential vending | Yes (via Polaris, Nessie, Glue) | Via Unity Catalog (Databricks) | No standard mechanism |
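
The time-travel row above maps to different read options in each format. A hedged Spark sketch (snapshot IDs, versions, instants, and paths are hypothetical; spark is an active SparkSession):

```python
# Hedged sketch: time travel reads in Spark for each format.
# Snapshot IDs, versions, instants, and paths are hypothetical.

# Iceberg: address a snapshot directly (or use "as-of-timestamp").
ice = (spark.read.format("iceberg")
       .option("snapshot-id", 1234567890)
       .load("db.events"))

# Delta: address a version number (or use "timestampAsOf").
dl = (spark.read.format("delta")
      .option("versionAsOf", 5)
      .load("/data/my_table"))

# Hudi: address an instant on the timeline.
hudi = (spark.read.format("hudi")
        .option("as.of.instant", "20250101123000")
        .load("s3://bucket/tables/user_scores"))
```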

Multi-Engine Support

Multi-engine support is where the formats diverge most clearly.

| Engine | Iceberg (read/write) | Delta Lake (read/write) | Hudi (read/write) |
|---|---|---|---|
| Apache Spark | Full | Full (best-in-class) | Full |
| Apache Flink | Full | Read + limited write | Full |
| Trino | Full | Read + write (connector) | Read (connector) |
| Dremio | Full (native) | Read (external table) | Limited |
| AWS Athena | Full | Full | Read |
| Google BigQuery | Full (BigLake) | No | No |
| Snowflake | Full (Iceberg tables + Open Catalog) | No | No |
| DuckDB | Read + partial write | No | No |
| PyIceberg | Full Python client | No equivalent | No equivalent |
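
The PyIceberg row deserves a concrete example: it can scan Iceberg tables without any JVM engine at all. A minimal sketch, assuming the same hypothetical catalog and table as earlier:

```python
# Minimal sketch: engine-free Iceberg read with PyIceberg.
# Catalog name, table, and columns are hypothetical.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("demo")
table = catalog.load_table("db.events")

# Manifest-level stats drive file pruning before any data is fetched.
arrow_table = table.scan(
    row_filter="event_date >= '2025-01-01'",
    selected_fields=("user_id", "event_date"),
).to_arrow()
print(arrow_table.num_rows)
```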

Delta Lake's UniForm feature (available since Delta 3.x) auto-generates Iceberg metadata alongside Delta metadata, allowing external Iceberg readers to access Delta tables in read-only mode. This is Databricks acknowledging that Iceberg's ecosystem reach is broader.
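
Enabling UniForm is a matter of table properties. A hedged Spark SQL sketch (property names per the Delta 3.x documentation; the table name is hypothetical):

```python
# Hedged sketch: turn on UniForm's Iceberg metadata generation for an
# existing Delta table. Table name is hypothetical.
spark.sql("""
    ALTER TABLE main.db.events SET TBLPROPERTIES (
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```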

Streaming and Incremental Processing

```mermaid
graph LR
    subgraph STREAM["Streaming Suitability"]
        I["Apache Iceberg<br/>Flink sink (exactly-once)<br/>Snapshot-diff incremental reads<br/>Good for streaming + batch hybrid"]
        D["Delta Lake<br/>Spark Structured Streaming<br/>Delta Change Data Feed (CDF)<br/>Best with Databricks DLT"]
        H["Apache Hudi<br/>Native incremental query by key<br/>Flink + Spark streaming<br/>Best for key-based upsert pipelines"]
    end
```

Hudi's native incremental query is more precise than Iceberg's snapshot-diff approach when you need to know exactly which record keys changed between two points in time. Iceberg's snapshot diff tells you which files changed, which is sufficient for most use cases but less granular than Hudi's per-record change tracking.
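
Side by side, the two styles look like this in Spark (instant times, snapshot IDs, and paths are hypothetical):

```python
# Hedged sketch contrasting the two incremental styles in Spark.
# Instant times, snapshot IDs, and paths are hypothetical.

# Hudi: per-record incremental pull -- records changed after a given instant.
changed_rows = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20250101000000")
    .load("s3://bucket/tables/user_scores"))

# Iceberg: snapshot-diff scan -- rows from files added between two snapshots.
appended_rows = (spark.read.format("iceberg")
    .option("start-snapshot-id", 1111111111)
    .option("end-snapshot-id", 2222222222)
    .load("db.events"))
```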

Governance and Ecosystem

Apache Iceberg has the most open governance ecosystem. The Iceberg REST Catalog specification is a published standard that any catalog can implement. Apache Polaris (co-created by Dremio and Snowflake), Project Nessie, AWS Glue, and Snowflake Open Catalog all implement this standard. Any engine that supports the REST spec connects to any of these catalogs without vendor-specific code.
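
Because the REST spec is engine- and vendor-neutral, the same client code works against any compliant catalog. A minimal PyIceberg sketch (the endpoint, credential, and warehouse values are placeholders):

```python
# Minimal sketch: one client, any REST-spec catalog (Polaris, Nessie,
# Glue, Snowflake Open Catalog). Endpoint and credentials are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "prod",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api/catalog",
        "credential": "client-id:client-secret",
        "warehouse": "analytics",
    },
)
print(catalog.list_namespaces())
```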

Delta Lake has Databricks Unity Catalog, which provides strong governance within the Databricks ecosystem. Unity is proprietary, so its governance capabilities are not available to other engines without going through Databricks.

Apache Hudi relies primarily on the Hive Metastore for catalog services and does not have an open catalog API equivalent to the Iceberg REST spec.
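
In practice that means enabling hive-sync on write so engines can discover the table through the metastore. A hedged sketch (option names per Hudi's hive-sync configuration; database, table, and path are hypothetical, and updates is the DataFrame from the earlier upsert example):

```python
# Hedged sketch: register a Hudi table in the Hive Metastore at write time.
# Database, table, and path are hypothetical; `updates` as defined earlier.
(updates.write.format("hudi")
    .option("hoodie.table.name", "user_scores")
    .option("hoodie.datasource.write.recordkey.field", "user_id")
    .option("hoodie.datasource.hive_sync.enable", "true")
    .option("hoodie.datasource.hive_sync.mode", "hms")  # talk to HMS directly
    .option("hoodie.datasource.hive_sync.database", "analytics")
    .option("hoodie.datasource.hive_sync.table", "user_scores")
    .mode("append")
    .save("s3://bucket/tables/user_scores"))
```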

Decision Framework

```mermaid
flowchart TD
    A["What is your primary requirement?"]
    A -->|"Multi-engine reads and writes<br/>OR open catalog governance<br/>OR cloud-native (AWS/GCP/Azure)"| B["Apache Iceberg"]
    A -->|"All-in on Databricks + Spark<br/>Unity Catalog for governance"| C["Delta Lake"]
    A -->|"High-frequency key-based upserts<br/>Spark-primary streaming pipelines"| D["Apache Hudi"]
    B --> E["Check: does your engine support Iceberg natively?<br/>(yes for Spark, Flink, Trino, Dremio, Athena, BigQuery, Snowflake, DuckDB)"]
    C --> F["Check: are you comfortable with Databricks as primary compute?"]
    D --> G["Check: do you need per-record incremental semantics<br/>or record-key-based indexing?"]
```

| Your situation | Best format |
|---|---|
| New project, no existing vendor commitment | Apache Iceberg |
| All-in Databricks, using Unity Catalog | Delta Lake |
| Spark-based CDC pipeline with frequent key-based updates | Apache Hudi |
| AI agent analytics on enterprise data | Apache Iceberg (Dremio + Polaris) |
| Multi-cloud or multi-engine architecture | Apache Iceberg |
| AWS S3-native managed table service | Apache Iceberg (S3 Tables) |
| Google Cloud-native managed table service | Apache Iceberg (BigLake) |
| Existing Databricks + Delta tables, need Iceberg access | Delta Lake with UniForm (read-only external Iceberg) |

The Industry Direction in 2025

The clearest signal of where the market is going is that multiple companies have invested in Iceberg compatibility even when their primary format is different. Databricks shipped UniForm specifically because external engines need Iceberg access. Snowflake co-created Apache Polaris with Dremio. AWS launched S3 Tables as a native managed Iceberg service. Google launched BigLake Managed Tables on Iceberg. Every major cloud provider has bet on Iceberg as the interoperability layer.

That does not mean Delta Lake and Hudi are going away. Both have strong user bases and well-defined use cases. But for new projects where multi-engine access and open governance matter, Iceberg is the safer long-term choice.

Go Deeper

📚 Go Deeper on Apache Iceberg

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.