
Apache Iceberg Explained

Apache Iceberg is an open table format for large analytic tables stored in object storage. It was created at Netflix in 2017, open-sourced in 2018, and graduated to a top-level Apache Software Foundation project in 2020. Today it is the most widely adopted open table format across cloud providers, query engines, and data platforms.

At its core, Iceberg solves a problem that plagued Hive-style data lakes: there was no reliable way to know exactly which files belonged to a table at any given moment, which made concurrent writes dangerous, schema changes painful, and consistent reads nearly impossible at scale. Iceberg replaces that folder-based approach with a proper metadata system that tracks every file, every schema change, and every transaction.

The Problem Iceberg Solves

In the Hive model, a table is a directory. Writers add files to that directory, and readers scan whatever is there. This causes three practical problems:

- Concurrent writes are unsafe: readers can observe a directory mid-write, and two writers can silently overwrite each other's output.
- Schema changes are painful: columns are matched by name and position, so renames and reorders can break or misread old files.
- Consistent reads are nearly impossible at scale: query planning depends on listing the directory, and the listing can change while the query runs.

Iceberg addresses all three with a metadata-first design. The table is not a directory. It is a pointer to a metadata file that describes the complete table state.

The Core Abstraction: Snapshots and Manifests

Iceberg tracks table state through a hierarchy of metadata files. Each layer summarizes the one below it, which lets query engines do most of their work at the metadata level before touching any actual data.

```mermaid
graph TD
    A["Catalog<br/>(table name → metadata pointer)"]
    A --> B["Table Metadata JSON<br/>(schema, partition spec, snapshot list)"]
    B --> C["Manifest List (Avro)<br/>(one entry per manifest file, with partition summaries)"]
    C --> D1["Manifest File (Avro)<br/>(one entry per data file, with column stats)"]
    C --> D2["Manifest File (Avro)<br/>(one entry per data file, with column stats)"]
    D1 --> E1["Data File (Parquet)<br/>actual rows"]
    D1 --> E2["Data File (Parquet)<br/>actual rows"]
    D2 --> E3["Data File (Parquet)<br/>actual rows"]
```

When a query arrives, the engine reads the catalog to find the metadata file, reads the metadata to find the current snapshot's manifest list, reads each manifest to find which data files match the query filters (using the per-file min/max statistics), and only then reads the actual Parquet data. A query that filters on a date column may skip 99% of the data files without opening them.
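The min/max pruning step can be modeled in a few lines of plain Python. This is a simplified sketch, not PyIceberg or engine code: each manifest entry carries per-file bounds for a column, and the planner keeps only files whose range overlaps the query predicate.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DataFileEntry:
    """One manifest entry: a data file path plus its column statistics."""
    path: str
    min_order_date: date
    max_order_date: date

def prune(entries, lo, hi):
    """Keep only files whose [min, max] range overlaps the filter [lo, hi]."""
    return [e.path for e in entries
            if e.max_order_date >= lo and e.min_order_date <= hi]

manifest = [
    DataFileEntry("s3://bucket/t/a.parquet", date(2024, 1, 1), date(2024, 1, 31)),
    DataFileEntry("s3://bucket/t/b.parquet", date(2024, 2, 1), date(2024, 2, 29)),
    DataFileEntry("s3://bucket/t/c.parquet", date(2024, 3, 1), date(2024, 3, 31)),
]

# WHERE order_date BETWEEN 2024-02-10 AND 2024-02-20 → only b.parquet is read
print(prune(manifest, date(2024, 2, 10), date(2024, 2, 20)))
```

The key point is that this decision is made entirely from metadata; none of the Parquet files are opened to rule them out.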

For a deeper walkthrough of this structure, see the Apache Iceberg Architecture guide.

Key Capabilities

ACID Transactions

Iceberg uses optimistic concurrency control. A writer reads the current table state, makes changes, and atomically swaps the metadata pointer to a new snapshot. Readers always see a complete, consistent snapshot. Concurrent writers retry if a conflict occurs.
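The retry loop can be sketched as a compare-and-swap against the catalog pointer. This is a toy model, not the real catalog protocol: the `Catalog` class and version numbers here stand in for whatever atomic swap the catalog backend provides.

```python
import threading

class Catalog:
    """Toy catalog: holds the current metadata pointer, swapped atomically."""
    def __init__(self):
        self._lock = threading.Lock()
        self.metadata_version = 0

    def commit(self, expected_version, new_version):
        """Compare-and-swap: succeed only if no one else committed first."""
        with self._lock:
            if self.metadata_version != expected_version:
                return False  # conflict: caller must re-read and retry
            self.metadata_version = new_version
            return True

def write_with_retry(catalog, max_attempts=5):
    for _ in range(max_attempts):
        base = catalog.metadata_version      # read current table state
        # ... write data files and new metadata derived from `base` ...
        if catalog.commit(base, base + 1):   # atomic pointer swap
            return base + 1
    raise RuntimeError("too many conflicting commits")
```

Because only the pointer swap must be atomic, all the heavy work (writing data files and metadata) happens outside any critical section.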

Schema Evolution Without Data Rewrites

Iceberg assigns every column a numeric ID rather than relying on column names. This means you can rename a column, add a column, or reorder columns without touching a single data file. The old files still work correctly because Iceberg maps the column IDs to the new names at read time.
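The ID-based resolution can be illustrated with a toy model (this is not the real Parquet/Iceberg reader): data files store values keyed by field ID, and the table schema maps IDs to whatever the columns are currently named.

```python
def project(row_by_field_id, schema):
    """Resolve a stored row against the current schema by field ID."""
    return {
        name: row_by_field_id.get(field_id)  # None for columns added later
        for field_id, name in schema.items()
    }

# A row written back when field 2 was called "amount":
old_row = {1: "order-1001", 2: 49.95}

# The table has since renamed field 2 to "total" and added field 3 "currency":
current_schema = {1: "order_id", 2: "total", 3: "currency"}

print(project(old_row, current_schema))
# → {'order_id': 'order-1001', 'total': 49.95, 'currency': None}
```

The old file resolves correctly under the new names, and the column added after the file was written simply reads as null.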

Hidden Partitioning

In Hive, users have to know the partition columns and include them in every query to get partition pruning. Iceberg handles this automatically. You declare a partition transform (`days(order_date)`, for example), and Iceberg applies it at write time and uses it for pruning at read time, without exposing partition columns to query authors.
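A toy version of the `days` transform makes the mechanics concrete. This is a sketch of the idea, not Iceberg's implementation: the engine derives the partition value from the data itself, so users filter on `order_date` directly and never name a partition column.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)

def days_transform(ts: datetime) -> int:
    """Map a timestamp to its partition value: whole days since the epoch."""
    return (ts.date() - EPOCH).days

# Write path: the engine computes the partition value automatically,
# so the same row can never land in the wrong partition.
partition_value = days_transform(datetime(2024, 2, 15, 9, 30))

# Read path: a filter on order_date is rewritten into a filter on
# days_transform(order_date), so pruning happens without the user's help.
```

Because the transform is declared in metadata rather than encoded in directory names, Iceberg can also verify that writes and reads agree on it.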

Time Travel and Rollback

Every commit creates an immutable snapshot. You can query any past snapshot by timestamp or snapshot ID, compare two snapshots to see what changed, and roll back a table to any previous state with a single metadata operation (no data is rewritten on rollback).
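Resolving a timestamp to a snapshot is a lookup over the table's snapshot log. The sketch below is a simplified model (timestamps and IDs are invented for illustration): find the last snapshot committed at or before the requested time, and note that rollback just re-points the table.

```python
from bisect import bisect_right

# Toy snapshot log: (commit_timestamp_ms, snapshot_id) pairs in commit order.
snapshot_log = [
    (1_700_000_000_000, 101),
    (1_700_000_600_000, 102),
    (1_700_001_200_000, 103),
]

def snapshot_as_of(log, ts_ms):
    """Return the id of the last snapshot committed at or before ts_ms."""
    i = bisect_right([t for t, _ in log], ts_ms)
    if i == 0:
        raise ValueError("timestamp precedes the table's first snapshot")
    return log[i - 1][1]

def rollback(log, snapshot_id):
    """Rollback is metadata-only: re-point the table at an older snapshot."""
    assert any(s == snapshot_id for _, s in log), "unknown snapshot"
    return snapshot_id  # becomes the current snapshot; no data is rewritten
```

Since every snapshot is immutable, "AS OF" queries and rollbacks are both cheap metadata operations over this log.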

Partition Evolution

You can change a table's partition scheme without rewriting existing data files. Old files are queried with the old partition layout; new files use the new one. Iceberg's query planner handles both transparently.
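The mechanism can be sketched as follows (a simplified model; the spec IDs and partition field names are invented for illustration): every data file records which partition spec it was written under, and the planner groups files by spec so each group is pruned with its own layout.

```python
# Each partition spec the table has ever had is kept in metadata.
specs = {
    0: "months(order_date)",  # original layout
    1: "days(order_date)",    # layout after partition evolution
}

# Each data file carries the id of the spec it was written with.
files = [
    {"path": "a.parquet", "spec_id": 0, "partition": {"order_date_month": 648}},
    {"path": "b.parquet", "spec_id": 1, "partition": {"order_date_day": 19770}},
]

# Planning groups files by spec, so old and new layouts coexist in one table
# and each group is pruned using the transform it was actually written with.
by_spec = {}
for f in files:
    by_spec.setdefault(f["spec_id"], []).append(f["path"])

print(by_spec)  # → {0: ['a.parquet'], 1: ['b.parquet']}
```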

How a Write Commit Works

```mermaid
sequenceDiagram
    participant W as Writer (Spark / Flink / etc.)
    participant Cat as Catalog
    participant S3 as Object Storage
    W->>Cat: Load table (get metadata location)
    Cat-->>W: metadata.json v4
    W->>S3: Write new Parquet data files
    W->>S3: Write manifest file (lists new data files)
    W->>S3: Write manifest list (snapshot)
    W->>S3: Write new metadata.json v5
    W->>Cat: Commit — swap pointer from v4 to v5
    Cat-->>W: Success (or 409 Conflict — retry)
```

The catalog commit is the atomic step. If two writers try to commit against the same metadata version, only one succeeds. The other retries from the current state. This is what gives Iceberg serializable isolation without requiring a distributed lock.

Engine Support

One reason Iceberg has become the default open table format is that virtually every query engine now supports it natively.

| Engine | Read | Write | Notes |
| --- | --- | --- | --- |
| Apache Spark | Yes | Yes | Most complete integration; CoW + MoR |
| Apache Flink | Yes | Yes | Streaming sink with exactly-once |
| Trino | Yes | Yes | Full DML support |
| Dremio | Yes | Yes | Native with AI Semantic Layer |
| AWS Athena | Yes | Yes | Native via AWS Glue catalog |
| Google BigQuery | Yes | Yes | BigLake managed Iceberg tables |
| Snowflake | Yes | Yes | Iceberg tables + Open Catalog (Polaris) |
| DuckDB | Yes | Partial | Via iceberg extension |
| PyIceberg | Yes | Yes | Python-native client library |
| StarRocks / Doris | Yes | Yes | Full native support |

The Iceberg Catalog and REST API

Catalogs are how engines find tables. Iceberg defines a standard REST Catalog API that any catalog implementation exposes: create/read/update/delete tables, list namespaces, commit snapshots, and vend storage credentials. Because the API is an open standard, any engine that implements it can talk to any catalog.
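The spec's resource layout can be sketched by constructing the table endpoint URL. This is an illustration, not a client library: the host and the `prod` prefix below are placeholder values, and only the path shape (`/v1/{prefix}/namespaces/{namespace}/tables/{table}`) comes from the REST Catalog spec.

```python
from urllib.parse import quote

BASE = "https://catalog.example.com/v1"  # placeholder catalog endpoint

def table_url(namespace: str, table: str, prefix: str = "") -> str:
    """Build the table resource URL: GET loads metadata, POST commits changes."""
    parts = [BASE]
    if prefix:                      # optional catalog-assigned path prefix
        parts.append(quote(prefix, safe=""))
    parts += ["namespaces", quote(namespace, safe=""),
              "tables", quote(table, safe="")]
    return "/".join(parts)

print(table_url("sales", "orders"))
# → https://catalog.example.com/v1/namespaces/sales/tables/orders
print(table_url("sales", "orders", prefix="prod"))
# → https://catalog.example.com/v1/prod/namespaces/sales/tables/orders
```

A GET on that URL returns the current metadata location; a POST submits metadata updates (a new snapshot) along with requirements the catalog checks before swapping the pointer.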

Apache Polaris, co-created by Dremio and Snowflake, is the reference implementation of this REST Catalog standard. Other catalogs that implement the same spec include Project Nessie, AWS Glue, and Snowflake Open Catalog.

For the full API reference, see the Iceberg REST Catalog guide.

Where Iceberg Fits in the Data Stack

```mermaid
graph LR
    A["Ingestion<br/>(Kafka, CDC, batch ETL)"] --> B["Iceberg Tables<br/>(Bronze / Silver / Gold)"]
    B --> C["Catalog<br/>(Apache Polaris / Glue / Nessie)"]
    C --> D1["Analytics<br/>(Dremio, Trino, Athena)"]
    C --> D2["ML / AI<br/>(Spark, PyIceberg, DuckDB)"]
    C --> D3["AI Agents<br/>(MCP, LangChain, Dremio AI Agent)"]
    D1 --> E["BI Tools<br/>(Superset, Tableau, Power BI)"]
```

When Not to Use Iceberg

Iceberg is designed for large analytical tables. It adds overhead that is not worth it for small reference tables with a few thousand rows (use a regular database). It is also not a good fit when you need sub-millisecond point lookups by primary key (use a transactional database or key-value store). And if all your compute runs inside a single managed warehouse that already handles table management, adding Iceberg may create more complexity than value.

Go Deeper


Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.