
Apache Iceberg Architecture

Apache Iceberg represents every table as a tree of metadata files sitting in object storage alongside the actual data. Understanding that tree is the key to understanding how Iceberg delivers ACID guarantees, fast queries, and multi-engine access without a central coordinator. This page walks through each layer of the metadata tree, explains the read and write paths, and shows how the concurrency model works.

The Metadata Tree

The metadata tree has four levels. Each level references the one below it, and the catalog holds the entry point at the top.

graph TD CAT["Catalog
table name → current metadata.json location"] META["Table Metadata JSON
schema • partition spec • sort order
snapshot list • current snapshot ID"] ML["Manifest List (Avro)
one record per manifest file
includes partition-level min/max summaries"] MF1["Manifest File (Avro)
one record per data file
includes column-level min/max, null counts"] MF2["Manifest File (Avro)
one record per data file
includes column-level min/max, null counts"] DF1["data-00001.parquet"] DF2["data-00002.parquet"] DF3["data-00003.parquet"] DEL["delete-00001.parquet
(positional or equality deletes)"] CAT --> META META --> ML ML --> MF1 ML --> MF2 MF1 --> DF1 MF1 --> DF2 MF2 --> DF3 MF2 --> DEL

Table Metadata JSON

Every Iceberg table has a current metadata JSON file. It records the table's entire state at a point in time: the full schema (with column IDs, not just names), all partition specs that have ever been active, all sort orders, and a list of all snapshots with their IDs and timestamps. The catalog holds a pointer to the current version of this file.

When you run ALTER TABLE ... ADD COLUMN, Iceberg writes a new metadata JSON with the updated schema and swaps the catalog pointer. No data files are rewritten.
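As a rough sketch of how this looks from client code, the snippet below loads a table through PyIceberg and reads a few fields from its current metadata. The catalog name and table identifier are placeholders, and the snippet assumes PyIceberg is already configured to reach your catalog.

```python
# Sketch: inspecting a table's current metadata with PyIceberg.
# "my_catalog" and "sales.orders" are placeholder names.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("my_catalog")        # catalog config comes from ~/.pyiceberg.yaml or env vars
table = catalog.load_table("sales.orders")  # the catalog resolves the current metadata.json for us

meta = table.metadata
print(meta.current_snapshot_id)                              # snapshot that queries will read
print([field.name for field in table.schema().fields])       # current schema; fields carry stable IDs
print(len(meta.snapshots), "snapshots tracked in metadata")  # full history lives in metadata.json
```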

Snapshots

A snapshot represents the state of a table at the moment a transaction committed. Each snapshot has a unique ID, a parent snapshot ID (forming a chain of history), a sequence number, a timestamp, a summary of what changed (rows added, files added/removed), and a pointer to the manifest list for that snapshot.

Snapshots are immutable. Once written, they never change. When you query a table, you are always querying a specific snapshot, which is what makes reads consistent and time travel possible.
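To make this concrete, here is a small PyIceberg sketch that lists a table's snapshots and scans one of them (time travel). It assumes the `table` object from the previous snippet and a table with at least one snapshot.

```python
# Sketch: snapshot history and time travel with PyIceberg (assumes `table` from above).
for snap in table.snapshots():
    print(snap.snapshot_id, snap.parent_snapshot_id, snap.timestamp_ms, snap.summary)

# Read the table as of an older snapshot instead of the current one.
first_snapshot_id = table.snapshots()[0].snapshot_id
rows_then = table.scan(snapshot_id=first_snapshot_id).to_arrow()
```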

Manifest List

The manifest list is an Avro file with one record per manifest file in the snapshot. Each record includes a summary of that manifest's partition range (the minimum and maximum partition values covered by the files it tracks). This summary is what enables manifest-level partition pruning: the query planner reads only the manifest list, a single small file, to determine which manifests can be skipped entirely before opening any manifest file.
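Recent PyIceberg versions expose the manifest list as a read-only metadata table, which is a convenient way to see these summaries without parsing Avro yourself. The snippet below is a sketch and assumes a version that provides `table.inspect`.

```python
# Sketch: one row per manifest referenced by the current snapshot's manifest list.
manifests = table.inspect.manifests()
print(len(manifests), "manifests in the current manifest list")
print(manifests.column_names)   # includes the manifest path, size, and partition summaries
```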

Manifest Files

Each manifest file is also Avro. Every record describes one data file or delete file and includes the file's path, format, partition values, record count, file size, and per-column statistics: minimum value, maximum value, null count, and distinct value estimate (NDV). These column statistics are what enable file-level data skipping: if a file's max order_date is earlier than the query's filter, the engine skips that file without opening it.
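The same metadata-table mechanism exposes per-file entries, which is a quick way to see the statistics described above. Again a sketch, assuming a recent PyIceberg and the `table` object from earlier.

```python
# Sketch: per-file records from the manifests, including the stats used for data skipping.
files = table.inspect.files()
print(files.column_names)       # lower_bounds / upper_bounds hold per-column min/max values
for row in files.select(["file_path", "record_count", "file_size_in_bytes"]).to_pylist():
    print(row)
```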

Data and Delete Files

The actual data sits in Parquet files (or ORC or Avro) in object storage. Delete files record row-level deletions for Merge-on-Read tables. In Spec v2, delete files come in two types: positional deletes (which record file path and row position) and equality deletes (which record column values to match). In Spec v3, deletion vectors replace positional delete files with compact Roaring Bitmap blobs, reducing overhead significantly.

For more on the delete file types, see positional deletes and equality deletes.
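For intuition, the snippet below builds an Arrow table with the logical shape of a positional delete file: one row per deleted record, identified by data file path and row position. The path and positions are made up, and in practice engines write these files for you.

```python
# Sketch: the logical shape of a spec v2 positional delete file (illustrative values only).
import pyarrow as pa

positional_deletes = pa.table({
    "file_path": ["s3://bucket/warehouse/sales/orders/data-00001.parquet"] * 2,
    "pos": [17, 9042],   # row positions within that data file that should be treated as deleted
})
# An equality delete file instead stores values to match, e.g. rows where order_id = 123.
```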

The Read Path: How Query Planning Works

flowchart TD Q["Query arrives: SELECT ... WHERE region='AMER' AND date >= '2026-01-01'"] Q --> C["1. Catalog lookup: get current metadata.json path"] C --> M["2. Read metadata.json: get current snapshot ID"] M --> ML["3. Read manifest list: evaluate partition summaries
Skip manifests where max(date) < 2026-01-01 or region != AMER"] ML --> MF["4. Read qualifying manifests: evaluate file-level stats
Skip files where max(date) < 2026-01-01"] MF --> BP["5. Apply bloom filters (if present)
Skip files where query value not in bloom set"] BP --> READ["6. Read qualifying Parquet files
Apply row-group stats, then read matching columns"] READ --> RESULT["Return result rows"]

The key insight is that steps 1 through 5 are metadata operations: none of them read Parquet row data. A query that filters on a well-partitioned date column may skip 99% of the data files without any I/O against those files. This is how Iceberg achieves warehouse-level query performance on raw object storage.
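You can watch this pruning happen from PyIceberg: `plan_files()` performs the manifest-list and manifest pruning described above and returns only the files that survived. The filter below assumes the example table has `region` and `order_date` columns.

```python
# Sketch: metadata-only planning. No Parquet data is read until the tasks are executed.
scan = table.scan(row_filter="region = 'AMER' and order_date >= '2026-01-01'")
tasks = scan.plan_files()                     # one task per data file that survived pruning
print(len(tasks), "data files would actually be read")
```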

The Write Path: Committing a Snapshot

```mermaid
sequenceDiagram
    participant W as Writer
    participant Cat as Catalog
    participant S3 as Object Storage
    W->>Cat: Load table (get metadata.json version N)
    Cat-->>W: metadata.json path, schema, partition spec
    W->>S3: Write Parquet data files (parallel)
    W->>S3: Write manifest file (lists new data files + their stats)
    W->>S3: Write manifest list (new snapshot referencing manifests)
    W->>S3: Write metadata.json version N+1 (new snapshot added, pointer updated)
    W->>Cat: Commit (swap table pointer from N to N+1)
    Cat-->>W: 200 OK (or 409 Conflict if another writer committed first)
    Note over W,Cat: On 409, W re-reads current state, rebases changes, retries
```

The catalog commit is the only step that requires coordination. Everything before it (writing Parquet files, manifests, and the new metadata file) is independent and can be parallelized. The catalog performs a compare-and-swap: if the current metadata pointer still points to version N, it swaps to N+1 and returns success. If someone else already committed N+1, it returns a conflict error.
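In pseudocode, the commit loop looks roughly like the sketch below. The `catalog` object and its methods are hypothetical stand-ins for what a REST, Hive, Glue, Nessie, or Polaris catalog provides; the point is the compare-and-swap plus retry, not the exact API.

```python
# Hypothetical sketch of the optimistic commit protocol (not a real library API).
def commit_new_snapshot(catalog, table_id, build_snapshot):
    while True:
        base = catalog.current_metadata(table_id)   # version N
        new_metadata = build_snapshot(base)          # write data files, manifests, manifest list,
                                                     # and a metadata.json for version N+1
        if catalog.compare_and_swap(table_id, expected=base, new=new_metadata):
            return new_metadata                      # the table pointer now references N+1
        # Another writer committed first: loop, re-read the latest state, and rebase.
```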

Concurrency and Isolation

Iceberg uses optimistic concurrency control. This means writers do not take locks. Instead, each writer reads the current state, does its work, and tries to commit. If a conflict occurs, it retries.

Whether a conflict actually blocks an operation depends on what changed:

- Two writers that only append new files can both succeed. The writer that loses the race re-reads the latest snapshot and re-applies its new manifests on top of it.
- Operations that remove or rewrite existing files (overwrites, row-level deletes, compaction) must validate on retry that the files they depend on are still in the table. If a concurrent commit changed them, the retry fails validation and the operation must be re-planned.

This model scales well because most concurrent operations in a lakehouse do not touch the same data. For cases where strict serialization matters, catalog implementations like Apache Polaris support stronger conflict detection. See concurrent writes in Iceberg for a detailed walkthrough.
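At the application level this usually just means catching the commit failure and retrying against refreshed metadata. A PyIceberg-flavored sketch, assuming the table's schema matches the two illustrative columns below:

```python
# Sketch: retrying an append after a concurrent commit wins the race.
import pyarrow as pa
from pyiceberg.exceptions import CommitFailedException

new_rows = pa.table({"order_id": [1001, 1002], "region": ["AMER", "EMEA"]})

for attempt in range(3):
    try:
        table.append(new_rows)   # writes Parquet + manifests, then attempts the catalog swap
        break
    except CommitFailedException:
        table.refresh()          # pick up the snapshot that won, then try again
```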

Partition Evolution

One of Iceberg's most practically useful features is the ability to change a table's partition scheme without rewriting any data. This works because each manifest file records which partition spec was active when its files were written. The query planner knows how to handle files written under different partition specs in the same table.

Example: A table was originally partitioned by months(order_date) and the team wants to switch to days(order_date) for finer granularity. After the ALTER TABLE command, new writes use the day-level partitioning. Old files remain correctly queryable under the month-level spec. No migration job is needed.
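In Spark SQL (with the Iceberg runtime and SQL extensions on the classpath), that change is two DDL statements. This sketch assumes an existing SparkSession named `spark` and uses placeholder table and column names.

```python
# Sketch: partition evolution via Spark SQL. Table and column names are placeholders.
spark.sql("ALTER TABLE demo.sales.orders DROP PARTITION FIELD months(order_date)")
spark.sql("ALTER TABLE demo.sales.orders ADD PARTITION FIELD days(order_date)")
# Files written before this change keep their month-level spec and remain fully queryable.
```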

Snapshot Retention and Maintenance

Because every commit creates a new snapshot and preserves the old one, snapshots accumulate over time. Old snapshots reference data files that may have been compacted away, keeping those files from being garbage collected. This is the primary mechanism behind Iceberg's time travel capability, and also the primary source of storage cost growth if snapshots are not expired regularly.

The standard maintenance operations are:

- Expire snapshots: drop snapshots older than a retention window and delete the data files that only those snapshots reference.
- Remove orphan files: delete files in the table location that no snapshot references, typically left behind by failed writes.
- Compact data files: rewrite many small files into fewer, larger ones so scans open fewer objects.
- Rewrite manifests: reorganize manifest files so metadata stays small and planning stays fast as the table grows.

For a complete maintenance schedule and implementation, see the maintenance scheduling guide.
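As a sketch, the four operations map onto Iceberg's built-in Spark procedures. This assumes a SparkSession named `spark`, a catalog named `demo`, and a placeholder table; the retention timestamp is an example, not a recommendation.

```python
# Sketch: routine table maintenance with Iceberg's Spark procedures (placeholder names).
spark.sql("CALL demo.system.expire_snapshots(table => 'sales.orders', older_than => TIMESTAMP '2026-01-01 00:00:00')")
spark.sql("CALL demo.system.remove_orphan_files(table => 'sales.orders')")
spark.sql("CALL demo.system.rewrite_data_files(table => 'sales.orders')")
spark.sql("CALL demo.system.rewrite_manifests(table => 'sales.orders')")
```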

Column IDs and Schema Evolution

Every column in an Iceberg table has a numeric field ID assigned when the column is created. Data files track columns by field ID rather than by name, and readers resolve columns by ID. This is what makes schema evolution safe:

- Renaming a column changes only the name attached to its field ID; existing files keep working with no rewrite.
- Dropping a column and later adding a new column with the same name assigns a new ID, so old values are never accidentally resurrected.
- Adding, reordering, or type-promoting columns is a metadata-only change, because readers never rely on column position or name matching.

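A PyIceberg sketch of two metadata-only changes follows; the column names are illustrative, and each commit writes a new metadata JSON without touching data files.

```python
# Sketch: schema evolution through PyIceberg's transaction API (illustrative column names).
from pyiceberg.types import StringType

with table.update_schema() as update:
    update.add_column("customer_segment", StringType())  # new column gets a brand-new field ID
    update.rename_column("region", "sales_region")       # only the name bound to the ID changes
```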

Go Deeper


Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.