Apache Iceberg Architecture
Apache Iceberg represents every table as a tree of metadata files sitting in object storage alongside the actual data. Understanding that tree is the key to understanding how Iceberg delivers ACID guarantees, fast queries, and multi-engine access without a central coordinator. This page walks through each layer of the metadata tree, explains the read and write paths, and shows how the concurrency model works.
The Metadata Tree
The metadata tree has four levels. Each level references the one below it, and the catalog holds the entry point at the top.
```
Iceberg Catalog (table name → current metadata.json location)
└── Table Metadata JSON (schema • partition spec • sort order • snapshot list • current snapshot ID)
    └── Manifest List (Avro): one record per manifest file; includes partition-level min/max summaries
        ├── Manifest File (Avro): one record per data file; includes column-level min/max, null counts
        │   ├── data-00001.parquet
        │   └── data-00002.parquet
        └── Manifest File (Avro): one record per data file; includes column-level min/max, null counts
            ├── data-00003.parquet
            └── delete-00001.parquet (positional or equality deletes)
```
Table Metadata JSON
Every Iceberg table has a current metadata JSON file. It records the table's entire state at a point in time: the full schema (with column IDs, not just names), all partition specs that have ever been active, all sort orders, and a list of all snapshots with their IDs and timestamps. The catalog holds a pointer to the current version of this file.
When you run ALTER TABLE ... ADD COLUMN, Iceberg writes a new metadata JSON with the updated schema and swaps the catalog pointer. No data files are rewritten.
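Every schema change like this leaves a trail you can inspect. A sketch using the metadata_log_entries metadata table, assuming a Spark catalog named prod and a table named db.orders (both illustrative):

```sql
-- Each row is one metadata JSON version; the newest one is what the
-- catalog pointer currently references.
SELECT timestamp, file, latest_snapshot_id
FROM prod.db.orders.metadata_log_entries
ORDER BY timestamp DESC;
```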
Snapshots
A snapshot represents the state of a table at the moment a transaction committed. Each snapshot has a unique ID, a parent snapshot ID (forming a chain of history), a sequence number, a timestamp, a summary of what changed (rows added, files added/removed), and a pointer to the manifest list for that snapshot.
Snapshots are immutable. Once written, they never change. When you query a table, you are always querying a specific snapshot, which is what makes reads consistent and time travel possible.
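Because each snapshot is addressable by ID or timestamp, time travel is just a read pinned to an older snapshot. A minimal Spark SQL sketch (catalog, table, and snapshot ID are illustrative):

```sql
-- Read the table as of a specific snapshot ID (Spark 3.3+ syntax)...
SELECT count(*) FROM prod.db.orders VERSION AS OF 8744736658442914487;

-- ...or as of a wall-clock time.
SELECT count(*) FROM prod.db.orders TIMESTAMP AS OF '2026-01-15 00:00:00';
```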
Manifest List
The manifest list is an Avro file. Each record in it describes one manifest file and includes a summary of that manifest's partition range (the minimum and maximum partition values covered by the files in that manifest). This summary is what enables manifest-level partition pruning: the query planner reads only the manifest list (a small file) to determine which manifests can be skipped entirely before opening any manifest file.
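These per-manifest summaries are directly visible through the manifests metadata table. A sketch, again assuming a Spark catalog named prod:

```sql
-- partition_summaries holds the min/max partition values the planner prunes on.
SELECT path, added_data_files_count, partition_summaries
FROM prod.db.orders.manifests;
```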
Manifest Files
Each manifest file is also Avro. Every record describes one data file or delete file and includes the file's path, format, partition values, record count, file size, and per-column statistics: minimum value, maximum value, null count, and distinct value estimate (NDV). These column statistics are what enable file-level data skipping: if a file's max order_date is earlier than the query's filter, the engine skips that file without opening it.
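The same statistics are exposed through the files metadata table, which is a convenient way to see exactly what the planner prunes on (names illustrative):

```sql
-- lower_bounds and upper_bounds are per-column min/max maps, keyed by field ID.
SELECT file_path, record_count, null_value_counts, lower_bounds, upper_bounds
FROM prod.db.orders.files;
```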
Data and Delete Files
The actual data sits in Parquet files (or ORC or Avro) in object storage. Delete files record row-level deletions for Merge-on-Read tables. In Spec v2, delete files come in two types: positional deletes (which record file path and row position) and equality deletes (which record column values to match). In Spec v3, deletion vectors replace positional delete files with compact Roaring Bitmap blobs, reducing overhead significantly.
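A sketch of how delete files get written, assuming a format v2 table and Spark SQL (table name illustrative):

```sql
-- Ask row-level deletes to write delete files instead of rewriting data files.
ALTER TABLE prod.db.orders SET TBLPROPERTIES ('write.delete.mode' = 'merge-on-read');

-- This commit adds a small delete file; readers merge it with the data files at scan time.
DELETE FROM prod.db.orders WHERE order_id = 42;
```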
For more on the delete file types, see positional deletes and equality deletes.
The Read Path: How Query Planning Works
For a query filtering on order_date >= 2026-01-01 and region = 'AMER', planning proceeds top-down through the metadata tree:

1. Ask the catalog for the table's current metadata JSON location.
2. Read the metadata JSON and pick the snapshot to query (the current one, or a time-travel target).
3. Read the manifest list and evaluate partition-level summaries: skip manifests where max(date) < 2026-01-01 or region != AMER.
4. Read qualifying manifests and evaluate file-level stats: skip files where max(date) < 2026-01-01.
5. Apply bloom filters (if present): skip files where the query value is not in the bloom set.
6. Read qualifying Parquet files: apply row-group stats, then read matching columns and return the result rows.
The key insight is that steps 1 through 5 are metadata operations. They do not read any Parquet data. A query that filters on a well-partitioned date column may skip 99% of the data files without any I/O against those files. This is how Iceberg achieves warehouse-level query performance on raw object storage.
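Concretely, the query shape those planning steps assume might look like this (table and columns are illustrative); both predicates are resolved against manifest-list and manifest statistics before any Parquet file is opened:

```sql
SELECT order_id, order_total
FROM prod.db.orders
WHERE order_date >= DATE '2026-01-01'
  AND region = 'AMER';
```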
The Write Path: Committing a Snapshot
The catalog commit is the only step that requires coordination. Everything before it (writing Parquet files, manifests, and the new metadata file) is independent and can be parallelized. The catalog performs a compare-and-swap: if the current metadata pointer still points to version N, it swaps to N+1 and returns success. If someone else already committed N+1, it returns a conflict error.
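You can watch the result of the commit protocol from SQL: every successful compare-and-swap appends exactly one row to the snapshots metadata table (names and values illustrative):

```sql
-- Append some rows; on success, the catalog pointer advances to metadata version N+1.
INSERT INTO prod.db.orders VALUES (1001, DATE '2026-02-01', 'AMER', 99.50);

-- The commit that just succeeded shows up as the newest snapshot.
SELECT committed_at, snapshot_id, parent_id, operation
FROM prod.db.orders.snapshots
ORDER BY committed_at DESC;
```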
Concurrency and Isolation
Iceberg uses optimistic concurrency control. This means writers do not take locks. Instead, each writer reads the current state, does its work, and tries to commit. If a conflict occurs, it retries.
Whether a conflict actually blocks depends on what changed:
- Two appends to different partitions: no conflict, both succeed.
- Two writers both overwriting the same partition: conflict, one retries.
- A compaction job and an append job: usually compatible, both succeed (Iceberg's conflict resolution understands that compaction does not change the logical table contents).
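How hard a writer retries is tunable per table through Iceberg's commit properties. A sketch with illustrative values:

```sql
ALTER TABLE prod.db.orders SET TBLPROPERTIES (
  'commit.retry.num-retries' = '10',  -- attempts before the commit fails (default is 4)
  'commit.retry.min-wait-ms' = '100'  -- initial backoff between attempts
);
```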
This model scales well because most concurrent operations in a lakehouse do not touch the same data. For cases where strict serialization matters, catalog implementations like Apache Polaris support stronger conflict detection. See concurrent writes in Iceberg for a detailed walkthrough.
Partition Evolution
One of Iceberg's most practically useful features is the ability to change a table's partition scheme without rewriting any data. This works because each manifest file records which partition spec was active when its files were written. The query planner knows how to handle files written under different partition specs in the same table.
Example: A table was originally partitioned by months(order_date) and the team wants to switch to days(order_date) for finer granularity. After the ALTER TABLE command, new writes use the day-level partitioning. Old files remain correctly queryable under the month-level spec. No migration job is needed.
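In Spark SQL with the Iceberg extensions enabled, the switch is two metadata-only statements (table name illustrative):

```sql
-- Neither statement rewrites data; old files keep their month-level spec.
ALTER TABLE prod.db.orders DROP PARTITION FIELD months(order_date);
ALTER TABLE prod.db.orders ADD PARTITION FIELD days(order_date);
```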
Snapshot Retention and Maintenance
Because every commit creates a new snapshot and preserves the old one, snapshots accumulate over time. Old snapshots continue to reference data files that newer snapshots no longer need (for example, small files that compaction has since rewritten), which prevents those files from being garbage collected. This is the primary mechanism behind Iceberg's time travel capability, and also the primary source of storage cost growth if snapshots are not expired regularly.
The standard maintenance operations are:
- Expire snapshots — remove snapshots older than a retention window
- Remove orphan files — clean up files not referenced by any active snapshot
- Compaction — rewrite many small files into fewer large files for better query performance
- Rewrite manifests — consolidate many small manifests into fewer large ones
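Each of these maps to a Spark procedure. A minimal sketch, assuming a catalog named prod and illustrative retention settings:

```sql
-- Drop snapshots older than the retention window, keeping at least 10.
CALL prod.system.expire_snapshots(
  table => 'db.orders',
  older_than => TIMESTAMP '2026-01-01 00:00:00',
  retain_last => 10);

CALL prod.system.remove_orphan_files(table => 'db.orders');  -- unreferenced files
CALL prod.system.rewrite_data_files(table => 'db.orders');   -- compaction
CALL prod.system.rewrite_manifests(table => 'db.orders');    -- manifest consolidation
```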
For a complete maintenance schedule and implementation, see the maintenance scheduling guide.
Column IDs and Schema Evolution
Every column in an Iceberg table has a numeric field ID assigned at creation time. Parquet files store data by field ID, not by column name. This is what makes schema evolution safe:
- Rename a column: the metadata JSON records the new name. Old Parquet files are still read correctly because they store data by field ID.
- Add a column: new files include the new field. Old files do not. The reader returns null for the missing field in old files.
- Drop a column: the field ID is retired. Old files still contain the data for that field, but it is not returned to readers (the reader skips it).
- Reorder columns: column order is metadata-only. Physical storage order is by field ID in Parquet, so reordering has no effect on data files.
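All four are single DDL statements and metadata-only commits. A sketch (column names illustrative; reordering uses Spark's ALTER COLUMN ... FIRST/AFTER syntax):

```sql
ALTER TABLE prod.db.orders RENAME COLUMN total TO order_total;
ALTER TABLE prod.db.orders ADD COLUMN coupon_code string;
ALTER TABLE prod.db.orders DROP COLUMN legacy_flag;
ALTER TABLE prod.db.orders ALTER COLUMN region FIRST;
```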
Go Deeper
- Apache Iceberg Explained — the overview and key capabilities
- Snapshots and Time Travel — querying and managing table history
- REST Catalog and Apache Polaris — the catalog API in detail
- Predicate Pushdown — how query planning interacts with manifest statistics
- Puffin Files — extended statistics blobs for advanced data skipping