Iceberg Puffin Files
Puffin is the Apache Iceberg file format for storing advanced table-level statistics and indexes that go beyond the column min/max bounds available in manifest files. Puffin files attach supplementary statistical metadata to Iceberg table snapshots, enabling query planners to make better cost-based optimization decisions — such as accurate join ordering, smarter partition elimination, and bloom-filter-based row skipping.
The name “Puffin” is deliberately playful: it follows Iceberg’s arctic theme and refers to the seabird of the same name.
Why Puffin Exists
Manifest files store per-file column statistics: min/max values, null counts, value counts. These are powerful for data skipping but have limitations:
- Min/max bounds are blunt: A column with [1, 1000000] min/max bounds tells you almost nothing useful for selectivity estimation.
- No cardinality information: Query planners need to know “how many distinct values does customer_id have?” to correctly order joins and estimate output sizes.
- No probabilistic indexes: Bloom filters require per-file hash structures that can’t fit in the manifest format.
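To make the first two limitations concrete, here is a minimal Python sketch of how an equality-predicate estimate changes once a distinct-value count is available. ColumnStats and estimate_equality_rows are hypothetical names, and the uniform-distribution assumption is a textbook simplification rather than the behavior of any particular engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical per-column statistics, as a planner might see them."""
    min_value: int
    max_value: int
    row_count: int
    ndv: Optional[int] = None  # distinct-value estimate; absent without extra statistics


def estimate_equality_rows(stats: ColumnStats) -> float:
    """Estimate rows matching `col = constant`, assuming values are uniformly spread."""
    if stats.ndv is not None:
        distinct = stats.ndv
    else:
        # Fallback guess: every integer in [min, max] is a possible value.
        distinct = stats.max_value - stats.min_value + 1
    return stats.row_count / distinct


bounds_only = ColumnStats(min_value=1, max_value=1_000_000, row_count=10_000_000)
with_ndv = ColumnStats(min_value=1, max_value=1_000_000, row_count=10_000_000, ndv=50)

print(estimate_equality_rows(bounds_only))  # 10.0
print(estimate_equality_rows(with_ndv))     # 200000.0
```

Both columns share the same bounds, yet the estimates differ by four orders of magnitude, which is the gap an NDV statistic closes.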
Puffin adds a dedicated file format to attach these richer statistics to snapshots — separate from manifests, and extensible for future statistics types.
Puffin File Structure
A Puffin file is a binary format with:
- A file-level magic header identifying it as a Puffin file.
- One or more blobs: named, typed data structures (statistics, indexes).
- A footer with blob metadata (byte offsets, lengths, compression, type, associated snapshot, associated columns).
- A file-level footer magic for validation.
Each blob in a Puffin file has:
- type: The statistics type (e.g., apache-datasketches-theta-v1, apache-datasketches-hll-v1)
- fields: Which column IDs the statistic covers
- snapshot-id: The snapshot this statistic was computed for
- sequence-number: The sequence number of the snapshot
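The layout described above can be sketched in a few lines of Python. This is a simplified illustration, not a spec-complete reader or writer: it assumes the four-byte magic "PFA1", an uncompressed JSON footer followed by a 4-byte little-endian payload size and 4 flag bytes, and uncompressed blobs. The function names write_puffin and read_puffin are hypothetical, and the blob payload is a placeholder rather than a real sketch.

```python
import json
import struct

MAGIC = b"PFA1"  # assumed magic bytes: 0x50 0x46 0x41 0x31


def write_puffin(blobs):
    """Serialize (metadata, payload) pairs into an uncompressed Puffin-style file."""
    out = bytearray(MAGIC)                 # file-level magic header
    blob_entries = []
    for meta, payload in blobs:
        # Record where each blob lives so readers can seek straight to it.
        blob_entries.append(dict(meta, offset=len(out), length=len(payload)))
        out += payload
    footer_payload = json.dumps({"blobs": blob_entries}).encode("utf-8")
    out += MAGIC                           # footer starts with the magic
    out += footer_payload                  # blob metadata as JSON
    out += struct.pack("<i", len(footer_payload))  # payload size, little-endian
    out += bytes(4)                        # flags: none set (footer not compressed)
    out += MAGIC                           # trailing magic for validation
    return bytes(out)


def read_puffin(data):
    """Return a list of (blob metadata, payload bytes), reading the footer first."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Puffin file")
    # The last 12 bytes are: payload size (4), flags (4), magic (4).
    (payload_size,) = struct.unpack("<i", data[-12:-8])
    footer_start = len(data) - 12 - payload_size
    if data[footer_start - 4:footer_start] != MAGIC:
        raise ValueError("footer magic missing")
    footer = json.loads(data[footer_start:footer_start + payload_size])
    return [(b, data[b["offset"]:b["offset"] + b["length"]]) for b in footer["blobs"]]


file_bytes = write_puffin([
    ({"type": "apache-datasketches-theta-v1", "fields": [1],
      "snapshot-id": 8027658604211071520, "sequence-number": 1},
     b"sketch-bytes-placeholder"),
])
for meta, payload in read_puffin(file_bytes):
    print(meta["type"], meta["fields"], meta["offset"], len(payload))
```

Because the footer sits at the end and carries offsets, a reader can fetch the tail of the file, decide which blobs it needs, and then read only those byte ranges.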
Supported Statistics Types
Apache DataSketches Theta Sketch (NDV)
Estimates the number of distinct values (NDV) for a column using the Theta sketch algorithm from the Apache DataSketches library. NDV is critical for join cardinality estimation.
blob type: "apache-datasketches-theta-v1"
→ answers: "approximately how many distinct values does customer_id have?"
→ use: join ordering, GROUP BY cardinality estimation
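For intuition about how such a sketch answers the NDV question, the following Python example implements a K-minimum-values (KMV) estimator, a simplified relative of the Theta sketch. It is not the DataSketches implementation and does not produce a blob that Iceberg engines can read; KMVSketch and its methods are hypothetical names used only for illustration.

```python
import hashlib
import heapq


class KMVSketch:
    """Keeps the k smallest hash values seen; their spread reveals the distinct count."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # max-heap (stored negated) of the k smallest hashes
        self._members = set()  # same hashes, for duplicate detection

    @staticmethod
    def _hash(value):
        digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64  # roughly uniform in [0, 1)

    def update(self, value):
        self._insert(self._hash(value))

    def _insert(self, h):
        if h in self._members:
            return  # duplicates never change the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct values: exact count
        theta = -self._heap[0]             # the k-th smallest hash
        return (self.k - 1) / theta

    def merge(self, other):
        """Union of two sketches built with the same k."""
        merged = KMVSketch(self.k)
        for h in self._members | other._members:
            merged._insert(h)
        return merged


sketch = KMVSketch()
for _ in range(2):                  # every id appears twice
    for customer_id in range(50_000):
        sketch.update(customer_id)
print(round(sketch.estimate()))     # close to 50000, within a few percent
```

The merge method hints at why sketches suit table statistics: per-file or per-partition sketches can be combined without rescanning the data.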
Apache DataSketches HLL Sketch
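The HyperLogLog++ sketch is another NDV estimation algorithm with different accuracy/size tradeoffs. The published Puffin spec standardizes the Theta sketch for NDV, so confirm that your engine actually reads and writes this blob type before depending on it.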
blob type: "apache-datasketches-hll-v1"
Bloom Filter Index (Future / In Progress)
File-level bloom filters stored in Puffin would allow the engine to answer “does this data file contain a row where user_id = 12345?” with a handful of hash probes. A bloom filter never reports a false negative, so any file whose filter says “no” can be skipped safely; a “yes” only means the file might contain the value.
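Since this blob type is not finalized, the following is only a generic bloom filter in Python showing how per-file filters could drive file skipping. The BloomFilter class, its sizing parameters, and the file names are all hypothetical, and nothing here reflects an eventual Puffin encoding.

```python
import hashlib


class BloomFilter:
    """Fixed-size bloom filter using double hashing to derive its bit positions."""

    def __init__(self, num_bits=8192, num_hashes=5):
        self.num_bits = num_bits          # assumed to be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step avoids a zero stride
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


# Hypothetical data files and the user_id values each one holds.
files = {
    "data-00001.parquet": range(0, 500),
    "data-00002.parquet": range(12_000, 12_500),
}

filters = {}
for path, user_ids in files.items():
    bf = BloomFilter()
    for user_id in user_ids:
        bf.add(user_id)
    filters[path] = bf

# Plan a scan for user_id = 12345: keep only files whose filter cannot rule it out.
candidates = [path for path, bf in filters.items() if bf.might_contain(12345)]
print(candidates)
```

The file that really holds 12345 is always kept. Other files are dropped unless they hit a false positive, whose probability depends on the filter size and the number of values inserted.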
Puffin Files and the Snapshot
Puffin files are associated with a snapshot through the statistics list in the table metadata, where each entry records the snapshot-id it was computed for:
{
  "snapshot-id": 8027658604211071520,
  "statistics": [
    {
      "snapshot-id": 8027658604211071520,
      "statistics-path": "s3://bucket/warehouse/db/orders/metadata/snap-8027...puffin",
      "file-size-in-bytes": 16384,
      "file-footer-size-in-bytes": 512,
      "blob-metadata": [...]
    }
  ]
}
When a snapshot is expired, its associated Puffin files are also cleaned up.
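As a rough illustration of how a tool might use this metadata, the Python sketch below looks up the statistics entry for a snapshot and pulls NDV values out of its blob metadata. The metadata fragment is hypothetical: the blob-metadata contents (including the "ndv" property stored as a string) are filled in for the example and assumed rather than taken from a real table, and the truncated path is kept exactly as shown above.

```python
import json

# Hypothetical table-metadata fragment; only the fields used below are included.
metadata = json.loads("""
{
  "current-snapshot-id": 8027658604211071520,
  "statistics": [
    {
      "snapshot-id": 8027658604211071520,
      "statistics-path": "s3://bucket/warehouse/db/orders/metadata/snap-8027...puffin",
      "file-size-in-bytes": 16384,
      "file-footer-size-in-bytes": 512,
      "blob-metadata": [
        {
          "type": "apache-datasketches-theta-v1",
          "snapshot-id": 8027658604211071520,
          "sequence-number": 1,
          "fields": [1],
          "properties": {"ndv": "50000"}
        }
      ]
    }
  ]
}
""")


def statistics_for_snapshot(metadata, snapshot_id):
    """Return the statistics entry recorded for a snapshot, or None if there is none."""
    for entry in metadata.get("statistics", []):
        if entry["snapshot-id"] == snapshot_id:
            return entry
    return None


def ndv_by_fields(stats_entry):
    """Map each blob's field-ID tuple to its NDV estimate, when one is recorded."""
    result = {}
    for blob in stats_entry["blob-metadata"]:
        ndv = blob.get("properties", {}).get("ndv")
        if ndv is not None:
            result[tuple(blob["fields"])] = int(ndv)
    return result


entry = statistics_for_snapshot(metadata, metadata["current-snapshot-id"])
print(entry["statistics-path"])
print(ndv_by_fields(entry))  # {(1,): 50000}
```

Reading the estimate from blob metadata, when it is present, lets a planner skip opening the Puffin file at all.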
Generating Puffin Statistics
Puffin statistics must be explicitly computed — they are not generated during normal writes. In Spark:
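-- Compute NDV sketches for the table and store them in a Puffin file.
-- Spark's built-in ANALYZE TABLE does not write Puffin files for Iceberg tables;
-- recent Iceberg releases expose a stored procedure instead ("my_catalog" is a placeholder).
CALL my_catalog.system.compute_table_stats(table => 'db.orders');
-- Verify: the new file should be listed under "statistics" in the table's current metadata JSON.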
In Dremio Cloud and Enterprise, statistics collection can be triggered via the UI or API and is used by the Intelligent Query Engine’s cost-based optimizer.
Puffin and Query Planning
Engines that support Puffin statistics use them in their query planners:
- Join reordering: Use NDV estimates to order joins by estimated output size (smallest first).
- Aggregate estimation: Estimate GROUP BY output cardinality for memory allocation.
- Partition elimination improvements: Use cardinality info to refine file pruning decisions beyond min/max bounds.
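The join-reordering idea can be illustrated with the classic equi-join estimate |R ⋈ S| ≈ |R| · |S| / max(NDV_R, NDV_S). The Python sketch below applies it to made-up table sizes and NDV values; the table names, numbers, and the greedy "smallest output first" rule are assumptions for illustration, and real optimizers use considerably richer cost models.

```python
def estimate_join_rows(left_rows, left_ndv, right_rows, right_ndv):
    """Textbook equi-join size estimate, assuming uniform keys and key containment."""
    return left_rows * right_rows / max(left_ndv, right_ndv)


# Hypothetical fact table and its NDV estimates for two join keys.
orders_rows = 10_000_000
orders_ndv = {"customer_id": 50_000, "promo_id": 200}

# Hypothetical join partners: (join key, row count, NDV of the key on that side).
partners = {
    "customers": ("customer_id", 50_000, 50_000),
    "promotions_filtered": ("promo_id", 20, 20),
}

estimates = {
    name: estimate_join_rows(orders_rows, orders_ndv[key], rows, ndv)
    for name, (key, rows, ndv) in partners.items()
}

# Greedy ordering: perform the join with the smallest estimated output first.
join_order = sorted(estimates, key=estimates.get)

print(estimates)   # {'customers': 10000000.0, 'promotions_filtered': 1000000.0}
print(join_order)  # ['promotions_filtered', 'customers']
```

Without the NDV of promo_id, the planner has no basis for knowing that joining the filtered promotions first shrinks the intermediate result tenfold.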
Puffin is an evolving area of the Iceberg spec — expect bloom filter support, histogram statistics, and multi-column statistics to emerge as the ecosystem matures.