Operations & Optimization Last updated: May 14, 2026

Iceberg Bloom Filters

Bloom filter indexes in Apache Iceberg enable probabilistic row-level skipping by allowing query engines to determine with near certainty whether a specific value exists in a data file before reading it, dramatically improving point-lookup query performance.


Bloom Filters in Apache Iceberg

A bloom filter is a probabilistic data structure that answers the question “is this value in this set?” with two possible results:

  - “Definitely not” — the value is guaranteed absent (bloom filters never produce false negatives).
  - “Possibly yes” — the value may be present, subject to a tunable false positive probability.

In the context of Apache Iceberg, bloom filters are used as file-level indexes that enable query engines to skip data files that definitely don’t contain a queried value — without reading the file. This is particularly powerful for point lookups (queries with exact equality predicates like WHERE user_id = 12345) where min/max statistics are useless (every file’s min/max range might include 12345).
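The skip/read semantics can be sketched with a toy bloom filter. This is illustrative only — Parquet actually uses split-block bloom filters keyed with xxHash, not this salted-SHA-256 scheme — but the contract is the same: a negative answer is definitive, a positive answer means “must read the file.”

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash functions setting bits in an m-bit array."""
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # integer used as a bit array

    def _positions(self, value):
        # Derive k bit positions from salted digests of the value.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False -> value is DEFINITELY absent (safe to skip the file)
        # True  -> value is POSSIBLY present (must read the file)
        return all(self.bits >> pos & 1 for pos in self._positions(value))

bf = BloomFilter()
for uid in [12345, 67890, 11111]:
    bf.add(uid)

print(bf.might_contain(12345))  # True — present values are never missed
print(bf.might_contain(99999))  # almost certainly False — file can be skipped
```

A query engine applies the same logic per file (or per row group): ask the filter about the predicate's literal, and skip whenever the answer is a definite no.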

Bloom Filters vs. Min/Max Statistics

Min/max statistics and bloom filters serve complementary skipping roles:

Scenario                                           Min/Max             Bloom Filter
Range queries (WHERE total BETWEEN 100 AND 200)    Excellent           Poor
Point lookups (WHERE user_id = 12345)              Poor (can’t skip)   Excellent
Low-cardinality columns                            Good                Overkill
High-cardinality IDs (UUIDs, user IDs)             Poor                Excellent

For high-cardinality ID columns in well-clustered tables, bloom filters can achieve 99%+ skip rates for point lookups.
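The gap between the two techniques is easy to demonstrate with hypothetical per-file stats (file names and ID sets below are made up): when every file's min/max range spans the queried ID, range pruning keeps everything, while an exact-membership test — which a bloom filter approximates, minus a small false-positive rate — prunes down to the files that actually hold the value.

```python
# Hypothetical file stats: IDs are spread so every [min, max] range
# covers the lookup key, which is typical for unclustered ID columns.
files = {
    "file_a.parquet": {"min": 1, "max": 50000, "ids": {12345, 20000, 31000}},
    "file_b.parquet": {"min": 3, "max": 49000, "ids": {500, 48000}},
    "file_c.parquet": {"min": 7, "max": 49999, "ids": {777, 8888}},
}
target = 12345

# Min/max pruning: keep any file whose range could contain the key.
minmax_survivors = [f for f, s in files.items() if s["min"] <= target <= s["max"]]
print(minmax_survivors)  # all 3 files — min/max skips nothing

# Membership pruning (what a bloom filter approximates): keep only
# files whose filter says the key is possibly present.
bloom_survivors = [f for f, s in files.items() if target in s["ids"]]
print(bloom_survivors)  # ['file_a.parquet']
```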

Bloom Filters in Parquet Files

Apache Parquet has native bloom filter support at the row group level. When a Parquet file is written with bloom filters enabled, each row group carries a bloom filter structure for the specified columns, referenced from metadata in the file footer.

Enabling in Spark:

# Enable Parquet bloom filters per column via Parquet writer options,
# which Spark forwards to the underlying Parquet writer
(df.write
    .option("parquet.bloom.filter.enabled#user_id", "true")       # build a filter for user_id
    .option("parquet.bloom.filter.expected.ndv#user_id", "1000000")  # expected distinct values
    .option("parquet.bloom.filter.fpp#user_id", "0.05")           # 5% false positive rate
    .parquet("s3://bucket/orders/"))

With bloom filters written to Parquet, a reading engine evaluates equality predicates against each row group’s filter and skips any row group whose filter reports the value as definitely absent — only row groups that might contain the value are decoded.

Bloom Filters in Iceberg Puffin Files

Beyond row-group-level Parquet bloom filters, Iceberg’s Puffin format supports table-level bloom filter indexes stored as Puffin blobs. These are file-level (not row-group-level) bloom filters that allow skipping entire data files before opening them.

Puffin bloom filters are the Iceberg equivalent of file-skip indexes in databases — they enable the query planner to eliminate files at the manifest scan stage, before any file I/O.

Note: Puffin-based bloom filters are under active development in the Iceberg specification as of 2025. Parquet-level bloom filters are broadly supported today.

Sizing Bloom Filters

Bloom filter accuracy vs. size is a tunable tradeoff: a lower false positive probability (FPP) produces a larger filter that skips more aggressively, while a higher FPP produces a smaller filter that lets more non-matching files or row groups through.

Common FPP settings:

  - 0.01 (1%): larger filters, near-maximal skipping — suited to heavily queried point-lookup columns.
  - 0.05 (5%): a middle ground for general use.
  - 0.10 (10%): smallest footprint, acceptable when lookups on the column are occasional.

For most analytical workloads, 1–5% FPP provides the right balance between accuracy (skip rate) and bloom filter overhead in the Parquet footer.
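The size cost of a given FPP can be estimated with the standard optimal-bloom-filter formula m = -n·ln(p) / (ln 2)², where n is the number of distinct values and p the target FPP (Parquet's split-block filters size slightly differently, but the scaling is the same). A quick sketch, with n = 1 million distinct IDs per file assumed for illustration:

```python
import math

def bloom_bits(n, fpp):
    """Optimal bloom filter size in bits for n distinct values at a
    target false positive probability: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n * math.log(fpp) / math.log(2) ** 2)

n = 1_000_000  # assumed distinct user_ids per file
for fpp in (0.01, 0.05, 0.10):
    size_mb = bloom_bits(n, fpp) / 8 / 1_000_000
    print(f"FPP {fpp:>4}: {bloom_bits(n, fpp) / n:.1f} bits/value, ~{size_mb:.2f} MB")
```

At 1% FPP the filter needs roughly 9.6 bits per distinct value (about 1.2 MB for a million IDs); relaxing to 5% cuts that to about 6.2 bits per value — which is why 1–5% is the usual operating range.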

When to Use Bloom Filters

Bloom filters are most valuable for:

  1. High-cardinality ID columns: user_id, order_id, device_id, session_id — min/max is useless, bloom filters are ideal.
  2. Hash/UUID columns: Even min/max doesn’t help for random UUIDs.
  3. Low-selectivity point lookups on large tables: Where a typical query filters to <0.01% of rows.
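Back-of-envelope math shows why these cases pay off so well. Under assumed numbers — a table of 1,000 data files where the looked-up key truly lives in one file, and filters built at 1% FPP — the expected I/O is:

```python
# Assumed workload: 1,000 files, key present in exactly 1, bloom FPP of 1%.
total_files = 1_000
true_hits = 1   # files that actually contain the key (assumption)
fpp = 0.01

# Bloom filters never miss a true hit; each absent file slips through
# (a false positive) with probability fpp.
expected_reads = true_hits + fpp * (total_files - true_hits)
skip_rate = 1 - expected_reads / total_files
print(f"expected files read: ~{expected_reads:.1f}")  # ~11 of 1,000
print(f"expected skip rate:  {skip_rate:.1%}")        # ~98.9%
```

Roughly 11 files read instead of 1,000 — consistent with the 99%-class skip rates cited above for well-clustered high-cardinality columns.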

Bloom filters add overhead to writes (computing the filter) and add size to file footers. Don’t enable them for:

Bloom Filters and Iceberg Compaction

Bloom filters are written when a data file is written. Files created before bloom filter properties were enabled on the table have no filters, so they can’t benefit from bloom filter skipping until they are rewritten.

For tables where bloom filters are critical for performance, include bloom filter generation in your compaction strategy:

-- Enable bloom filters via table write properties, then compact so the
-- rewritten output files are produced with filters
ALTER TABLE db.orders SET TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.user_id' = 'true',
  'write.parquet.bloom-filter-fpp.column.user_id' = '0.05'
);

CALL system.rewrite_data_files(table => 'db.orders');
