Iceberg Table Clustering
Table clustering in Apache Iceberg refers to organizing data within files so that rows with similar values in queried columns are co-located in the same data files. Well-clustered tables provide dramatically better query performance through tighter column statistics and more effective data skipping.
Clustering is the complement to partitioning: while partitioning eliminates entire directories of files based on coarse-grained partition predicates, clustering eliminates individual files within partitions based on fine-grained column value ranges.
Why Clustering Matters
Without clustering, data within a partition is written in arrival order (random/insertion order). Consider a partition containing 1,000 data files, each with customer_id values randomly spread across all customers. A query for WHERE customer_id = 12345 must scan all 1,000 files because every file’s customer_id range includes 12345.
With clustering on customer_id:
- All rows for customer_id = 12345 are in the same few files.
- The min/max statistics for customer_id in most files exclude 12345 entirely.
- The engine skips 990+ files and reads only the few that contain customer 12345.
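This skip decision can be observed directly in the files metadata table. A hedged sketch, assuming a recent Iceberg version that exposes the readable_metrics column (the customer_id column name is from the example above): only files whose bounds straddle 12345 need to be read.

```sql
-- Sketch: list the data files an engine could NOT skip for customer_id = 12345
-- (readable_metrics requires a recent Iceberg release; names are illustrative)
SELECT
  file_path,
  readable_metrics.customer_id.lower_bound AS cust_min,
  readable_metrics.customer_id.upper_bound AS cust_max
FROM db.orders.files
WHERE readable_metrics.customer_id.lower_bound <= 12345
  AND readable_metrics.customer_id.upper_bound >= 12345;
```

On a well-clustered table this query returns only a handful of rows; on an unclustered table it returns nearly every file.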
Clustering is Applied via Compaction
Clustering is not a write-time operation — it is applied retroactively during compaction (data file rewriting). The compaction job sorts data by the clustering key and writes new files, each containing a contiguous range of key values.
-- Sort-based compaction in Spark: cluster by customer_id
CALL system.rewrite_data_files(
table => 'db.orders',
strategy => 'sort',
sort_order => 'customer_id ASC NULLS LAST',
options => map(
'target-file-size-bytes', '268435456' -- 256MB target
)
);
Sort Orders in Iceberg
Iceberg supports defining a sort order for a table — the column(s) by which data should be physically ordered when written or compacted. The sort order is stored in table metadata and used by compaction jobs:
-- Define a sort order for the table
ALTER TABLE db.orders WRITE ORDERED BY customer_id, order_date;
-- View the current sort order (stored in table metadata; in Spark it
-- appears among the properties in the extended table description)
DESCRIBE TABLE EXTENDED db.orders;
When a sort order is defined:
- Compaction jobs use it automatically without specifying sort_order in the CALL.
- Some engines (like Spark) can use it as a hint during writes to produce pre-sorted output.
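With a table-level sort order defined, the compaction call from earlier simplifies: the sort_order argument can be omitted and the table's own order is used. A minimal sketch:

```sql
-- Re-cluster using the table's declared sort order (customer_id, order_date)
CALL system.rewrite_data_files(
  table => 'db.orders',
  strategy => 'sort'
);
```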
Clustering Strategies
Single-Column Linear Sort
Best for tables with one dominant filter column:
sort_order => 'customer_id ASC'
Excellent for WHERE customer_id = X, poor for WHERE product_id = Y.
Multi-Column Linear Sort
For tables with a primary + secondary filter column:
sort_order => 'region ASC, customer_id ASC'
Good for WHERE region = 'us-east' AND customer_id = X, but poor for queries on customer_id without region.
Z-Order (Multi-Dimensional Clustering)
For tables with multiple filter columns used in various combinations:
sort_order => 'zorder(customer_id, product_id, order_date)'
Good for any combination of the listed columns. Best all-purpose approach for multi-predicate workloads. See Z-Order Clustering for details.
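As a sketch, the z-order variant plugs into the same rewrite_data_files procedure shown earlier, passing the zorder(...) expression as the sort order:

```sql
-- Z-order compaction: interleave customer_id, product_id, order_date
CALL system.rewrite_data_files(
  table => 'db.orders',
  strategy => 'sort',
  sort_order => 'zorder(customer_id, product_id, order_date)'
);
```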
Partitioning vs. Clustering
| Aspect | Partitioning | Clustering |
|---|---|---|
| Granularity | Coarse (eliminates whole partitions) | Fine (eliminates individual files) |
| Applied at | Write time (automatic) | Compaction time (retroactive) |
| Storage impact | Directory structure | File content ordering |
| Query requirement | Filter on the source column (transform applied automatically via hidden partitioning) | Filter on the clustered column (stats-based pruning) |
| Effectiveness | Eliminates most files for partition predicates | Reduces files for value-range predicates |
The optimal table layout uses both: partition by a coarse time dimension (day, month) to eliminate most partitions for time-range queries, then cluster by frequently-filtered columns within each partition to eliminate most files within the remaining partitions.
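A sketch of this combined layout in Spark SQL (table and column names follow the running example; the schema is illustrative): partition by day, then declare a write order on the hot filter column.

```sql
-- Coarse: daily partitions via the hidden days() transform
CREATE TABLE db.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  order_date  TIMESTAMP,
  amount      DECIMAL(10, 2)
) USING iceberg
PARTITIONED BY (days(order_date));

-- Fine: cluster within each partition by the dominant filter column
ALTER TABLE db.orders WRITE ORDERED BY customer_id;
```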
Clustering and Column Statistics Quality
The effectiveness of clustering is directly tied to how tight the column statistics (min/max bounds) become after clustering:
- Perfect clustering: Every file contains exactly one customer’s rows → min==max for customer_id in each file → 100% data skipping for any specific customer query.
- Good clustering: Files contain a narrow range of customer IDs → tight bounds → high skip rate.
- No clustering: Files contain all customer IDs → wide bounds → no meaningful skipping.
Monitoring the width of column min/max bounds via SELECT * FROM db.orders.files gives insight into clustering quality.
Maintenance Schedule
Clustering must be reapplied periodically as new data arrives and disrupts the clustered layout:
- Streaming tables: Compact hourly or daily to re-cluster newly appended data.
- Batch-loaded tables: Compact after each batch load.
- Low-write tables: Monthly or quarterly clustering refresh.
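For large tables, periodic re-clustering need not rewrite everything: the rewrite_data_files procedure accepts a where argument to restrict the job to recently modified data. A hedged sketch (the date literal is illustrative):

```sql
-- Re-cluster only recent partitions instead of the whole table
CALL system.rewrite_data_files(
  table => 'db.orders',
  strategy => 'sort',
  where => 'order_date >= "2024-06-01"'
);
```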