Last updated: May 14, 2026

Iceberg Table Compaction

Iceberg compaction is the maintenance process of merging small data files into optimally sized files, applying pending delete files, and rewriting manifests to maintain query performance and reduce metadata overhead over time.



Compaction is the most critical ongoing maintenance operation for Apache Iceberg tables. Over time, high-frequency streaming writes and micro-batch jobs create many small data files, while UPDATE and DELETE operations accumulate delete files. Without compaction, query performance degrades and storage costs increase.

Compaction is the process of:

  1. Rewriting small data files into optimally sized files (typically 128MB–512MB).
  2. Applying accumulated delete files into the rewritten data files, producing clean files with no pending deletes.
  3. Rewriting manifests to reduce manifest file count and metadata overhead.

After compaction, the same data exists in fewer, larger, cleaner files — dramatically improving both query performance and metadata efficiency.
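A quick way to see whether a table needs compaction is to query Iceberg's files metadata table for the file count and average file size. The sketch below uses Spark SQL; db.orders is a placeholder table name.

-- Spark: check how many data files the table has and how small they are
SELECT count(*) AS data_file_count,
       avg(file_size_in_bytes) AS avg_file_bytes
FROM db.orders.files;

If the average file size is far below the target (for example, a few megabytes against a 256MB target), the table is a compaction candidate.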

Why Compaction is Necessary

The Small File Problem

Every Iceberg write transaction produces at least one new data file. Streaming pipelines writing micro-batches every 30 seconds produce 2,880 files per day per partition. Even a moderately partitioned table accumulates tens of thousands of small files quickly.

Small files harm performance because:

  * Query planning must track, prune, and open far more files, which inflates manifest size and planning time.
  * Every file scanned carries fixed per-file overhead (opening the file, reading its footer, seeking), so I/O setup time grows relative to useful reads.
  * Columnar formats such as Parquet compress and encode data less effectively in small files, reducing scan throughput.

Delete File Accumulation

UPDATE and DELETE operations (in Merge-on-Read mode) write small delete files that are applied at read time. As delete files accumulate, every read must apply more and more deletes — a join-like operation on top of every scan. Without compaction, read performance degrades in proportion to the number of accumulated delete files.
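To gauge how many delete files a table has accumulated, one option is Iceberg's delete_files metadata table, sketched below in Spark SQL (db.orders is again a placeholder).

-- Spark: count pending delete files and the row positions they cover
SELECT count(*) AS delete_file_count,
       sum(record_count) AS deleted_row_count
FROM db.orders.delete_files;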

Compaction Operations

Data File Rewriting

The core compaction operation combines multiple small files into larger ones:

-- Spark: rewrite small data files
CALL system.rewrite_data_files(
  table => 'db.orders',
  strategy => 'sort',
  sort_order => 'zorder(customer_id, order_date)',
  options => map(
    'min-file-size-bytes', '67108864',   -- 64MB
    'max-file-size-bytes', '536870912',  -- 512MB
    'target-file-size-bytes', '268435456' -- 256MB target
  )
);

Options include:

  * target-file-size-bytes: the desired output file size (defaults to the table's write.target-file-size-bytes property).
  * min-file-size-bytes / max-file-size-bytes: files smaller or larger than these thresholds become rewrite candidates.
  * min-input-files: the minimum number of qualifying files in a group before a rewrite is triggered.
  * partial-progress.enabled: commit rewritten file groups incrementally instead of in one large commit.
  * rewrite-all: force rewriting of every file regardless of size thresholds.
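On large tables, compaction is often scoped to recently written partitions instead of the whole table. A minimal sketch using the procedure's where argument (the filter value is illustrative):

-- Spark: compact only recent partitions
CALL system.rewrite_data_files(
  table => 'db.orders',
  where => "order_date >= DATE '2026-05-01'"
);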

Manifest Rewriting

Compaction also includes rewriting manifests to reduce manifest count and update partition statistics:

-- Spark: rewrite manifests for reduced metadata overhead
CALL system.rewrite_manifests(
  table => 'db.orders',
  use_caching => true
);

Delete File Application (Positional Delete Removal)

When rewrite_data_files rewrites a group of data files (with either the bin-pack or sort strategy), it applies all pending positional and equality delete files to those files, so the rewritten output carries no outstanding deletes.
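Compaction can also be targeted specifically at files with many pending deletes via the delete-file-threshold option, which rewrites any data file referenced by at least that many delete files even if its size is already acceptable. A sketch, with an illustrative threshold:

-- Spark: rewrite any data file that has 2 or more delete files attached
CALL system.rewrite_data_files(
  table => 'db.orders',
  options => map('delete-file-threshold', '2')
);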

Compaction in Dremio

Dremio provides automatic table optimization as part of its Agentic Lakehouse platform. Dremio’s OPTIMIZE TABLE command handles compaction intelligently:

-- Dremio: optimize an Iceberg table
OPTIMIZE TABLE db.orders;

-- With options
OPTIMIZE TABLE db.orders
REWRITE DATA USING BIN_PACK
(TARGET_FILE_SIZE_MB = 256, MIN_FILE_SIZE_MB = 64, MAX_FILE_SIZE_MB = 512);

Dremio can also be configured for automatic background optimization, eliminating the need for manual maintenance schedules.

Compaction Strategies

Strategy  | Description                      | When to Use
Bin-pack  | Combine files to hit target size | General purpose, fastest
Sort      | Sort by columns + combine        | When query patterns filter on sortable columns
Z-Order   | Multi-column sort (Z-curve)      | Multi-column filter predicates
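For reference, a plain bin-pack compaction in Spark looks like the sketch below ('binpack' is the strategy name accepted by the rewrite_data_files procedure, and it is also the default when no strategy is given):

-- Spark: size-based compaction without sorting
CALL system.rewrite_data_files(
  table => 'db.orders',
  strategy => 'binpack'
);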

Scheduling Compaction

For production tables, compaction should run on a regular schedule. Common approaches:

  * Scheduling the Spark rewrite procedures (or Dremio's OPTIMIZE TABLE) from an orchestrator such as Airflow or a cron job.
  * Triggering compaction right after large batch loads or at the end of a streaming checkpoint window.
  * Delegating to a managed service, such as Dremio's automatic background optimization, so no manual schedule is needed.

The right compaction frequency depends on write volume and query SLAs. Tables with very high write frequency may need hourly compaction; batch-loaded tables may only need daily or weekly compaction.
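As an illustration, a nightly job kicked off by an orchestrator might run something like the following, with partial progress enabled so a long-running rewrite commits in increments rather than one large commit (the option values are illustrative):

-- Spark: scheduled compaction with incremental commits
CALL system.rewrite_data_files(
  table => 'db.orders',
  options => map(
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '10'
  )
);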
