Apache ORC in Apache Iceberg
Apache ORC (Optimized Row Columnar) is a columnar storage format developed by Hortonworks (now part of Cloudera) and now maintained as an Apache top-level project. Apache Iceberg supports ORC as one of its three natively supported data file formats (alongside Parquet and Avro), giving teams flexibility in their choice of storage format.
While Parquet is the dominant format for Iceberg tables in most modern deployments, ORC remains relevant for organizations migrating from Hive ORC workloads and certain specific use cases.
ORC vs. Parquet in Iceberg
Both ORC and Parquet are columnar formats with similar goals. The differences matter primarily in legacy contexts:
| Feature | ORC | Parquet |
|---|---|---|
| Origin | Hortonworks (Hive ecosystem) | Twitter + Cloudera (broad ecosystem) |
| Compression | Zlib, Snappy, LZ4, Zstd | Gzip, Snappy, LZ4, Zstd |
| Column statistics | Min/max, null count, sum (numeric columns) | Min/max, null count |
| Native ACID | Yes (Hive ACID relied on ORC) | No (Iceberg handles ACID) |
| Index structures | Row index, bloom filters | Row group statistics, bloom filters |
| Nested types | Good | Excellent (Parquet Dremel encoding) |
| Ecosystem support | Strong in Hive/Hadoop | Broader (Spark, Dremio, Trino, Arrow) |
| Iceberg adoption | Minority | Dominant |
When ORC Appears in Iceberg
1. Hive ORC Migration (Most Common)
Organizations migrating from Hive tables where ORC was the default (Hive ACID, introduced in Hive 0.14, requires ORC for transactional tables) will end up with Iceberg tables whose data files are still ORC after an in-place migration:
```sql
-- After migrate, the table is Iceberg but the data files are still ORC
CALL spark_catalog.system.migrate('db.hive_orc_table');
-- The table is now Iceberg with ORC data files (valid; Iceberg reads ORC natively)
```
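If you want to trial the migration without touching the source table, Iceberg's `snapshot` procedure creates an independent Iceberg table that reads the same ORC files (the `db.orc_trial` name below is just an example):

```sql
-- Create a separate Iceberg table over the existing Hive ORC data
CALL spark_catalog.system.snapshot('db.hive_orc_table', 'db.orc_trial');
-- Writes to db.orc_trial go to new files; the original Hive table is untouched
```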
Post-migration, you can leave the ORC files in place (Iceberg reads them) or convert to Parquet via compaction:
```sql
-- Convert ORC files to Parquet during compaction
ALTER TABLE db.orders SET TBLPROPERTIES ('write.format.default' = 'parquet');
CALL system.rewrite_data_files(
  table => 'db.orders',
  options => map('rewrite-all', 'true')  -- rewrite every file, not just undersized ones
);
-- Rewritten files are written as Parquet, following the new table default
```
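Note that `rewrite_data_files` only commits a new snapshot; the replaced ORC files stay on disk for time travel until their snapshots are expired. A cleanup sketch (the timestamp is an example cutoff, not a recommendation):

```sql
-- Physically delete files referenced only by snapshots older than the cutoff
CALL system.expire_snapshots(
  table      => 'db.orders',
  older_than => TIMESTAMP '2024-01-01 00:00:00'
);
```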
2. Specific ORC Advantages
ORC’s richer built-in statistics (per-column sums at the stripe and file level) can benefit query planners that are tuned for ORC. In practice, Iceberg’s manifest-level statistics supersede most of these advantages.
3. Hive Compatibility
If the same Iceberg table must remain readable by Apache Hive (which has excellent ORC native support), keeping ORC format ensures maximum Hive compatibility.
Writing ORC Iceberg Tables
```sql
-- Create a new Iceberg table with ORC format
CREATE TABLE db.events (
  event_id BIGINT,
  user_id  BIGINT,
  event_ts TIMESTAMP,
  payload  STRING
) USING iceberg
TBLPROPERTIES ('write.format.default' = 'orc');
```
There is no SQL hint for a per-write format override; in Spark, an individual write can override the table default through Iceberg's `write-format` write option:

```python
# Per-write ORC override via Spark's DataFrameWriterV2
df.writeTo("db.orders").option("write-format", "orc").append()
```
ORC Compression in Iceberg
ORC supports multiple compression codecs:
```sql
ALTER TABLE db.events SET TBLPROPERTIES (
  'write.format.default'           = 'orc',
  'write.orc.compression-codec'    = 'zstd',  -- or: zlib, snappy, lz4, none
  'write.orc.compression-strategy' = 'speed'  -- or: compression
);
```
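Beyond compression, Iceberg exposes a few other ORC writer knobs. A sketch using two of them (property names per the Iceberg write-properties documentation; verify against your Iceberg version, and the column choice here is illustrative):

```sql
ALTER TABLE db.events SET TBLPROPERTIES (
  'write.orc.stripe-size-bytes'    = '67108864', -- 64 MB stripes (the default)
  'write.orc.bloom.filter.columns' = 'event_id'  -- write ORC bloom filters for point lookups
);
```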
Recommendation: ORC vs. Parquet for New Tables
For new Iceberg table creation:
- Default to Parquet: Better ecosystem support, broader tool compatibility, better nested type encoding, lower overhead for most analytics workloads.
- Use ORC if: You’re migrating from Hive ORC (leave in place for initial migration, convert during compaction), or your query engine has documented superior performance with ORC.
For mixed-format tables (some ORC files, some Parquet after migration), Iceberg handles the format mix transparently — each data file entry in the manifest records its own format, and the reader handles format detection per file.
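You can inspect the per-file format mix directly through the table's `files` metadata table (shown here in Spark SQL):

```sql
-- Count data files by format after a partial ORC-to-Parquet conversion
SELECT file_format, count(*) AS files, sum(file_size_in_bytes) AS total_bytes
FROM db.orders.files
GROUP BY file_format;
```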