Apache Hive and Apache Iceberg
Apache Hive — historically the dominant SQL query engine for Hadoop-based data lakes — has evolved to support Apache Iceberg as a first-class table format. Hive 4.x and later treat Iceberg tables as native objects, enabling full Hive SQL DML operations against Iceberg tables while maintaining backward compatibility with legacy Hive SerDe tables.
For organizations with large Hive investments (Hive warehouses, Hive SQL workflows, HiveServer2 infrastructure), native Iceberg support provides an upgrade path to modern lakehouse capabilities without immediately replacing all Hive infrastructure.
Hive 4.x Iceberg Support
Hive 4.0 (GA in 2024) introduced comprehensive Iceberg support:
- Full DDL: CREATE TABLE, ALTER TABLE, DROP TABLE.
- Full DML: INSERT, INSERT OVERWRITE, UPDATE, DELETE, MERGE INTO.
- Time travel: SELECT with snapshot ID or timestamp.
- Metadata tables: access to $snapshots, $history, $files.
- Schema evolution: ADD COLUMN, RENAME COLUMN, DROP COLUMN.
- Partition evolution: ALTER TABLE SET PARTITION SPEC.
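Both kinds of evolution are ordinary Hive DDL statements. A brief sketch against the orders table used throughout this section (the discount column is a hypothetical addition for illustration):
-- Schema evolution: metadata-only, no data files are rewritten
ALTER TABLE orders ADD COLUMNS (discount DOUBLE);
ALTER TABLE orders CHANGE COLUMN total order_total DOUBLE; -- rename a column
-- Partition evolution: the new spec applies only to newly written data
ALTER TABLE orders SET PARTITION SPEC (day(order_date));
Existing data files keep their original partition layout; Iceberg tracks both specs and plans scans accordingly.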
Creating Iceberg Tables in Hive
-- Create an Iceberg table via Hive SQL
CREATE TABLE orders (
  order_id BIGINT,
  customer_id BIGINT,
  total DOUBLE,
  order_date DATE,
  status STRING
)
PARTITIONED BY SPEC (month(order_date)) -- Iceberg partition transform
STORED BY ICEBERG -- Iceberg storage handler
STORED AS PARQUET
LOCATION 'hdfs:///warehouse/orders/'
TBLPROPERTIES (
  'format-version' = '2' -- always use v2
);
Full DML in Hive
-- INSERT
INSERT INTO orders VALUES (1001, 42, 150.00, DATE '2026-05-14', 'pending');
-- UPDATE (row-level, merge-on-read)
UPDATE orders SET status = 'shipped' WHERE order_id = 1001;
-- DELETE
DELETE FROM orders WHERE order_date < DATE '2020-01-01';
-- MERGE INTO (upsert)
MERGE INTO orders AS target
USING updates AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET status = source.status, total = source.total
WHEN NOT MATCHED THEN INSERT VALUES (
  source.order_id, source.customer_id, source.total,
  source.order_date, source.status
);
Time Travel in Hive
-- Query as of a specific snapshot ID
SELECT * FROM orders FOR SYSTEM_VERSION AS OF 8027658604211071520;
-- Query as of a timestamp
SELECT * FROM orders FOR SYSTEM_TIME AS OF '2026-05-14 10:00:00';
-- View snapshot history (Hive uses db.table.metadata_table dot syntax)
SELECT * FROM default.orders.snapshots;
Hive Metastore as Iceberg Catalog
The Hive Metastore (HMS) serves as the Iceberg catalog for Hive-managed Iceberg tables. HMS stores the mapping from table name to current Iceberg metadata file location, updated on each commit.
This is the same HMS used by Spark, Flink, and Trino for Iceberg tables when configured with the hive catalog type — enabling multi-engine access to Hive-registered Iceberg tables.
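As an illustration, a Spark session can be pointed at the same HMS-backed catalog with configuration roughly like the following (the catalog name hms_cat and the metastore URI are placeholders, not fixed names):
spark.sql.catalog.hms_cat       org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hms_cat.type  hive
spark.sql.catalog.hms_cat.uri   thrift://metastore-host:9083
With this in place, a Spark query such as SELECT * FROM hms_cat.default.orders reads the same current metadata pointer that Hive commits through HMS; Flink and Trino expose equivalent hive-type catalog settings.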
Hive + Iceberg in the Migration Context
For Hive shop migrations to a modern lakehouse:
Step 1: Migrate existing Hive tables to Iceberg format (in-place, no data copy).
Step 2: Continue using Hive SQL for existing workflows; they now run against Iceberg.
Step 3: Gradually add Spark, Dremio, or Trino for workloads where Hive is slow.
Step 4: Optionally migrate the catalog from HMS to Apache Polaris when ready.
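The in-place migration in Step 1 is typically done by switching a legacy table's storage handler, which generates Iceberg metadata over the existing data files without rewriting them. A sketch, assuming a legacy Hive table named legacy_orders:
-- Convert a legacy Hive table to Iceberg in place (metadata-only operation)
ALTER TABLE legacy_orders SET TBLPROPERTIES (
  'storage_handler' = 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
);
After conversion, the table behaves as a native Iceberg table for all subsequent DDL and DML.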
This incremental approach allows organizations to adopt Iceberg’s benefits without a “rip and replace” migration.
Hive vs. Modern Engines for Iceberg
| Aspect | Hive 4.x | Spark | Dremio |
|---|---|---|---|
| Iceberg support | Full (v4.x+) | Full | Full |
| SQL compatibility | HiveQL | Spark SQL | ANSI SQL / DremioSQL |
| Interactive queries | Slow (batch-oriented) | Moderate | Excellent (sub-second) |
| AI integration | No | No | Yes (AI Semantic Layer) |
| Best for | Legacy migration, batch ETL | Large-scale ETL/ML | Analytics, AI, BI |
Hive’s role in the modern lakehouse is as a migration bridge — enabling organizations to adopt Iceberg incrementally without abandoning their Hive investments, while transitioning analytics workloads to faster, more capable engines over time.