Hive Metastore Catalog for Apache Iceberg
The Hive Metastore (HMS) is the original and most widely deployed metadata catalog in the Hadoop ecosystem, and it serves as one of the earliest (and still widely used) catalog implementations for Apache Iceberg. It provides a familiar entry point for teams migrating from Hive tables to Iceberg, since HMS is already deployed in most enterprise data platforms.
What the Hive Metastore Does
At its core, the Hive Metastore is a relational database (backed by MySQL, PostgreSQL, Derby, or Oracle) that stores metadata about tables, databases, partitions, and column definitions. For traditional Hive tables, HMS stores the full schema, partition values, and storage location in its database.
For Iceberg tables, HMS plays a much lighter role: it stores the table location and a special metadata_location table property that points to the current Iceberg metadata file in object storage. Iceberg's Hive catalog implementation writes and updates this property on every table commit.
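Conceptually, the HMS entry for an Iceberg table reduces to a storage location plus a pointer property. A minimal sketch of that entry as a Python dict (field names are simplified for illustration; the real entry is a Thrift Table struct):

```python
# Illustrative model of an HMS table entry for an Iceberg table.
# Field names are simplified; not the actual Thrift schema.
hms_entry = {
    "db_name": "analytics",
    "table_name": "orders",
    "location": "s3://bucket/warehouse/analytics/orders",
    "parameters": {
        # Marks the entry as Iceberg so engines skip Hive's partition handling
        "table_type": "ICEBERG",
        # Pointer to the current metadata file; rewritten on every commit
        "metadata_location": "s3://bucket/warehouse/analytics/orders/"
                             "metadata/00003-abc.metadata.json",
    },
}

def current_metadata(entry):
    """Return the Iceberg metadata file a reader should start from."""
    return entry["parameters"]["metadata_location"]

print(current_metadata(hms_entry))
```

Everything else about the table (schema, snapshots, partitioning) lives in the Iceberg metadata file that this pointer references, not in HMS itself.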
HMS as an Iceberg Catalog: How It Works
When using HMS as an Iceberg catalog:
- HMS stores the Iceberg table’s metadata file location in its database as a table property.
- On every Iceberg commit, the catalog implementation atomically updates metadata_location in HMS using a database transaction.
- Readers query HMS for the table's metadata_location, then traverse the Iceberg metadata hierarchy in object storage.
HMS supplies the atomic commit primitive through its backing database's transactions: the compare-and-swap semantics that Iceberg relies on for its ACID guarantees.
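The commit protocol above can be sketched as a compare-and-swap on the stored pointer. This is a simplified single-process model (a hypothetical FakeHMS class standing in for the metastore's transactional update, not the actual HiveTableOperations code):

```python
import threading

class FakeHMS:
    """Toy stand-in for HMS: one table property, updated transactionally."""
    def __init__(self, initial_location):
        self._lock = threading.Lock()  # models the database transaction
        self.metadata_location = initial_location

    def compare_and_swap(self, expected, new):
        """Replace the pointer only if it still equals `expected`."""
        with self._lock:
            if self.metadata_location != expected:
                return False  # another writer committed first; caller retries
            self.metadata_location = new
            return True

hms = FakeHMS("s3://bucket/wh/orders/metadata/00001.metadata.json")

# Two writers read the same base pointer, then race to commit.
base = hms.metadata_location
a_ok = hms.compare_and_swap(base, "s3://bucket/wh/orders/metadata/00002-a.metadata.json")
b_ok = hms.compare_and_swap(base, "s3://bucket/wh/orders/metadata/00002-b.metadata.json")

print(a_ok, b_ok)  # True False: the losing writer must re-read and retry
```

The losing writer re-reads the new metadata_location, rebases its changes on the fresh metadata, and retries; this is how concurrent Iceberg commits stay serializable.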
Configuration: Spark with HMS Iceberg Catalog
from pyspark.sql import SparkSession

# Requires the iceberg-spark-runtime jar on the Spark classpath.
spark = SparkSession.builder \
    .config("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.hive_prod.type", "hive") \
    .config("spark.sql.catalog.hive_prod.uri", "thrift://metastore-host:9083") \
    .config("spark.sql.catalog.hive_prod.warehouse", "s3://bucket/warehouse/") \
    .getOrCreate()
Migrating from Hive Tables to Iceberg via HMS
One of the most practical uses of HMS as an Iceberg catalog is in-place migration of existing Hive tables to Iceberg format:
-- Spark SQL: migrate a Hive table to Iceberg in place
CALL spark_catalog.system.migrate('db.orders');
This call converts the existing Hive table’s metadata in HMS to Iceberg format without copying data files. The existing Parquet files are registered in Iceberg manifests, and the HMS table entry is updated to mark it as an Iceberg table. This enables zero-downtime migration paths.
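The conversion described above can be pictured in three steps: enumerate the table's existing data files, record them in Iceberg manifests, and flip the HMS entry to Iceberg. A toy sketch under those assumptions (helper and field names are illustrative; the real procedure also captures partition values and writes actual manifest files):

```python
def migrate_in_place(hms_entry, data_files):
    """Toy model of CALL system.migrate: no data files are copied."""
    # 1. Register every existing Parquet file in an Iceberg "manifest".
    manifest = [{"path": f, "format": "parquet"} for f in data_files]
    # 2. Derive a location for the newly written metadata file
    #    (here just a string; really a JSON file in object storage).
    metadata_location = (
        hms_entry["location"] + "/metadata/00000-migrated.metadata.json"
    )
    # 3. Flip the HMS entry so engines now treat the table as Iceberg.
    hms_entry["parameters"] = {
        "table_type": "ICEBERG",
        "metadata_location": metadata_location,
    }
    return manifest

entry = {"location": "s3://bucket/wh/db/orders", "parameters": {}}
files = ["s3://bucket/wh/db/orders/part-0000.parquet",
         "s3://bucket/wh/db/orders/part-0001.parquet"]
manifest = migrate_in_place(entry, files)

print(len(manifest), entry["parameters"]["table_type"])  # 2 ICEBERG
```

Because only metadata is written, the migration cost scales with the number of files, not the volume of data.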
Limitations of HMS as an Iceberg Catalog
While HMS works for Iceberg, it has important limitations compared to modern catalog options:
- No REST Catalog API: HMS exposes a Thrift API, which in practice is best supported by JVM clients. Python clients (PyIceberg) and other non-JVM tools require extra setup or workarounds.
- Performance: HMS queries involve a round trip to a relational database for every table operation. Under high concurrency, the database can become a bottleneck.
- No credential vending: HMS does not support the REST Catalog credential vending spec.
- Single point of failure: Without careful HA configuration, HMS is a SPOF for all Iceberg operations.
- No branching/tagging: HMS has no concept of catalog-level branching (unlike Nessie).
Recommended Migration Path
For most teams, HMS is the right starting point if you already have HMS deployed and are migrating from Hive. Over time, migrating to a REST Catalog implementation (Apache Polaris, AWS Glue, or Project Nessie) provides:
- Broader engine compatibility via the standard REST API
- Better performance and scalability
- Credential vending for multi-engine security
- Language-agnostic client support (Python, Rust, Go)
Migrating between Iceberg catalogs is a metadata-only operation: the underlying data and metadata files don't move. You re-register the table's current metadata_location in the new catalog, and engines pointed at the new catalog pick up the table with its full history intact.
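That catalog switch can be pictured as re-registering the same metadata pointer under a new catalog. A toy sketch with two catalogs modeled as dicts (REST catalogs expose this as a register-table style operation; the names here are illustrative, not any catalog's real API):

```python
# Each catalog maps a table identifier to its current metadata_location.
hive_catalog = {
    "db.orders": "s3://bucket/wh/db/orders/metadata/00042.metadata.json",
}
rest_catalog = {}

def move_registration(src, dst, table):
    """Re-point a table at a new catalog; data and metadata files stay put."""
    dst[table] = src.pop(table)  # only the registration moves

move_registration(hive_catalog, rest_catalog, "db.orders")
print(rest_catalog["db.orders"])
```

Since the pointer still references the same metadata file, every snapshot and data file remains reachable from the new catalog with no rewrite.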