Google Cloud and Apache Iceberg

Google Cloud's Apache Iceberg stack integrates BigQuery, Cloud Storage, Biglake Metastore, and Cloud Dataplex to provide a fully managed, governed Iceberg lakehouse on GCP, with BigLake Managed Tables supporting multi-engine access via the Iceberg REST Catalog API.

Google Cloud Platform (GCP) has invested significantly in Apache Iceberg as the open table format for its cloud analytics stack. The full GCP Iceberg offering spans four integrated services: BigQuery (query engine), Cloud Storage (object storage), BigLake Metastore (REST Catalog), and Cloud Dataplex (governance).

Together, these form Google’s answer to the open lakehouse: a fully managed, AI-integrated analytics platform where Iceberg tables are first-class citizens.

The GCP Iceberg Architecture

GCP Iceberg Stack:

Cloud Storage (GCS)         ← Iceberg data files (Parquet)
     │
BigLake Metastore          ← Iceberg REST Catalog
     │
BigQuery                   ← Primary query engine (SQL)
     │
Cloud Dataplex             ← Data governance, lineage, quality
     │
Vertex AI                  ← ML/AI on Iceberg data
     │
Spark on Dataproc          ← ETL, streaming ingestion

Cloud Storage as Iceberg Storage

Google Cloud Storage (GCS) is the object storage foundation. Iceberg data files (Parquet), metadata files, and manifest files are all stored in GCS buckets. GCS provides:

Multi-region and dual-region storage classes for data residency compliance.
Object lifecycle rules for managing storage costs (transition to Archive for cold snapshots).
Strong consistency guarantees (required for Iceberg’s atomic metadata operations).
IAM-based access control integrated with BigLake Metastore credential vending.

gs://my-lakehouse-bucket/warehouse/
  ├── analytics/orders/           ← Iceberg table location
  │   ├── data/                   ← Parquet data files
  │   └── metadata/               ← Metadata and manifest files
  └── analytics/customers/

BigLake Metastore as the REST Catalog

BigLake Metastore is Google’s managed implementation of the Iceberg REST Catalog specification. It serves as the central catalog for all Iceberg tables managed by BigQuery and accessible to external engines.

Key capabilities:

Full REST Catalog API: Any Iceberg-compatible engine connects via standard REST.
Credential vending: Vends scoped GCS service account tokens to authenticated engines.
IAM integration: Catalog access governed by Google IAM roles.
Table sharing: Tables registered in BigLake Metastore are accessible to Spark (Dataproc), PyIceberg, and Dremio.

Cloud Dataplex for Governance

Google Cloud Dataplex is the data governance and management layer:

Data scanning: Profile Iceberg tables, detect schema drift, measure data quality.
Metadata discovery: Auto-discover Iceberg tables in GCS and register in BigLake Metastore.
Data lineage: Track transformations between Iceberg tables via Cloud Dataflow and Dataproc jobs.
Data catalog: Tag and classify Iceberg columns (PII, sensitivity, owner).
Policy enforcement: Apply column-level masking and row-level access policies.

Spark on Dataproc + Iceberg

Google Cloud Dataproc is GCP’s managed Spark/Hadoop service. For Iceberg ETL on GCP:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.catalog.biglake", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.biglake.catalog-impl",
            "org.apache.iceberg.gcp.biglake.BigLakeCatalog") \
    .config("spark.sql.catalog.biglake.gcp_project", "my-gcp-project") \
    .config("spark.sql.catalog.biglake.gcp_location", "us-central1") \
    .config("spark.sql.catalog.biglake.blms_catalog", "my_catalog") \
    .config("spark.sql.catalog.biglake.warehouse",
            "gs://my-lakehouse-bucket/warehouse/") \
    .getOrCreate()

# Read from GCS, write to Iceberg via BigLake
df = spark.read.json("gs://raw-bucket/events/")
df.writeTo("biglake.analytics.events").append()

Vertex AI and Iceberg Data

Google Vertex AI can access Iceberg tables for ML training:

Vertex AI Feature Store can use Iceberg tables as offline feature storage.
BigQuery ML can train models directly on BigLake Managed Iceberg tables.
Vertex AI Pipelines can incorporate PyIceberg + GCS steps for feature engineering.

GCP vs. AWS Iceberg Ecosystem

Aspect	GCP	AWS
Storage	GCS	S3
REST Catalog	BigLake Metastore	S3 Tables REST, Glue
Primary query engine	BigQuery	Athena, EMR
Governance	Dataplex	Lake Formation
AI/ML	Vertex AI	SageMaker
Managed Iceberg	BigLake Managed Tables	S3 Tables
Open Catalog standard	Yes (REST API)	Yes (REST API)

Both clouds provide complete, managed Iceberg stacks. The choice between GCP and AWS typically follows existing cloud commitments and team expertise rather than Iceberg-specific capability differences.

Google Cloud and Apache Iceberg