Cloud-Specific Integrations Last updated: May 14, 2026

Google Cloud and Apache Iceberg

Google Cloud's Apache Iceberg stack integrates BigQuery, Cloud Storage, BigLake Metastore, and Cloud Dataplex to provide a fully managed, governed Iceberg lakehouse on GCP, with BigLake Managed Tables supporting multi-engine access via the Iceberg REST Catalog API.



Google Cloud Platform (GCP) has invested significantly in Apache Iceberg as the open table format for its cloud analytics stack. The full GCP Iceberg offering spans four integrated services: BigQuery (query engine), Cloud Storage (object storage), BigLake Metastore (REST Catalog), and Cloud Dataplex (governance).

Together, these form Google’s answer to the open lakehouse: a fully managed, AI-integrated analytics platform where Iceberg tables are first-class citizens.

The GCP Iceberg Architecture

GCP Iceberg Stack:

Cloud Storage (GCS)         ← Iceberg data files (Parquet)

BigLake Metastore          ← Iceberg REST Catalog

BigQuery                   ← Primary query engine (SQL)

Cloud Dataplex             ← Data governance, lineage, quality

Vertex AI                  ← ML/AI on Iceberg data

Spark on Dataproc          ← ETL, streaming ingestion

Cloud Storage as Iceberg Storage

Google Cloud Storage (GCS) is the object storage foundation. Iceberg data files (Parquet), metadata files, and manifest files are all stored in GCS buckets, for example under a warehouse prefix laid out like this:

gs://my-lakehouse-bucket/warehouse/
  ├── analytics/orders/           ← Iceberg table location
  │   ├── data/                   ← Parquet data files
  │   └── metadata/               ← Metadata and manifest files
  └── analytics/customers/
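
The layout above can be expressed as a small path helper. This is a minimal sketch that only mirrors the directory tree shown here; the function names are hypothetical, and in practice the authoritative file locations are whatever the table's Iceberg metadata records, which need not follow this convention.

```python
import posixpath


def table_location(warehouse: str, namespace: str, table: str) -> str:
    """Return the GCS prefix for a table under the warehouse root.

    Assumes the warehouse/<namespace>/<table> convention from the tree above.
    """
    scheme, _, rest = warehouse.partition("://")
    path = posixpath.join(rest.rstrip("/"), namespace, table)
    return f"{scheme}://{path}"


def table_prefixes(warehouse: str, namespace: str, table: str) -> dict:
    """Return the data/ and metadata/ prefixes for a table."""
    root = table_location(warehouse, namespace, table)
    return {"data": f"{root}/data/", "metadata": f"{root}/metadata/"}


if __name__ == "__main__":
    prefixes = table_prefixes(
        "gs://my-lakehouse-bucket/warehouse/", "analytics", "orders"
    )
    for kind, prefix in prefixes.items():
        print(kind, prefix)
```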

BigLake Metastore as the REST Catalog

BigLake Metastore is Google’s managed implementation of the Iceberg REST Catalog specification. It serves as the central catalog for all Iceberg tables managed by BigQuery and accessible to external engines.
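
Because the catalog follows the Iceberg REST Catalog specification, external engines address tables through the spec's standard routes. The sketch below builds the "load table" route (`/v1/{prefix}/namespaces/{namespace}/tables/{table}`) as defined by that specification. The base URL and prefix are placeholders, not the actual BigLake endpoint, and authentication is omitted entirely.

```python
from urllib.parse import quote

# The REST spec joins multi-level namespaces with the 0x1F unit separator.
UNIT_SEPARATOR = "\x1f"


def load_table_url(base_url: str, prefix: str, namespace: list, table: str) -> str:
    """Build the Iceberg REST 'load table' URL for a namespace and table.

    base_url and prefix are hypothetical values supplied by the caller.
    """
    ns = quote(UNIT_SEPARATOR.join(namespace), safe="")
    return (
        f"{base_url.rstrip('/')}/v1/{quote(prefix, safe='')}"
        f"/namespaces/{ns}/tables/{quote(table, safe='')}"
    )


if __name__ == "__main__":
    # Placeholder host; substitute the endpoint documented for your catalog.
    print(load_table_url("https://catalog.example.com", "my_catalog",
                         ["analytics"], "orders"))
```

A real client such as an engine's REST catalog integration issues this request for you; the sketch only shows what "REST Catalog" means at the URL level.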


Cloud Dataplex for Governance

Google Cloud Dataplex is the data governance and management layer, covering governance, lineage, and data quality for the lakehouse.

Spark on Dataproc + Iceberg

Google Cloud Dataproc is GCP’s managed Spark/Hadoop service. For Iceberg ETL on GCP:

from pyspark.sql import SparkSession

# Register an Iceberg catalog named "biglake" backed by BigLake Metastore.
# Assumes the Iceberg Spark runtime and BigLake catalog JARs are on the classpath.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.biglake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.biglake.catalog-impl",
            "org.apache.iceberg.gcp.biglake.BigLakeCatalog")
    .config("spark.sql.catalog.biglake.gcp_project", "my-gcp-project")
    .config("spark.sql.catalog.biglake.gcp_location", "us-central1")
    .config("spark.sql.catalog.biglake.blms_catalog", "my_catalog")
    .config("spark.sql.catalog.biglake.warehouse",
            "gs://my-lakehouse-bucket/warehouse/")
    .getOrCreate()
)

# Read raw JSON events from GCS and append them to an Iceberg table via BigLake.
# append() fails if the table does not exist yet; use create() for the first load.
df = spark.read.json("gs://raw-bucket/events/")
df.writeTo("biglake.analytics.events").append()
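
When several jobs share the same catalog, the settings in the Spark example above can be factored into one function. This is a minimal sketch: the helper name is hypothetical, and the property keys are copied from the example rather than verified against any particular Iceberg release.

```python
def biglake_catalog_conf(name: str, project: str, location: str,
                         blms_catalog: str, warehouse: str) -> dict:
    """Return Spark config entries for an Iceberg catalog on BigLake Metastore.

    Keys mirror the SparkSession example above; values are caller-supplied.
    """
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.gcp.biglake.BigLakeCatalog",
        f"{prefix}.gcp_project": project,
        f"{prefix}.gcp_location": location,
        f"{prefix}.blms_catalog": blms_catalog,
        f"{prefix}.warehouse": warehouse,
    }


if __name__ == "__main__":
    conf = biglake_catalog_conf(
        "biglake", "my-gcp-project", "us-central1",
        "my_catalog", "gs://my-lakehouse-bucket/warehouse/",
    )
    for key, value in conf.items():
        print(f"{key}={value}")
```

Each entry can then be applied with `builder.config(key, value)` in a loop, or passed as `--conf key=value` flags when submitting the job.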

Vertex AI and Iceberg Data

Google Vertex AI can access Iceberg tables for ML training.

GCP vs. AWS Iceberg Ecosystem

Aspect                   GCP                       AWS
Storage                  GCS                       S3
REST Catalog             BigLake Metastore         S3 Tables REST, Glue
Primary query engine     BigQuery                  Athena, EMR
Governance               Dataplex                  Lake Formation
AI/ML                    Vertex AI                 SageMaker
Managed Iceberg          BigLake Managed Tables    S3 Tables
Open Catalog standard    Yes (REST API)            Yes (REST API)

Both clouds provide complete, managed Iceberg stacks. The choice between GCP and AWS typically follows existing cloud commitments and team expertise rather than Iceberg-specific capability differences.

