Google Cloud and Apache Iceberg
Google Cloud Platform (GCP) has invested significantly in Apache Iceberg as the open table format for its cloud analytics stack. The full GCP Iceberg offering spans four integrated services: BigQuery (query engine), Cloud Storage (object storage), BigLake Metastore (REST Catalog), and Cloud Dataplex (governance).
Together, these form Google’s answer to the open lakehouse: a fully managed, AI-integrated analytics platform where Iceberg tables are first-class citizens.
The GCP Iceberg Architecture
GCP Iceberg Stack:
Cloud Storage (GCS) ← Iceberg data files (Parquet)
│
BigLake Metastore ← Iceberg REST Catalog
│
BigQuery ← Primary query engine (SQL)
│
Cloud Dataplex ← Data governance, lineage, quality
│
Vertex AI ← ML/AI on Iceberg data
│
Spark on Dataproc ← ETL, streaming ingestion
Cloud Storage as Iceberg Storage
Google Cloud Storage (GCS) is the object storage foundation. Iceberg data files (Parquet), metadata files, and manifest files are all stored in GCS buckets. GCS provides:
- Multi-region and dual-region storage classes for data residency compliance.
- Object lifecycle rules for managing storage costs (transition to Archive for cold snapshots).
- Strong consistency guarantees (required for Iceberg’s atomic metadata operations).
- IAM-based access control integrated with BigLake Metastore credential vending.
gs://my-lakehouse-bucket/warehouse/
├── analytics/orders/ ← Iceberg table location
│ ├── data/ ← Parquet data files
│ └── metadata/ ← Metadata and manifest files
└── analytics/customers/
BigLake Metastore as the REST Catalog
BigLake Metastore is Google’s managed implementation of the Iceberg REST Catalog specification. It serves as the central catalog for all Iceberg tables managed by BigQuery and accessible to external engines.
Key capabilities:
- Full REST Catalog API: Any Iceberg-compatible engine connects via standard REST.
- Credential vending: Vends scoped GCS service account tokens to authenticated engines.
- IAM integration: Catalog access governed by Google IAM roles.
- Table sharing: Tables registered in BigLake Metastore are accessible to Spark (Dataproc), PyIceberg, and Dremio.
Cloud Dataplex for Governance
Google Cloud Dataplex is the data governance and management layer:
- Data scanning: Profile Iceberg tables, detect schema drift, measure data quality.
- Metadata discovery: Auto-discover Iceberg tables in GCS and register in BigLake Metastore.
- Data lineage: Track transformations between Iceberg tables via Cloud Dataflow and Dataproc jobs.
- Data catalog: Tag and classify Iceberg columns (PII, sensitivity, owner).
- Policy enforcement: Apply column-level masking and row-level access policies.
Spark on Dataproc + Iceberg
Google Cloud Dataproc is GCP’s managed Spark/Hadoop service. For Iceberg ETL on GCP:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.sql.catalog.biglake", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.biglake.catalog-impl",
"org.apache.iceberg.gcp.biglake.BigLakeCatalog") \
.config("spark.sql.catalog.biglake.gcp_project", "my-gcp-project") \
.config("spark.sql.catalog.biglake.gcp_location", "us-central1") \
.config("spark.sql.catalog.biglake.blms_catalog", "my_catalog") \
.config("spark.sql.catalog.biglake.warehouse",
"gs://my-lakehouse-bucket/warehouse/") \
.getOrCreate()
# Read from GCS, write to Iceberg via BigLake
df = spark.read.json("gs://raw-bucket/events/")
df.writeTo("biglake.analytics.events").append()
Vertex AI and Iceberg Data
Google Vertex AI can access Iceberg tables for ML training:
- Vertex AI Feature Store can use Iceberg tables as offline feature storage.
- BigQuery ML can train models directly on BigLake Managed Iceberg tables.
- Vertex AI Pipelines can incorporate PyIceberg + GCS steps for feature engineering.
GCP vs. AWS Iceberg Ecosystem
| Aspect | GCP | AWS |
|---|---|---|
| Storage | GCS | S3 |
| REST Catalog | BigLake Metastore | S3 Tables REST, Glue |
| Primary query engine | BigQuery | Athena, EMR |
| Governance | Dataplex | Lake Formation |
| AI/ML | Vertex AI | SageMaker |
| Managed Iceberg | BigLake Managed Tables | S3 Tables |
| Open Catalog standard | Yes (REST API) | Yes (REST API) |
Both clouds provide complete, managed Iceberg stacks. The choice between GCP and AWS typically follows existing cloud commitments and team expertise rather than Iceberg-specific capability differences.