Agentic & AI · Last updated: May 14, 2026

Iceberg AI Readiness

Iceberg AI readiness describes the architectural properties that make Apache Iceberg tables ideal for AI and machine learning workloads: immutable snapshot reproducibility, safe in-place schema evolution, governed access via the REST catalog, and Python-native access through PyIceberg.



Iceberg AI readiness refers to the set of architectural properties in Apache Iceberg that make it particularly well-suited as the data foundation for AI and machine learning workloads. These properties go beyond standard analytics requirements — they address specific challenges that AI and ML teams face when working with large-scale datasets.

Key Properties That Make Iceberg AI-Ready

1. Reproducible Training Datasets via Snapshots

ML models must be reproducible: given the same training data, the same model should be produced. Iceberg’s immutable snapshot system enables this naturally.

When training a model:

  1. Record the snapshot ID used for training.
  2. Store this snapshot ID alongside the model artifact.
  3. To reproduce: load the table at the exact snapshot → identical training data.

from pyiceberg.catalog import load_catalog

catalog = load_catalog("my_catalog", **{...})
table = catalog.load_table("ml.training_features")

# Get current snapshot ID before training
training_snapshot_id = table.current_snapshot().snapshot_id
print(f"Training on snapshot: {training_snapshot_id}")

# Load training data
df = table.scan(snapshot_id=training_snapshot_id).to_arrow().to_pandas()

# Train model...
# Record training_snapshot_id in MLflow/metadata

# Later: reproduce the exact training dataset
reproduce_df = table.scan(snapshot_id=training_snapshot_id).to_arrow().to_pandas()

Even if the production table has been updated hundreds of times since training, the original training snapshot remains accessible until explicitly expired.
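The bookkeeping side of this workflow can live in MLflow tags or any metadata store; here is a minimal stdlib-only sketch of the idea (the class and model names are hypothetical, not part of any library):

```python
# Hypothetical registry mapping model versions to the Iceberg snapshot they
# were trained on -- in practice this would be an MLflow tag or model metadata.

class TrainingLineage:
    def __init__(self):
        self._snapshots = {}  # model version -> snapshot ID

    def record(self, model_version: str, snapshot_id: int) -> None:
        self._snapshots[model_version] = snapshot_id

    def snapshot_for(self, model_version: str) -> int:
        """Look up the snapshot to replay for an exact reproduction."""
        return self._snapshots[model_version]

lineage = TrainingLineage()
lineage.record("churn-model-v3", 8744736658442914487)
print(lineage.snapshot_for("churn-model-v3"))  # 8744736658442914487
```

At reproduction time, the looked-up ID is passed straight to `table.scan(snapshot_id=...)` as shown above.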

2. Schema Evolution Without Pipeline Breakage

ML feature pipelines break when source tables change schema. Iceberg’s schema evolution is backward-compatible: columns can be added, renamed, reordered, or dropped (and types safely widened) as pure metadata operations, with no data files rewritten. Because columns are tracked by unique field IDs rather than by name or position, existing readers continue to resolve the columns they expect.

This is crucial for long-running ML systems that must survive table schema changes without emergency pipeline updates.
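To see why ID-based column tracking makes renames safe, here is a stdlib-only simulation of the principle (deliberately simplified, not PyIceberg's actual internals): data files key values by immutable field IDs, while the schema maps names to IDs, so a rename only touches the mapping.

```python
# Simulation of Iceberg-style column resolution by field ID (illustrative only).
# Data files key values by immutable field IDs; the schema maps names -> IDs.

data_file = {1: "user-42", 2: 0.87}   # field ID -> value, written under schema v1

schema_v1 = {"user_id": 1, "score": 2}           # original column names
schema_v2 = {"user_id": 1, "model_score": 2}     # "score" renamed; same field ID

def read_column(schema, file_row, name):
    """Resolve a column name to its field ID, then read by ID."""
    return file_row[schema[name]]

# Old data remains readable under the new name -- no file rewrite needed.
print(read_column(schema_v1, data_file, "score"))        # 0.87
print(read_column(schema_v2, data_file, "model_score"))  # 0.87
```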

3. Python-Native Access via PyIceberg

The ML/data science ecosystem is Python-first. PyIceberg provides direct Python access to Iceberg tables without requiring Spark or JVM:

import pyarrow as pa
from pyiceberg.catalog import load_catalog

# No Spark, no JVM, no cluster needed
catalog = load_catalog("my_catalog", type="rest", uri="...")
features = catalog.load_table("ml.user_features").scan(
    selected_fields=("user_id", "feature_1", "feature_2", "label")
).to_arrow()

# Direct to PyTorch DataLoader
import torch
from torch.utils.data import Dataset

class IcebergDataset(Dataset):
    """Wrap an Arrow table for PyTorch; assumes all selected columns are numeric."""
    def __init__(self, arrow_table):
        self.data = arrow_table
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        # Per-row conversion is fine for illustration; real training loops
        # should iterate Arrow record batches instead for throughput.
        return torch.tensor(self.data.slice(idx, 1).to_pandas().values[0])
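The chunked access pattern that makes this efficient at scale (the same pattern Arrow's `Table.to_batches()` exposes) can be sketched with the stdlib alone:

```python
# Stdlib sketch of chunked iteration -- the access pattern behind Arrow's
# record-batch streaming, which avoids per-row conversion overhead.

def batched(rows, batch_size):
    """Yield consecutive fixed-size chunks of a sequence (last may be short)."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = list(range(10))
print([len(b) for b in batched(rows, 4)])  # [4, 4, 2]
```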

4. Feature Store Foundation

Iceberg is increasingly used as the storage layer for feature stores:

# Point-in-time correct feature retrieval (prevent future leakage)
user_features = table.scan(
    snapshot_id=snapshot_at_training_time,
    row_filter="event_date <= '2026-01-01'"
).to_arrow()
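The leakage-prevention logic itself is simple; a stdlib sketch with hypothetical data: for each entity, take the latest feature value observed on or before the training cutoff, and never anything after it.

```python
# Point-in-time-correct feature lookup (illustrative, stdlib only):
# use only feature values observed on or before the cutoff date.

def features_as_of(updates, cutoff):
    """updates: list of (entity_id, event_date, value), dates as 'YYYY-MM-DD'.
    Returns the latest value per entity with event_date <= cutoff."""
    latest = {}
    for entity_id, event_date, value in sorted(updates, key=lambda u: u[1]):
        if event_date <= cutoff:  # ISO dates compare correctly as strings
            latest[entity_id] = value
    return latest

updates = [
    ("u1", "2025-12-01", 0.2),
    ("u1", "2026-01-15", 0.9),  # after cutoff -> would leak future information
    ("u2", "2025-11-20", 0.5),
]
print(features_as_of(updates, "2026-01-01"))  # {'u1': 0.2, 'u2': 0.5}
```

Combining a pinned `snapshot_id` with a cutoff `row_filter`, as in the scan above, gives the table-level equivalent of this guarantee.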

5. Governed Access via REST Catalog

AI pipelines need controlled data access: the Iceberg REST catalog puts a single governed endpoint in front of every table, so authentication, authorization, and auditing are enforced centrally instead of being re-implemented in each training job or agent. Catalogs such as Apache Polaris layer role-based access control and credential vending on top of this interface.
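Purely as an illustration of the kind of check a governing catalog performs before serving a table scan (all names and the grant model here are hypothetical):

```python
# Hypothetical namespace-level access check, of the kind a catalog might apply
# before returning table metadata or vending storage credentials.

GRANTS = {
    "training-pipeline": {"ml"},          # principal -> readable namespaces
    "reporting-agent": {"ml", "sales"},
}

def can_read(principal, table_identifier):
    """Allow the read only if the table's namespace is granted to the principal."""
    namespace = table_identifier.split(".", 1)[0]
    return namespace in GRANTS.get(principal, set())

print(can_read("training-pipeline", "ml.user_features"))  # True
print(can_read("training-pipeline", "sales.orders"))      # False
```

Because every engine and agent reaches the table through the same catalog endpoint, a rule like this is enforced once rather than per consumer.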

6. Interoperability with the ML Ecosystem

Iceberg integrates with the ML ecosystem via:

| Integration | Use Case |
| --- | --- |
| PyArrow | High-performance columnar data for ML |
| Pandas | Data exploration and feature engineering |
| DuckDB | SQL feature queries without Spark |
| Ray | Distributed ML training on Iceberg data |
| Hugging Face Datasets | Via Arrow table bridge |
| MLflow | Log snapshot IDs with model artifacts |
| Feast | Feature store on Iceberg offline store |

The Agentic AI Data Stack

For AI agents that need to query, reason over, and act on data:

  1. Apache Iceberg: Open, reproducible, governed data storage.
  2. PyIceberg / MCP Server: Python-native or agent-native data access.
  3. Apache Polaris: Catalog for discovery and access control.
  4. Dremio AI Semantic Layer: Business context for agent understanding.
  5. Dremio AI Agent: Autonomous analytics execution.

This stack provides AI agents with governed access to structured, versioned, contextualized data — the foundation for trustworthy agentic analytics.
