Engines & Integrations · Last updated: May 14, 2026

PyIceberg: Python Library for Apache Iceberg

PyIceberg is the official Python library for Apache Iceberg. It provides a pure-Python client for reading, writing, and managing Iceberg tables without Spark or the JVM, enabling Python-native data engineering and ML workflows.



PyIceberg is the official Python client library for Apache Iceberg, maintained as part of the Apache Iceberg project. It provides a pure-Python API for:

- connecting to Iceberg catalogs (REST, AWS Glue, SQL-backed)
- reading tables into PyArrow and Pandas, with filter and column pushdown
- appending and overwriting data
- time travel across table snapshots
- inspecting schemas, snapshots, and data files

PyIceberg is the correct choice for Python data engineering workflows that don’t require Spark’s distributed processing. It’s significantly lighter weight, faster to set up, and more Python-idiomatic.

Installation

pip install "pyiceberg[s3fs,glue]"          # AWS with S3 storage
pip install "pyiceberg[adlfs]"              # Azure (ADLS via adlfs)
pip install "pyiceberg[gcsfs]"              # GCP (GCS via gcsfs)
pip install "pyiceberg[duckdb,sql-sqlite]"  # Local dev: DuckDB querying + SQLite-backed catalog

Connecting to a Catalog

REST Catalog (Apache Polaris, Dremio Open Catalog, AWS Glue REST)

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "my_catalog",
    **{
        "type": "rest",
        "uri": "https://my-catalog.example.com",
        "credential": "client-id:client-secret",
    }
)

AWS Glue Catalog

catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "glue.region": "us-east-1",
    }
)

Local / Development (SQL Catalog backed by SQLite)

catalog = load_catalog(
    "local",
    **{
        "type": "sql",
        "uri": "sqlite:///local_catalog.db",
        "warehouse": "file:///tmp/iceberg-warehouse",
    }
)

Reading Iceberg Tables

# Load a table
table = catalog.load_table("db.orders")

# Full table scan → PyArrow Table
arrow_table = table.scan().to_arrow()

# Convert to Pandas
df = arrow_table.to_pandas()

# Filter pushdown (predicates pushed to Iceberg manifest scanning)
from pyiceberg.expressions import GreaterThanOrEqual, LessThan, And

filtered = table.scan(
    row_filter=And(
        GreaterThanOrEqual("order_date", "2026-01-01"),
        LessThan("order_date", "2026-06-01")
    ),
    selected_fields=("order_id", "customer_id", "total"),
).to_arrow()

Writing Data

import pyarrow as pa

# Append new data
new_data = pa.table({
    "order_id": [1001, 1002, 1003],
    "customer_id": [42, 17, 99],
    "total": [150.00, 289.99, 44.50],
    "order_date": ["2026-05-14", "2026-05-14", "2026-05-14"],
})

table.append(new_data)

# Overwrite the entire table with new_data
# (use overwrite_filter to replace only matching rows)
table.overwrite(new_data)

Time Travel Queries

# Load a specific snapshot by ID
snapshot = table.snapshot_by_id(8027658604211071520)
scan = table.scan(snapshot_id=snapshot.snapshot_id)
historical_data = scan.to_arrow()

# Load by timestamp
from datetime import datetime
snap = table.snapshot_as_of_timestamp(
    int(datetime(2026, 1, 1).timestamp() * 1000)  # milliseconds
)

SQL via DuckDB Integration

PyIceberg integrates with DuckDB for SQL-based querying:

import duckdb

# Register the Iceberg table with DuckDB
conn = duckdb.connect()
table = catalog.load_table("db.orders")

# Read via PyIceberg to Arrow, then query with DuckDB
arrow_table = table.scan().to_arrow()
conn.register("orders", arrow_table)

result = conn.execute("""
    SELECT customer_id, SUM(total) as revenue
    FROM orders
    WHERE order_date >= '2026-01-01'
    GROUP BY customer_id
    ORDER BY revenue DESC
    LIMIT 10
""").fetchdf()

Schema and Metadata Operations

# Inspect table schema
print(table.schema())

# List all snapshots
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp_ms, snap.summary.operation)

# Inspect the data files a full scan would read
for task in table.scan().plan_files():
    print(task.file.file_path, task.file.record_count)

PyIceberg and the Agentic Lakehouse

PyIceberg is a natural integration point for AI agents and LLM-driven data workflows: an agent can load a catalog, inspect schemas and snapshots, and read or write tables entirely from Python, without provisioning a Spark cluster or a JVM.

For Python-first AI and data engineering teams, PyIceberg is the fastest path to Iceberg integration.

📚 Go Deeper on Apache Iceberg

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.
