Iceberg and Data Mesh Architecture
Data mesh is a sociotechnical approach to data architecture that distributes data ownership to domain teams, treats data as a product, enforces federated governance with a global interoperability standard, and provides self-service infrastructure. Apache Iceberg is the natural table format for data mesh implementations because it provides:
- A universal storage standard all domain data products can use.
- An interoperability layer (Iceberg REST Catalog) for cross-domain data discovery.
- Governance primitives (credential vending, RBAC via catalog) for federated access control.
- Immutable snapshots for data product versioning and SLA guarantees.
The Four Data Mesh Principles Mapped to Iceberg
Principle 1: Domain Ownership → Domain-Owned Iceberg Tables
Each domain team owns their Iceberg tables in their own catalog namespace or even their own catalog instance:
Apache Polaris Catalog
├── orders-domain/          ← owned by Orders team
│   ├── orders
│   ├── order_line_items
│   └── order_events
├── customers-domain/       ← owned by Customer team
│   ├── customers
│   ├── customer_segments
│   └── customer_events
└── products-domain/        ← owned by Product team
    ├── products
    ├── product_catalog
    └── inventory
Domain teams:
- Have write access to their own namespace.
- Manage their own schema, partitioning, and compaction schedules.
- Define which tables are “data products” (published for cross-domain access).
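As a minimal sketch, a domain team might bootstrap its own namespace with PyIceberg. The helper name and the `owner` namespace property below are illustrative, not a fixed convention:

```python
def bootstrap_domain(catalog, domain: str, owner_email: str) -> None:
    """Create a domain-owned namespace tagged with its owning team.

    `catalog` is any PyIceberg Catalog (e.g. from pyiceberg.catalog.load_catalog)
    loaded with credentials that allow namespace creation. Sketch only.
    """
    catalog.create_namespace(domain, properties={"owner": owner_email})

# Usage (illustrative):
#   bootstrap_domain(catalog, "orders-domain", "orders-team@company.com")
```

The catalog's RBAC then scopes write privileges on that namespace to the owning team, while other teams see it read-only (or not at all) until access is granted.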
Principle 2: Data as a Product → Published Iceberg Tables with SLAs
A data product in Iceberg terms is a curated, well-documented Iceberg table (or view) with:
- A committed schema contract (versioned schema via Iceberg schema evolution history).
- SLA guarantees (data freshness, data quality, availability).
- Documentation (catalog table properties, descriptions, column-level metadata).
- Consumer notification on breaking schema changes.
-- Publish a data product: set metadata on the Iceberg table
-- (backticks quote the hyphenated namespace, per Spark SQL)
ALTER TABLE `orders-domain`.orders SET TBLPROPERTIES (
  'data-product.owner' = 'orders-team@company.com',
  'data-product.sla.freshness-hours' = '4',
  'data-product.description' = 'Canonical order table. Source of truth for all order analytics.',
  'data-product.version' = '2.0.0'
);
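A consumer (or a monitoring job) can then verify the freshness SLA by comparing the table's last commit time against the `data-product.sla.freshness-hours` property. A minimal sketch; the PyIceberg calls that would supply the inputs are shown as comments since they need a live catalog:

```python
import time

def is_within_freshness_sla(last_commit_ms, sla_hours, now_ms=None):
    """Return True if the table's last commit is within the freshness SLA.

    last_commit_ms: epoch millis of the latest snapshot commit.
    sla_hours: the published 'data-product.sla.freshness-hours' value.
    """
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    age_hours = (now_ms - last_commit_ms) / 3_600_000
    return age_hours <= sla_hours

# With PyIceberg, the inputs would come from (illustrative):
#   table = catalog.load_table("orders-domain.orders")
#   last_commit_ms = table.current_snapshot().timestamp_ms
#   sla_hours = float(table.properties["data-product.sla.freshness-hours"])
```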
Principle 3: Self-Serve Infrastructure → Iceberg REST Catalog
An Iceberg REST Catalog implementation such as Apache Polaris provides the self-serve infrastructure layer:
- Discovery: Consumers browse available data products via catalog namespaces.
- Access request: Consumers request access; catalog RBAC grants scoped permissions.
- Credential vending: Consumers receive temporary, scoped object storage credentials.
- Engine choice: Consumers use any Iceberg-compatible engine (Dremio, Spark, PyIceberg) to access data products.
No domain team needs to maintain dedicated data sharing pipelines. The catalog handles all cross-domain access.
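As a sketch of the credential-vending step, a consumer's catalog configuration can request vended credentials via the Iceberg REST spec's access-delegation header, passed through with PyIceberg's `header.*` properties. The endpoint and credential below are illustrative, and catalog support for vended credentials may vary:

```python
# Illustrative PyIceberg REST catalog configuration. The header asks the
# catalog to vend temporary, scoped object storage credentials instead of
# requiring the consumer to hold long-lived storage keys.
catalog_config = {
    "type": "rest",
    "uri": "https://my-company.polaris.dremio.com",  # illustrative endpoint
    "credential": "consumer-team:secret",            # illustrative client credential
    "header.X-Iceberg-Access-Delegation": "vended-credentials",
}

# Usage (illustrative):
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("polaris", **catalog_config)
```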
Principle 4: Federated Computational Governance → Catalog-Level RBAC
Governance policies are enforced at the catalog layer, not per-domain:
Global governance:
- PII columns must be masked for non-authorized consumers
- Only consumers in "finance" role can access revenue data products
- Audit all cross-domain data access
Domain governance:
- Orders team controls write access to orders-domain namespace
- Orders team defines data product SLAs and quality standards
Catalog-level RBAC (in Apache Polaris / Dremio Open Catalog) enforces the global policies while domain teams enforce domain-specific rules.
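The split between global and domain rules can be sketched in plain Python. This is a hypothetical model of the two-level policy check, not a Polaris API:

```python
# Global rules: evaluated by the catalog for every cross-domain read.
GLOBAL_RULES = {
    "revenue": {"required_role": "finance"},  # only finance may read revenue products
}

# Domain rules: each team controls writes to its own namespace.
DOMAIN_WRITERS = {
    "orders-domain": {"orders-team"},
}

def can_read(consumer_roles, product_tags):
    """Global rule: every governed tag on a product requires the matching role."""
    return all(
        GLOBAL_RULES[tag]["required_role"] in consumer_roles
        for tag in product_tags
        if tag in GLOBAL_RULES
    )

def can_write(team, namespace):
    """Domain rule: only the owning team writes to its namespace."""
    return team in DOMAIN_WRITERS.get(namespace, set())
```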
Cross-Domain Data Product Access
# Consumer (Dremio / AI agent) accessing Orders team's data product
from pyiceberg.catalog import load_catalog

catalog = load_catalog("polaris", **{
    "type": "rest",
    "uri": "https://my-company.polaris.dremio.com",
    "credential": "consumer-team:secret",  # scoped credentials for read-only access
})

# Discover available data products in the orders domain
for table_id in catalog.list_tables("orders-domain"):
    table = catalog.load_table(table_id)
    print(table.properties.get("data-product.description", ""))

# Access the data product
orders = catalog.load_table("orders-domain.orders")
df = orders.scan(selected_fields=("order_id", "order_date", "total")).to_arrow()
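Because data products carry a version (the `data-product.version` property set above), a defensive consumer can pin the major version it was built against and fail fast on breaking changes. A hypothetical helper, assuming semantic versioning in that property:

```python
def check_major_version(properties, expected_major):
    """Return True if the data product's major version matches the consumer's pin.

    `properties` is the table's properties dict (e.g. table.properties in
    PyIceberg); missing or unversioned products default to major version 0.
    """
    version = properties.get("data-product.version", "0.0.0")
    return int(version.split(".")[0]) == expected_major

# Usage (illustrative):
#   if not check_major_version(orders.properties, expected_major=2):
#       raise RuntimeError("orders data product has a breaking schema change")
```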
Iceberg Data Mesh with Dremio
Dremio’s Agentic Lakehouse is particularly well-suited for the data mesh model:
- Open Catalog (Apache Polaris) serves as the cross-domain catalog.
- AI Semantic Layer adds semantic context to each domain’s data products.
- Virtual datasets define clean, governed interfaces over raw domain tables.
- AI Agent discovers and queries cross-domain data products autonomously.
The result is an agentic data mesh: AI agents can discover, understand, and query data products across domains using the semantic layer — making the mesh accessible to both human analysts and AI systems.
Data Mesh Anti-Patterns to Avoid with Iceberg
- Copying data between domains: Use the REST Catalog for zero-copy cross-domain access; never replicate Iceberg tables across domain namespaces.
- Schema coupling: Define stable data product schemas; consumers should not depend on raw table internals.
- Bypassing the catalog: All access must flow through the catalog for governance; direct S3 access defeats federated governance.