Iceberg and Data Mesh Architecture
Data mesh is a sociotechnical approach to data architecture that distributes data ownership to domain teams, treats data as a product, enforces federated governance with a global interoperability standard, and provides self-service infrastructure. Apache Iceberg is the natural table format for data mesh implementations because it provides:
- A universal storage standard all domain data products can use.
- An interoperability layer (Iceberg REST Catalog) for cross-domain data discovery.
- Governance primitives (credential vending, RBAC via catalog) for federated access control.
- Immutable snapshots for data product versioning and SLA guarantees.
The Four Data Mesh Principles Mapped to Iceberg
Principle 1: Domain Ownership → Domain-Owned Iceberg Tables
Each domain team owns their Iceberg tables in their own catalog namespace or even their own catalog instance:
Apache Polaris Catalog
├── orders-domain/          ← owned by Orders team
│   ├── orders
│   ├── order_line_items
│   └── order_events
├── customers-domain/       ← owned by Customer team
│   ├── customers
│   ├── customer_segments
│   └── customer_events
└── products-domain/        ← owned by Product team
    ├── products
    ├── product_catalog
    └── inventory
Domain teams:
- Have write access to their own namespace.
- Manage their own schema, partitioning, and compaction schedules.
- Define which tables are “data products” (published for cross-domain access).
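As a minimal sketch, a domain team might bootstrap its own namespace with PyIceberg. The helper name and the `owner` namespace property below are illustrative, not a fixed convention:

```python
def bootstrap_domain(catalog, domain: str, owner_email: str) -> None:
    """Create a domain-owned namespace tagged with its owning team.

    `catalog` is any PyIceberg Catalog (e.g. from pyiceberg.catalog.load_catalog)
    loaded with credentials that allow namespace creation. Sketch only.
    """
    catalog.create_namespace(domain, properties={"owner": owner_email})

# Usage (illustrative):
#   bootstrap_domain(catalog, "orders-domain", "orders-team@company.com")
```

The catalog's RBAC then scopes write privileges on that namespace to the owning team, while other teams see it read-only (or not at all) until access is granted.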
Principle 2: Data as a Product → Published Iceberg Tables with SLAs
A data product in Iceberg terms is a curated, well-documented Iceberg table (or view) with:
- A committed schema contract (versioned schema via Iceberg schema evolution history).
- SLA guarantees (data freshness, data quality, availability).
- Documentation (catalog table properties, descriptions, column-level metadata).
- Consumer notification on breaking schema changes.
-- Publish a data product: set metadata on the Iceberg table
-- (backticks quote the hyphenated namespace, per Spark SQL)
ALTER TABLE `orders-domain`.orders SET TBLPROPERTIES (
  'data-product.owner' = 'orders-team@company.com',
  'data-product.sla.freshness-hours' = '4',
  'data-product.description' = 'Canonical order table. Source of truth for all order analytics.',
  'data-product.version' = '2.0.0'
);
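A consumer (or a monitoring job) can then verify the freshness SLA by comparing the table's last commit time against the `data-product.sla.freshness-hours` property. A minimal sketch; the PyIceberg calls that would supply the inputs are shown as comments since they need a live catalog:

```python
import time

def is_within_freshness_sla(last_commit_ms, sla_hours, now_ms=None):
    """Return True if the table's last commit is within the freshness SLA.

    last_commit_ms: epoch millis of the latest snapshot commit.
    sla_hours: the published 'data-product.sla.freshness-hours' value.
    """
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    age_hours = (now_ms - last_commit_ms) / 3_600_000
    return age_hours <= sla_hours

# With PyIceberg, the inputs would come from (illustrative):
#   table = catalog.load_table("orders-domain.orders")
#   last_commit_ms = table.current_snapshot().timestamp_ms
#   sla_hours = float(table.properties["data-product.sla.freshness-hours"])
```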
Principle 3: Self-Serve Infrastructure → Iceberg REST Catalog
An Iceberg REST Catalog implementation such as Apache Polaris provides the self-serve infrastructure layer:
- Discovery: Consumers browse available data products via catalog namespaces.
- Access request: Consumers request access; catalog RBAC grants scoped permissions.
- Credential vending: Consumers receive temporary, scoped object storage credentials.
- Engine choice: Consumers use any Iceberg-compatible engine (Dremio, Spark, PyIceberg) to access data products.
No domain team needs to maintain dedicated data sharing pipelines. The catalog handles all cross-domain access.
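As a sketch of the credential-vending step, a consumer's catalog configuration can request vended credentials via the Iceberg REST spec's access-delegation header, passed through with PyIceberg's `header.*` properties. The endpoint and credential below are illustrative, and catalog support for vended credentials may vary:

```python
# Illustrative PyIceberg REST catalog configuration. The header asks the
# catalog to vend temporary, scoped object storage credentials instead of
# requiring the consumer to hold long-lived storage keys.
catalog_config = {
    "type": "rest",
    "uri": "https://my-company.polaris.dremio.com",  # illustrative endpoint
    "credential": "consumer-team:secret",            # illustrative client credential
    "header.X-Iceberg-Access-Delegation": "vended-credentials",
}

# Usage (illustrative):
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("polaris", **catalog_config)
```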
Principle 4: Federated Computational Governance → Catalog-Level RBAC
Governance policies are enforced at the catalog layer, not per-domain:
Global governance:
- PII columns must be masked for non-authorized consumers
- Only consumers in "finance" role can access revenue data products
- Audit all cross-domain data access
Domain governance:
- Orders team controls write access to orders-domain namespace
- Orders team defines data product SLAs and quality standards
Catalog-level RBAC (in Apache Polaris / Dremio Open Catalog) enforces the global policies while domain teams enforce domain-specific rules.
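The split between global and domain rules can be sketched in plain Python. This is a hypothetical model of the two-level policy check, not a Polaris API:

```python
# Global rules: evaluated by the catalog for every cross-domain read.
GLOBAL_RULES = {
    "revenue": {"required_role": "finance"},  # only finance may read revenue products
}

# Domain rules: each team controls writes to its own namespace.
DOMAIN_WRITERS = {
    "orders-domain": {"orders-team"},
}

def can_read(consumer_roles, product_tags):
    """Global rule: every governed tag on a product requires the matching role."""
    return all(
        GLOBAL_RULES[tag]["required_role"] in consumer_roles
        for tag in product_tags
        if tag in GLOBAL_RULES
    )

def can_write(team, namespace):
    """Domain rule: only the owning team writes to its namespace."""
    return team in DOMAIN_WRITERS.get(namespace, set())
```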
Cross-Domain Data Product Access
# Consumer (Dremio / AI agent) accessing Orders team's data product
from pyiceberg.catalog import load_catalog

catalog = load_catalog("polaris", **{
    "type": "rest",
    "uri": "https://my-company.polaris.dremio.com",
    "credential": "consumer-team:secret",  # scoped credentials for read-only access
})

# Discover available data products in the orders domain
for table_id in catalog.list_tables("orders-domain"):
    table = catalog.load_table(table_id)
    print(table.properties.get("data-product.description", ""))

# Access the data product
orders = catalog.load_table("orders-domain.orders")
df = orders.scan(selected_fields=("order_id", "order_date", "total")).to_arrow()
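Because data products carry a version (the `data-product.version` property set above), a defensive consumer can pin the major version it was built against and fail fast on breaking changes. A hypothetical helper, assuming semantic versioning in that property:

```python
def check_major_version(properties, expected_major):
    """Return True if the data product's major version matches the consumer's pin.

    `properties` is the table's properties dict (e.g. table.properties in
    PyIceberg); missing or unversioned products default to major version 0.
    """
    version = properties.get("data-product.version", "0.0.0")
    return int(version.split(".")[0]) == expected_major

# Usage (illustrative):
#   if not check_major_version(orders.properties, expected_major=2):
#       raise RuntimeError("orders data product has a breaking schema change")
```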
Iceberg Data Mesh with Dremio
Dremio’s Agentic Lakehouse is particularly well-suited for the data mesh model:
- Open Catalog (Apache Polaris) serves as the cross-domain catalog.
- AI Semantic Layer adds semantic context to each domain’s data products.
- Virtual datasets define clean, governed interfaces over raw domain tables.
- AI Agent discovers and queries cross-domain data products autonomously.
The result is an agentic data mesh: AI agents can discover, understand, and query data products across domains using the semantic layer — making the mesh accessible to both human analysts and AI systems.
Data Mesh Anti-Patterns to Avoid with Iceberg
- Copying data between domains: Use the REST Catalog for zero-copy cross-domain access; never replicate Iceberg tables across domain namespaces.
- Schema coupling: Define stable data product schemas; consumers should not depend on raw table internals.
- Bypassing the catalog: All access must flow through the catalog for governance; direct S3 access defeats federated governance.