Medallion Architecture with Apache Iceberg
The Medallion Architecture is a data design pattern that organizes a data lakehouse into three progressive refinement layers — Bronze (raw), Silver (cleaned/normalized), and Gold (business-ready) — with each layer transforming data into increasingly trusted, performant, and use-case-specific tables.
Apache Iceberg is the ideal table format for the Medallion Architecture because its ACID guarantees, schema evolution, and time travel capabilities provide reliability and observability at every layer of the pipeline.
The Three Layers
Bronze Layer: Raw Ingestion
The Bronze layer captures raw, unmodified data exactly as it arrives from source systems. No transformations, no cleaning, no filtering.
Characteristics:
- Data appended directly from Kafka, CDC streams, file uploads, or API calls.
- Schema matches the source system schema (or raw Kafka record schema).
- All historical data retained indefinitely (or per retention policy).
- Used for reprocessing: if Silver or Gold logic changes, Bronze is the source of truth for replay.
Iceberg properties at Bronze:
- Merge-on-Read with frequent Flink commits → low ingestion latency.
- Schema preserved exactly as received.
- Snapshot history enables replay from any historical point.
-- Bronze table: raw orders
CREATE TABLE bronze.orders (
raw_payload STRING, -- raw JSON or entire source row
ingested_at TIMESTAMP,
source_system STRING
) USING iceberg
PARTITIONED BY (days(ingested_at));
Silver Layer: Cleaned and Normalized
The Silver layer applies data quality rules, type casting, deduplication, and normalization to produce a clean, queryable representation of each entity.
Characteristics:
- CDC upserts applied (Bronze raw events merged into Silver fact/dimension tables).
- Schema normalized to match the business data model.
- Data quality rules enforced (null checks, range validation, deduplication).
- Queryable by data analysts and data scientists.
Iceberg properties at Silver:
- MERGE INTO for CDC apply and deduplication.
- Schema evolution as business model evolves.
- Time travel for debugging data quality issues.
-- Silver table: cleaned orders (CDC upsert target)
CREATE TABLE silver.orders (
order_id BIGINT NOT NULL,
customer_id BIGINT,
order_date DATE,
total DECIMAL(10,2),
status STRING,
region STRING,
updated_at TIMESTAMP
) USING iceberg
PARTITIONED BY (months(order_date));
-- Periodic Bronze → Silver CDC apply
MERGE INTO silver.orders AS target
USING bronze_normalized AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT VALUES (...);
Gold Layer: Business-Ready Aggregates
The Gold layer contains pre-aggregated, business-metric-aligned tables optimized for specific analytical use cases: executive dashboards, BI tools, ML feature stores, and AI agent queries.
Characteristics:
- Aggregated, denormalized, or pivoted data.
- Optimized for read performance (compacted, Z-order clustered).
- Refreshed on a schedule (daily, hourly) or on-demand.
- Served to BI tools (Dremio, Tableau, Power BI) and AI agents.
Iceberg properties at Gold:
- Copy-on-Write mode for maximum read performance.
- Z-order clustering for multi-column query patterns.
- Regular compaction to maintain file size and column statistics.
- Reflections (in Dremio) for sub-second query performance.
-- Gold table: daily revenue by region and product category
CREATE TABLE gold.daily_revenue (
revenue_date DATE,
region STRING,
product_category STRING,
total_revenue DECIMAL(18,2),
order_count BIGINT,
avg_order_value DECIMAL(10,2)
) USING iceberg
PARTITIONED BY (months(revenue_date))
TBLPROPERTIES ('write.merge.mode' = 'copy-on-write');
Full Pipeline with Iceberg
Source Systems (DBs, APIs, Kafka)
│
▼ (Flink streaming, batch ingestion)
Bronze Layer (raw Iceberg tables, MoR, append-only)
│
▼ (Spark batch CDC apply, MERGE INTO)
Silver Layer (cleaned Iceberg tables, schema-enforced)
│
▼ (Spark aggregation, Dremio virtual datasets)
Gold Layer (business-ready Iceberg tables, CoW, clustered)
│
▼
Dremio Intelligent Query Engine → BI Tools, AI Agents, Self-serve analytics
Why Iceberg Enables the Medallion Pattern
| Requirement | How Iceberg Fulfills It |
|---|---|
| Raw data preservation | Immutable snapshots at Bronze → replay any point in history |
| CDC application at Silver | MERGE INTO with MoR → fast upsert without rewrites |
| Schema evolution | Safe schema changes at each layer without pipeline breakage |
| Audit trail | Snapshot history at every layer |
| Performance at Gold | CoW + compaction + Z-order clustering + Dremio Reflections |
| Cross-layer consistency | Atomic snapshot commits → each layer always consistent |
| AI/agent readiness | Semantic layer over Gold tables → AI agents understand data |