Medallion Architecture
The Databricks medallion architecture is less an architecture than an approach or pattern with three data stages: bronze, silver, and gold.
Image source: Jorrit Sandbrink
- The bronze data stage stores the data in its original form from source systems. It uses Table Formats for storage, focuses on quick Change Data Capture, and provides historical archiving and data lineage.
- The silver data stage contains cleaned and transformed data blended from the bronze stage. It applies “just enough” transformations, provides an “Enterprise view” of key business entities, and enables self-service analytics.
- The gold data stage implements the analytics model with consumption-ready, project-specific databases. It uses denormalized, read-optimized data models and applies final transformations and data quality rules. The data model is usually a Star Schema, with Facts (transactional data) and Dimensions (descriptive attributes) typically defined and optimized at this layer.
Data flows through the layers from dirty to clean, normalized to denormalized, and granular to aggregated. The gold layer often represents the final stage of this transformation.
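The three stages can be illustrated with a minimal sketch in plain Python. It is not tied to Databricks or to any table format; the sample records, field names, and the `to_bronze`, `to_silver`, and `to_gold` functions are assumptions made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical raw events as they might arrive from a source system.
RAW_EVENTS = [
    {"order_id": "1", "country": " ch ", "amount": "19.90"},
    {"order_id": "2", "country": "DE", "amount": "5.00"},
    {"order_id": "2", "country": "DE", "amount": "5.00"},  # duplicate delivery
    {"order_id": "3", "country": "CH", "amount": "n/a"},   # unparseable amount
]


def to_bronze(raw_events):
    """Bronze: keep every record unchanged, only add ingestion metadata."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    return [{"payload": dict(e), "_loaded_at": loaded_at} for e in raw_events]


def to_silver(bronze_rows):
    """Silver: apply "just enough" cleaning: types, trimming, deduplication."""
    seen, silver_rows = set(), []
    for row in bronze_rows:
        payload = row["payload"]
        try:
            amount = float(payload["amount"])
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        order_id = int(payload["order_id"])
        if order_id in seen:
            continue  # drop repeated deliveries of the same order
        seen.add(order_id)
        silver_rows.append({
            "order_id": order_id,
            "country": payload["country"].strip().upper(),
            "amount": amount,
        })
    return silver_rows


def to_gold(silver_rows):
    """Gold: aggregate into a consumption-ready revenue-per-country table."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)


bronze = to_bronze(RAW_EVENTS)
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'CH': 19.9, 'DE': 5.0}
```

Bronze keeps all four records, including the duplicate and the malformed one, so that silver and gold can be derived again later with different rules.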
# Benefits and Considerations
- Simple, easy-to-understand data model built on an Object-Store and plain distributed files
- Enables incremental ETL
- Allows recreation of tables from raw data at any time
- Supports ACID transactions and time travel (provided by the underlying table format, such as Delta Lake)
- Flexible architecture that can be adapted to specific needs (e.g., adding more layers)
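Two of these points, incremental ETL and recreating tables from raw data, can be sketched in a few lines of plain Python. This is a toy model under stated assumptions: the bronze layer is an append-only list, and `SilverTable`, `clean`, and `rebuild_from_raw` are hypothetical names, not part of any library. ACID transactions and time travel are not modeled here.

```python
def clean(record):
    """Hypothetical silver transformation: normalize one bronze record."""
    return {"id": record["id"], "name": record["name"].strip().title()}


class SilverTable:
    """Toy silver table that tracks how much of bronze it has processed."""

    def __init__(self):
        self.rows = []
        self.watermark = 0  # number of bronze records already processed

    def update_incrementally(self, bronze_log):
        """Process only the bronze records appended since the last run."""
        new_records = bronze_log[self.watermark:]
        self.rows.extend(clean(r) for r in new_records)
        self.watermark = len(bronze_log)
        return len(new_records)


def rebuild_from_raw(bronze_log):
    """Recreate the silver table from scratch using the full bronze history."""
    table = SilverTable()
    table.update_incrementally(bronze_log)
    return table


bronze_log = [{"id": 1, "name": " ada "}, {"id": 2, "name": "grace"}]
silver = SilverTable()
first_run = silver.update_incrementally(bronze_log)   # processes 2 records

bronze_log.append({"id": 3, "name": "LINUS "})        # new data lands in bronze
second_run = silver.update_incrementally(bronze_log)  # processes only 1 record

rebuilt = rebuild_from_raw(bronze_log)
print(first_run, second_run, silver.rows == rebuilt.rows)  # 2 1 True
```

Because bronze retains the full history, the incremental result and the full rebuild agree, which is what makes it safe to drop and recreate downstream tables.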
# Historical Context
The Medallion Architecture, created and announced by Databricks, can be seen as an evolution of the Classical Architecture of Data Warehouse, with layers such as stage -> cleansing -> core -> mart, but optimized for Data Lakes and modern data processing needs.
Simon Whiteley has another great overview that combines the two and argues that every company has different requirements and, therefore, different layers. A medallion stage does not have to be a single layer; as shown in the image, it can contain several:
Image source: Behind the Hype - The Medallion Architecture Doesn’t Work (YouTube)
Diagram (layers and flow, left to right):
- Inputs: Batch and Streaming sources feed the Bronze layer
- Bronze layer: raw integration, landing zone, no schema needed
- Bronze -> Silver (cleaning). Silver layer: filtered, cleaned, augmented; define & evolve schema
- Silver -> Gold (organize). Gold layer: business-oriented, denormalized, clean data delivery
- Gold -> Platinum (curate). Platinum layer: semantic layer, aggregated, sub-second responses
- Outputs: the Platinum layer serves Excel, BI, ML/AI, and Data Apps
# Implementation
Databricks provides tools like Delta Live Tables (DLT) that allow users to build data pipelines with Bronze, Silver, and Gold tables using minimal code. These pipelines can be built on Apache Spark Structured Streaming for real-time data processing.
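Delta Live Tables is declarative: each table is defined as a function of its upstream tables, and the framework works out the execution order. The toy sketch below imitates that idea in plain Python so it can run anywhere. It is not the DLT API; the `table` decorator, the `materialize` helper, and the sample click data are hypothetical stand-ins.

```python
import inspect

TABLES = {}  # table name -> function that computes it


def table(func):
    """Register a function as a table; its parameter names are its inputs."""
    TABLES[func.__name__] = func
    return func


def materialize(name, cache=None):
    """Compute a table after recursively computing the tables it depends on."""
    cache = {} if cache is None else cache
    if name not in cache:
        func = TABLES[name]
        deps = inspect.signature(func).parameters  # upstream table names
        cache[name] = func(*(materialize(dep, cache) for dep in deps))
    return cache[name]


@table
def bronze_clicks():
    # Raw records, kept as strings exactly as received.
    return [
        {"user": "a", "ms": "120"},
        {"user": "b", "ms": "bad"},
        {"user": "a", "ms": "80"},
    ]


@table
def silver_clicks(bronze_clicks):
    # Keep only rows with a numeric duration and cast it to int.
    return [
        {"user": r["user"], "ms": int(r["ms"])}
        for r in bronze_clicks
        if r["ms"].isdigit()
    ]


@table
def gold_clicks_per_user(silver_clicks):
    # Aggregate total duration per user for consumption.
    totals = {}
    for r in silver_clicks:
        totals[r["user"]] = totals.get(r["user"], 0) + r["ms"]
    return totals


print(materialize("gold_clicks_per_user"))  # {'a': 200}
```

Requesting only the gold table is enough: the bronze and silver tables are computed first because they appear as parameter names upstream.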
# Variations
As with the Classical Architecture of Data Warehouse, Medallion Architectures can vary in their layers.
- You may have two, three, or at most four layers, depending on how much cleaning is required and how complex your data sources are
- Some tables might not fit perfectly into a single layer
- Adaptable to specific organizational needs and data complexities
Origin: Iceberg + Spark + Trino + Dagster: modern, open-source data stack demo, by ZD (Dev Genius, Jul 2022)
References: Trivadis Data Warehouse Layers.
Created 2022-08-16