# Classical Architecture of a Data Warehouse
This is the classical data warehouse architecture I learned at the beginning of my career. And to this day, I like to model my data warehouse this way.
From *Data Warehouse Blueprints: Business Intelligence in der Praxis* by Dani Schnider, Claus Jordan, Peter Welker, and Joachim Wehner (Trivadis), September 2016. ^437a59
Other views exist, e.g. the Medallion Architecture (image source: Behind the Hype - The Medallion Architecture Doesn’t Work, YouTube).
A modern take adds a Semantic Layer or OLAP Cubes, much like the old days of the classical architecture above:
# Example with Orchestration and Modeling tools
What is your go-to modeling architecture? I like the one below; if the source data is complex, I also add a cleansing layer between staging and core.
The example includes tools for orchestration, data ingestion, and data modeling, as well as data modeling techniques such as SCD2 and dimensional modeling (Kimball):
```mermaid
flowchart TB
  subgraph DWH_Architecture["Modern Data Stack Architecture"]
    direction LR
    subgraph DataFlow["Data Flow"]
      direction LR
      subgraph Source["Data Source"]
        DB[(Source Postgres)]
        TableA[Table_A]
        TableAHistory[Table_A_History]
        WALLogs[WAL-Logs]
        DB --- TableA
        DB --- TableAHistory
        DB --- WALLogs
      end
      subgraph Integration["Data Integration (dlt)"]
        Arrow["Arrow (in-memory)"]
        Norm["normalization schema"]
        Load["load"]
        LoadTables["Adding Load-Tables"]
        Arrow --> Norm
        Norm --> Load
        Load --> LoadTables
      end
      subgraph Warehouse["Data Warehouse"]
        direction LR
        subgraph LZ["staging (stg)"]
          Address["Address\n...\nload_date"]
          AddressHist["Address_History\n...\nload_date"]
          Car["Car\n...\nload_date"]
          CarHist["Car_History\n...\nload_date"]
        end
        subgraph Core["core (historization)"]
          AddressCore["Address\nvalid_from\nvalid_to"]
          CarCore["Car\nvalid_from\nvalid_to"]
          TheoryExam["Theory_Exam\ncreated_at"]
          LicensePlate["License_Plate\nvalid_from\nvalid_to"]
          MergeSCD2["Merge SCD2"]
        end
        subgraph Mart["data mart (Star Schema)"]
          direction LR
          subgraph Facts["Fact Tables"]
            CarFact["fact_car\n- metrics\n- foreign keys"]
            TheoryFact["fact_theory_exam\n- metrics\n- foreign keys"]
          end
          subgraph Dims["Dimension Tables"]
            DimAddress["dim_address\n- attributes\n- surrogate key"]
            DimLicense["dim_license_plate\n- attributes\n- surrogate key"]
            DimDate["dim_date\n- date attributes\n- surrogate key"]
          end
        end
      end
      subgraph Tools["Tools & Orchestration"]
        direction TB
        subgraph Orchestration["Dagster: Orchestration"]
          LoadDag["Load DAG"]
          ModelingDag["Modeling DAG"]
          LoadDag --> ModelingDag
        end
        subgraph Modeling["dbt: Data Modeling"]
          LZModels["Landing Zone Models"]
          CoreModels["Core Models"]
          MartModels["Mart Models"]
          LZModels --> CoreModels
          CoreModels --> MartModels
        end
      end
    end
  end

  %% Main data flow connections
  Source --> Integration
  Integration --> LZ
  Address --> AddressHist
  Car --> CarHist
  LZ --> MergeSCD2
  MergeSCD2 --> AddressCore
  MergeSCD2 --> CarCore
  MergeSCD2 --> TheoryExam
  MergeSCD2 --> LicensePlate
  AddressCore --> CarFact
  AddressCore --> TheoryFact
  CarCore --> CarFact
  LicensePlate --> CarFact
  TheoryExam --> TheoryFact

  %% Dimension relationships
  CarFact --- DimAddress
  CarFact --- DimLicense
  CarFact --- DimDate
  TheoryFact --- DimAddress
  TheoryFact --- DimDate

  %% Data Modeling Flow
  LZ -.->|"dbt run"| LZModels
  Core -.->|"dbt run"| CoreModels
  Mart -.->|"dbt run"| MartModels

  %% Orchestration
  LoadDag -.->|"triggers"| Integration
  ModelingDag -.->|"triggers"| Modeling

  %%classDef source fill:#1a202c,color:#ffffff %% Dark blue-gray with white text
  %%classDef integration fill:#2d3748,color:#ffffff %% Slightly lighter blue-gray
  %%classDef warehouse fill:#2a4365,color:#ffffff %% Navy blue
  %%classDef core fill:#2c5282,color:#ffffff %% Medium blue
  %%classDef mart fill:#2b6cb0,color:#ffffff %% Light blue
  %%classDef modeling fill:#1a365d,color:#ffffff %% Dark blue
  %%classDef orchestration fill:#233876,color:#ffffff %% Royal blue
  class Source source
  class Integration integration
  class Warehouse warehouse
  class Core core
  class Mart mart
  class Modeling modeling
  class Orchestration orchestration
```
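As a hedged sketch of the dbt lineage in the diagram (staging → core → mart): a staging model can simply expose the dlt-loaded table together with its load date. The source name `dlt_raw` and the column names below are assumptions for illustration, not the actual project:

```sql
-- models/staging/stg_address.sql
-- dbt-style staging model sketch; the source 'dlt_raw' and the columns are
-- hypothetical and only mirror the Address table shown in the diagram.
select
    address_id,
    street,
    city,
    zip_code,
    load_date                 -- written by the dlt load step
from {{ source('dlt_raw', 'address') }}
```

Core and mart models would build on this via `{{ ref('stg_address') }}`, so a `dbt run` triggered by the Dagster modeling DAG materializes the layers in dependency order.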
# Layers
In this overview, we’ll delve into each layer of a complete Data Warehouse (DWH) architecture and explore why this modeling approach is effective:
# Staging Area
Staging Area: This initial layer serves as the landing point for data from various source systems.
- Here, data is stored in its original form, as delivered.
- The structure of the staging tables mirrors the source system interfaces, with no inter-table relationships.
- Each table holds data from the latest delivery, which is cleared before the next batch (see the SQL sketch after this list).
- Think of it like a grocery store’s loading dock, where suppliers (source systems) drop off goods (data) for temporary storage.
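A minimal SQL sketch of this truncate-and-load pattern, with hypothetical schema, table, and column names; the staging table mirrors the source interface one-to-one:

```sql
-- Staging table mirrors the source interface 1:1; no keys or constraints,
-- just the delivered columns plus a load date (all names are hypothetical).
create table if not exists stg.customer (
    customer_id  integer,
    name         text,
    country      text,
    load_date    timestamp
);

-- Per delivery: clear the previous batch, then land the new data as-is.
truncate table stg.customer;

insert into stg.customer (customer_id, name, country, load_date)
select customer_id, name, country, current_timestamp
from source_system.customer;   -- exposed by the source system / ingestion tool
```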
# Cleansing Area
Cleansing Area: Prior to integration into the Core, data undergoes cleaning in the Cleansing Area.
- This involves filtering out, correcting, or supplementing faulty data (see the SQL sketch after this list).
- Data from different sources is transformed and unified.
- Similar to the Staging Area, only the most recent data batch is kept here.
- A grocery store analogy would be the preparation area where goods are made ready for sale, undergoing quality control and labeling.
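A hedged sketch of such a cleansing step, continuing the hypothetical stg.customer table from above; faulty rows are filtered out and differing country spellings are unified:

```sql
-- Cleansing: filter faulty rows, correct values, and unify codes from
-- different sources (schema cls and all names are hypothetical).
truncate table cls.customer;

insert into cls.customer (customer_id, name, country_code, load_date)
select
    customer_id,
    trim(name)                         as name,
    case upper(trim(country))          -- unify different country spellings
        when 'GERMANY'      then 'DE'
        when 'DEUTSCHLAND'  then 'DE'
        when 'SWITZERLAND'  then 'CH'
        else upper(trim(country))
    end                                as country_code,
    load_date
from stg.customer
where customer_id is not null;         -- drop rows without a business key
```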
# Core
Core: Data from various sources converges in the Core, having passed through the Staging and Cleansing areas, and is stored long-term, often for years.
- The Core’s main function is integrating data from diverse sources, organizing it thematically, and historizing it over time (see the SCD2 sketch after this list).
- It’s the go-to source for Data Marts and should generally not be accessed directly by users.
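The diagram’s “Merge SCD2” step can be sketched in Postgres-flavored SQL as two statements, assuming a hypothetical core.customer table with valid_from/valid_to columns: changed records have their current version closed, then a fresh open version is inserted.

```sql
-- SCD2 historization sketch (core.customer, cls.customer and all columns are hypothetical).
-- Step 1: close the currently valid version of every record that changed.
update core.customer c
set    valid_to = current_date - 1
where  c.valid_to = date '9999-12-31'
  and  exists (
         select 1
         from   cls.customer s
         where  s.customer_id = c.customer_id
           and (s.name         is distinct from c.name
             or s.country_code is distinct from c.country_code)
       );

-- Step 2: open a new version for changed records and insert brand-new ones.
insert into core.customer (customer_id, name, country_code, valid_from, valid_to)
select s.customer_id, s.name, s.country_code, current_date, date '9999-12-31'
from   cls.customer s
left join core.customer c
       on  c.customer_id = s.customer_id
       and c.valid_to    = date '9999-12-31'
where  c.customer_id is null;
```

In a dbt setup, snapshots provide roughly the same SCD2 behavior declaratively, so the explicit statements above are only one way to implement the pattern.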
# Data Mart
Data Mart: Marts store subsets of Core data, optimized for user queries.
- Each Mart caters to specific applications or user groups, simplifying query complexity and enhancing system usability (see the star schema sketch after this list).
- These can be likened to specialized market stalls in a grocery store, each offering a curated selection of goods.
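A small star schema sketch on top of the core, again with hypothetical names: the dimension carries a surrogate key and descriptive attributes, and the fact table holds the metrics plus foreign keys to the dimensions.

```sql
-- Dimension: current customer versions with a surrogate key
-- (mart schema and all names are hypothetical).
create table mart.dim_customer as
select
    row_number() over (order by customer_id) as customer_key,   -- surrogate key
    customer_id,
    name,
    country_code
from core.customer
where valid_to = date '9999-12-31';

-- Fact: one row per order, with metrics and foreign keys to the dimensions.
create table mart.fact_order as
select
    d.customer_key,
    o.order_date,
    o.amount
from core.orders o
join mart.dim_customer d using (customer_id);
```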
# Metadata
Metadata: The foundation of the DWH system, essential for its smooth operation.
- Business metadata includes descriptions of attributes, drill paths, and aggregation rules for front-end applications.
- Technical metadata covers data structures, mapping rules, and ETL control parameters.
- Operational metadata encompasses log tables, error messages, ETL process logs, and more, essentially forming the DWH’s infrastructure (a simple ETL log table is sketched after this list).
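As an illustration of operational metadata, a hedged sketch of an ETL log table that a load process might write to after each step; the meta schema, structure, and values are assumptions:

```sql
-- Operational metadata: one row per executed ETL step
-- (meta schema, table structure and values are hypothetical).
create table if not exists meta.etl_log (
    run_id       bigint,
    step_name    text,        -- e.g. 'stg load', 'core SCD2 merge'
    started_at   timestamp,
    finished_at  timestamp,
    rows_loaded  bigint,
    status       text,        -- 'OK' / 'FAILED'
    error_msg    text
);

-- A load process would write entries like this after each step.
insert into meta.etl_log (run_id, step_name, started_at, finished_at, rows_loaded, status)
values (1, 'core SCD2 merge', current_timestamp, current_timestamp, 1250, 'OK');
```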
Not every Data Warehouse adheres strictly to this structure; some areas may be merged or renamed. The essential concept, however, is to segment the system so that each layer specializes in one task. This segmentation facilitates data cleaning, integration, historization, and query handling, and it simplifies the transformation processes between layers.
Origin:
References: Medallion Architecture
Created 2023-04-27