🧠 Second Brain

Search

Search IconIcon to open search

Slowly Changing Dimension - Type 2 (SCD2)

Last updated Feb 18, 2025

Slowly Changing Dimension Type 2 (SCD2) is a fundamental Data Warehousing technique that preserves the historical changes in dimensional data over time.

Unlike Type 1 which overwrites old values, SCD2 maintains a complete history by creating new records whenever tracked attributes change, while keeping the previous versions.

Each record contains effective dates or version numbers to indicate when it was active, making it possible to accurately reconstruct the state of data at any point in time. This is particularly valuable in business intelligence where understanding historical changes in customer information, product details, or organizational structures is crucial for accurate reporting and analysis.

# Key Implementation Questions

When implementing SCD2, consider these essential questions:

  1. In a few words, what is the use case for your Slowly Changing Dimension implementation?
  2. Would you be more interested in SCD type 2 or type 4?
  3. Are you interested to know when records are deleted from the origin source or are non-destructive updates enough?
  4. If you need to know about deletes, how will this info most likely be generated:
    1. The origin data has a soft delete flag, so records are still present but marked as deleted.
    2. The origin data will just not have a row/data item with a given primary key anymore, so dlt will need to figure this out by syncing the full dataset
    3. The origin data provides some kind of log of all changes (such as a write ahead log from postgres) which dlt should stream into the destination dataset and take delete info from there.
  5. Roughly what size does the origin dataset have that you want to load into your data warehouse with scd2 (MB, GB, or TB)?
  6. When scd2 is what you need, what additional fields would you expect to be in the destination table? One possibility is a valid_from and valid_until column which indicates when this row is the current row for the given primary key. A row with an empty valid_until column would be the current row, and a primary key that does not have any rows with an empty valid_until field would be a deleted record. Source, dlt

# SCD2 vs. Snapshotting

Snapshotting your dimensions for a variety of reasons, and for reasons that are not immediately related to this entity-centric approach. The practice can be described as keeping a full snapshot of your dimension, for every single day in the past. This is a lazy alternative to slowly changing dimension modeling techniques that offer the following benefits/tradeoffs:

  • [+] Easy to manage and maintain, minimizing mutations and fitting the functional approach to data engineering
  • [+] Easy point-in-time querying / comparisons
  • [+] Easy time-series analysis on how entities are evolving over time
  • [-] Increased, super redundant storage of information - but who cares in the infinite cheap storage/compute world, especially given dimensions being relatively small in relation to facts

Source: RW Introducing Entity-Centric Data Modeling for Analytics

# Implementation Methods (Examples)

SCD2 method tracks historical data by creating multiple records for a given  natural key in the dimensional tables with separate Surrogate Keys and/or different version numbers.

Unlimited history is preserved for each insert. The natural key in these examples is the “Supplier_Code” of “ABC”.

# Version-Based Tracking

This method uses sequential version numbers to track changes. For example, when a supplier changes location:

Supplier_Key Supplier_Code Supplier_Name Supplier_State Version
123 ABC Acme Supply Co CA 0
124 ABC Acme Supply Co IL 1
125 ABC Acme Supply Co NY 2

# Effective Date Tracking

Uses date ranges to track changes:

Supplier_Key Supplier_Code Supplier_Name Supplier_State Start_Date End_Date
123 ABC Acme Supply Co CA 2000-01-01T00:00:00 2004-12-22T00:00:00
124 ABC Acme Supply Co IL 2004-12-22T00:00:00 NULL

The Start date/time of the second row is equal to the End date/time of the previous row. The null End_Date in row two indicates the current tuple version. A standardized surrogate high date (e.g. 9999-12-31) may instead be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying. In some database software, using an artificial high date value could cause performance issues, that using a null value would prevent.

# Current Flag Method

Combines effective date with a current status flag:

Supplier_Key Supplier_Code Supplier_Name Supplier_State Effective_Date Current_Flag
123 ABC Acme Supply Co CA 2000-01-01T00:00:00 N
124 ABC Acme Supply Co IL 2004-12-22T00:00:00 Y

The Current_Flag value of ‘Y’ indicates the current tuple version.

More details available on Slowly changing dimension - Wikipedia.

# Merge Scripts

/Users/sspaeti/Simon/Sync/3 Resources/Wiki Files/Business Intelligence/MICROSOFT BI/SCD2


Origin: Slowly Changing Dimension
References: Slowly Changing Type 2 with Databricks
Created 2022-06-29