Search
Snapshotting
Contrary to SCD2, which evolves over time, snapshotting captures the state of data at specific time intervals, much like Materialized Views.
Similar capabilities can also be achieved with Time Travel in Data Lake Table Formats.
# How it Works
You only snapshotting the Dimensions (facts are event-based anyway) - after a month, you keep one monthly snapshot (end of month), delete the days, and do the same at the end of the year.
This way, you can jump back, and it shouldn’t be too expensive. You loose specific days during a month, but that should be OK for most use cases. It’s always a tradeoff.
# From Maximes Talks
Maxime Beauchemin advocates for the use of partitions in snapshotting. He suggests maintaining two separate tables for each dimension: dimension and dimension_history. He emphasizes:
“The most recent partition is especially valuable as it reflects the current state. Employing table partitioning strategies and creating a view that points directly to the latest partition ensures easy and optimal data access. Effective naming conventions are crucial here, as exemplified by
core.user_historyandcore.user.”
From his Functional Data Engineering talk, he explains snapshotting well:
From the transcript:
Instead of using traditional Slowly Changing Dimension techniques (Type 1: overwrite history, Type 2: surrogate keys with effective dates, Type 3: extra columns), the functional approach is to simply snapshot the entire dimension table for every ETL run.
So if you run daily jobs, you get a complete copy of
dim_supplierfor each day. Yes, this means massive duplication — but the tradeoffs work in your favor:
- Storage is cheap, engineering time is expensive. Dimension tables are typically small relative to fact tables, so even thousands of daily snapshots are trivial for modern engines like BigQuery or Presto.
- The mental model is dead simple. Want the current state? Join on the key and filter to the latest partition. Want the state at the time of a transaction? Join on the key and the partition date. No surrogate key lookups, no effective date management.
- Reproducibility. Since partitions are immutable, you can always recreate any historical state. Run the same query on the same partition, get the same result.
- Time series for free. You can now do queries like “how many suppliers did we have over time?” with a simple
GROUP BY date— something that’s awkward with SCD Type 2.The core idea ties back to Functional Programming: treat partitions as immutable objects. Never update or delete — only insert new snapshots. This makes your ETL idempotent (safe to re-run) and deterministic (same inputs → same outputs).
Origin: Data Modeling