🧠 Second Brain
Backfill
“Backfilling is where you see the difference between a data engineer and a great data engineer.” — Christophe Blefari on LinkedIn
Given a persistent, immutable staging area and pure tasks, it is possible in theory (though rarely advisable) to recompute the state of the entire warehouse from scratch and arrive at exactly the same state. Because historical data can be backfilled at will, the retention policy on derived tables can be shorter.
# What’s a backfill?
A backfill is when you take a data asset that’s normally updated incrementally and update historical parts of it.
For example, you have a table, and each day, you add records to it that correspond to events that happened during that day. Backfilling the table means filling in or overwriting data for days in the past.
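The daily-table example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `staging` and `table` dictionaries are hypothetical in-memory stand-ins for an immutable staging area and a partitioned table, and `build_partition` and `backfill` are made-up names. The point it tries to show is that when each partition is rebuilt by a pure task and overwritten rather than appended to, re-running a backfill leaves the table in the same state.

```python
from datetime import date, timedelta

# Hypothetical stand-ins: `staging` holds raw event values keyed by day,
# `table` holds the derived daily partitions.
staging = {
    date(2022, 8, 1): [3, 4],
    date(2022, 8, 2): [5],
    date(2022, 8, 3): [1, 1, 1],
}
table = {}


def build_partition(day):
    """Pure task: the output depends only on the staged events for `day`."""
    events = staging.get(day, [])
    return {"day": day, "event_count": len(events), "total": sum(events)}


def backfill(start, end):
    """Recompute and overwrite every daily partition from `start` to `end`, inclusive."""
    day = start
    while day <= end:
        table[day] = build_partition(day)  # overwrite the partition, never append
        day += timedelta(days=1)


backfill(date(2022, 8, 1), date(2022, 8, 3))
first_run = dict(table)
backfill(date(2022, 8, 1), date(2022, 8, 3))  # second run rebuilds the same partitions
```

In a real warehouse the overwrite would typically be a partition replace or a delete-then-insert scoped to the day, but the shape of the loop stays the same.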
We say “data assets” rather than “tables” because not all pipelines operate on tabular data. You might also backfill a set of images used to train a vision model, or a set of ML models.
# Why backfill your data?
You typically run a backfill if you’re in one of these situations:
- Greenfield - you’ve added data assets to your data pipeline
- Brownfield - one of the assets in your data pipeline has changed
- Recovering from failure
# Backfills gone wrong
Backfills can go wrong in a few different ways:
- Targeting the wrong subset - If you neglect to backfill parts of your data that need to be backfilled, you risk finishing with the false impression that your data is up-to-date. If you backfill parts of your data that don’t need to be backfilled, then you’ve wasted some time and money. If you backfill a data asset without backfilling the data assets derived from it, you risk ending up with your data in an inconsistent and confusing state.
- Resource overload - Backfills can require significant amounts of memory and processing power. This can cause them to overwhelm your system or starve important workloads.
- Cost overload - A large backfill might end up costing much more than expected.
- Getting lost in the middle - If parts of your backfill fail, you can end up knowing that something has gone wrong but not what, and have to restart from the beginning.
To avoid these issues, it’s essential to plan and execute backfills carefully.
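As a rough sketch of what such planning could look like, the Python below addresses two of the failure modes above: it derives the full set of partitions to rebuild from an explicit date range plus the derived assets, and it records a status per partition so that a failed run can resume instead of restarting. Everything here is an assumption made for illustration: the `DOWNSTREAM` lineage map, the asset names, `plan_backfill`, `run_backfill`, and the `max_partitions` budget are hypothetical, and a real orchestrator would keep the status in durable storage rather than in a dictionary.

```python
from datetime import date, timedelta

# Hypothetical lineage: each asset maps to the assets derived directly from it.
DOWNSTREAM = {
    "events": ["daily_sessions"],
    "daily_sessions": ["weekly_report"],
    "weekly_report": [],
}


def plan_backfill(root, start, end):
    """List (asset, day) pairs for `root` and everything derived from it,
    upstream assets first, limited to the requested date range."""
    order, seen = [], set()

    def visit(asset):
        if asset in seen:
            return
        seen.add(asset)
        for child in DOWNSTREAM[asset]:
            visit(child)
        order.append(asset)  # post-order: derived assets are appended first

    visit(root)
    order.reverse()  # reversed post-order is a topological order
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    return [(asset, day) for asset in order for day in days]


def run_backfill(plan, run_partition, status, max_partitions=None):
    """Run pending partitions in plan order, recording each outcome in `status`.

    Partitions already marked "done" are skipped, so a second call resumes.
    `max_partitions` is a crude cap on the work done by one invocation.
    """
    attempted = 0
    for key in plan:
        if status.get(key) == "done":
            continue
        if max_partitions is not None and attempted >= max_partitions:
            break
        attempted += 1
        try:
            run_partition(*key)
            status[key] = "done"
        except Exception as exc:
            status[key] = f"failed: {exc}"
            break  # stop so derived assets are not built on a failed partition
    return status


plan = plan_backfill("events", date(2022, 8, 1), date(2022, 8, 2))
status = {}
calls = []


def flaky(asset, day):
    # Hypothetical task that fails once to simulate a mid-backfill error.
    calls.append((asset, day))
    if (asset, day) == ("daily_sessions", date(2022, 8, 2)) and calls.count((asset, day)) == 1:
        raise RuntimeError("transient error")


run_backfill(plan, flaky, status)  # stops at the failed partition
run_backfill(plan, flaky, status)  # resumes and skips partitions already done
```

After the first call, `status` shows exactly which partition failed and why; the second call picks up from there. Stopping at the first failure is a deliberately conservative choice for this sketch; a more elaborate version might keep going on partitions that do not depend on the failed one.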
Read more on Backfills in Data & Machine Learning: A Primer.
References: Tenacity (Retrying library for Python)
Created 2022-08-26