🧠 Second Brain

Backfill

Last updated Feb 9, 2024

“Backfilling is where you see the difference between a data engineer and a great data engineer.” — Christophe Blefari on LinkedIn

Given a persistent, immutable staging area and pure tasks, it is possible in theory to recompute the state of the entire warehouse from scratch (not that you should) and arrive at exactly the same state. Because historical data can be backfilled at will, the retention policy on derived tables can be shorter.
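To make the idea concrete, here is a minimal sketch of why pure tasks over an immutable staging area allow derived data to be dropped and rebuilt. Everything in it (`STAGING`, `daily_totals`, `rebuild`) is a hypothetical illustration, not part of any particular warehouse or orchestration tool.

```python
# Immutable staging area (assumed): raw records are appended, never edited.
STAGING = (
    ("2024-01-01", 10),
    ("2024-01-01", 5),
    ("2024-01-02", 7),
)

def daily_totals(staging, day):
    """Pure task: the result depends only on the inputs, with no side effects."""
    return sum(amount for d, amount in staging if d == day)

def rebuild(staging):
    """Recompute every derived partition from scratch."""
    days = sorted({d for d, _ in staging})
    return {day: daily_totals(staging, day) for day in days}

first = rebuild(STAGING)

# Simulate a short retention policy: an old derived partition is dropped...
derived = dict(first)
del derived["2024-01-01"]

# ...and later backfilled from staging. The state matches the original.
derived["2024-01-01"] = daily_totals(STAGING, "2024-01-01")
assert derived == first
```

Because `daily_totals` reads only from staging and writes nothing elsewhere, running it again for the same day always produces the same partition.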

# What’s a backfill?

A backfill is when you take a data asset that’s normally updated incrementally and update historical parts of it.

For example, suppose you have a table and, each day, you add records for the events that happened that day. Backfilling the table means filling in or overwriting data for days in the past.
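The daily-table example can be sketched in a few lines of Python. This is an assumption-laden toy: the table is a plain dict keyed by day, and `EVENTS`, `load_partition`, and `backfill` are hypothetical names chosen for illustration.

```python
from datetime import date, timedelta

# Hypothetical raw events: (event_date, user, amount).
EVENTS = [
    (date(2024, 1, 1), "ann", 10),
    (date(2024, 1, 2), "bob", 5),
    (date(2024, 1, 2), "ann", 7),
    (date(2024, 1, 3), "cat", 3),
]

def load_partition(table, events, day):
    """Rebuild one daily partition, replacing whatever was there before."""
    table[day] = [e for e in events if e[0] == day]

def backfill(table, events, start, end):
    """Re-run the daily load for every day in [start, end], inclusive."""
    day = start
    while day <= end:
        load_partition(table, events, day)
        day += timedelta(days=1)

table = {}

# Normal incremental run: only the current day is loaded.
load_partition(table, EVENTS, date(2024, 1, 3))

# Backfill: fill in the past days that were never loaded.
backfill(table, EVENTS, date(2024, 1, 1), date(2024, 1, 2))
```

Since `load_partition` overwrites the whole partition rather than appending to it, re-running a backfill over the same range does not duplicate rows.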

We say “data asset” rather than “table” because not all pipelines operate on tabular data. You might also backfill a set of images used to train a vision model, or a set of ML models.

# Why backfill your data?

You typically run a backfill if you’re in one of these situations:

# Backfills gone wrong

Backfills can go wrong in a few different ways:

To avoid these issues, it’s essential to plan and execute backfills carefully.


For a worked example, read Backfills in Data & Machine Learning: A Primer by the Dagster team.


Created 2022-08-26