# Delta Load a Data Warehouse
Delta load is the process of loading only changed data from your source OLTP database, the operational database the application runs on. This database is modeled relationally, normalized into hundreds of tables. With a delta load, we want an efficient way of transferring this data into the data warehouse, then modeling and transforming it so we can extract critical business metrics.
There are four options for loading the delta, each with its own advantages and disadvantages. My favorite for a data warehouse is number 2: additional columns give us the best performance and the most efficient way of operationalizing ETL.
| # | Approach | Pro | Contra |
|---|---|---|---|
| 1 | WAL log: read the write-ahead log of the source database (see an example in Postgres WAL Implementation). | The most efficient option, as we do not stress the operational database when loading data into our warehouse, and we do not need to figure out the changes ourselves; the WAL technique does that for us. We read the separate log files, not the database itself. | Implementing such a process is quite complex, unless we use a tool that supports it out of the box (e.g., Airbyte, Debezium), which adds another layer of complexity. Even then, it does not work for an initial load or for recovering from errors, so we would end up implementing different replication mechanisms depending on which mode we are running. |
| 2 | Additional columns (created_at, updated_at): query each table for rows changed since the last load. A little more work, but fairly efficient, as new data can be filtered easily on the date columns. | The biggest advantage is that the full load and the delta load use the same approach: one with a date filter, the other without. | The application must add these columns to every table and update them on each DML operation (INSERT, UPDATE, DELETE). |
| 3 | Database triggers: create triggers that insert a record into a dedicated table whenever data changes. | Catches all changes, even deletes, and writes them into a single log table. | Adds error risk and load on the operational database and application, as every table gets triggers firing on each insert/update/delete. The log table can grow huge, since all changes are written into a single table, creating a potential bottleneck. |
| 4 | Generic scan-and-hash: scan all source tables and rows, generate a hash per row, and compare it against what is already in the data warehouse. | Very versatile and simple to implement. | Does not scale, as it always requires a full table scan and hashing over the entire database. Most of the work is done in-memory in Pandas over a single cursor, which does not scale either. |
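Approach 2 can be sketched with an in-memory SQLite database. The `customers` table, its columns, and the timestamps below are illustrative assumptions, not from any real system:

```python
# Sketch of approach 2: delta load via an updated_at column.
# Table and column names are made up for the example.
import sqlite3

def extract_delta(conn, table, last_loaded_at):
    """Fetch only rows changed since the last load (the delta)."""
    # The full load is the same query without the date filter.
    return conn.execute(
        f"SELECT id, name, updated_at FROM {table} WHERE updated_at > ?",
        (last_loaded_at,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Alice", "2024-01-20 10:00:00"),
        (2, "Bob",   "2024-01-24 09:30:00"),
        (3, "Carol", "2024-01-25 08:15:00"),
    ],
)

# Only rows touched after the previous load's high-water mark come back.
delta = extract_delta(conn, "customers", "2024-01-23 00:00:00")
print(delta)  # rows for Bob and Carol
```

In practice the high-water mark (the latest `updated_at` already loaded) would be persisted by the ETL job between runs; dropping the filter turns the same extraction into the full load, which is exactly the symmetry noted in the table.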
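Approach 3 can also be sketched in SQLite: a trigger per DML operation writes every change into a single log table. The `orders` and `change_log` schemas are illustrative assumptions:

```python
# Sketch of approach 3: database triggers feeding a dedicated log table.
# SQLite trigger syntax; table names are made up for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE change_log (
    table_name TEXT, row_id INTEGER, operation TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER orders_ins AFTER INSERT ON orders
BEGIN
    INSERT INTO change_log (table_name, row_id, operation)
    VALUES ('orders', NEW.id, 'INSERT');
END;
CREATE TRIGGER orders_del AFTER DELETE ON orders
BEGIN
    INSERT INTO change_log (table_name, row_id, operation)
    VALUES ('orders', OLD.id, 'DELETE');
END;
""")

conn.execute("INSERT INTO orders (id, amount) VALUES (1, 9.99)")
conn.execute("DELETE FROM orders WHERE id = 1")

# The warehouse loader reads change_log instead of scanning orders,
# so even the delete is captured.
log = conn.execute(
    "SELECT table_name, row_id, operation FROM change_log"
).fetchall()
print(log)  # [('orders', 1, 'INSERT'), ('orders', 1, 'DELETE')]
```

Note that every table needing capture would get its own set of triggers, which is where the operational load and the single-table bottleneck from the table above come from.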
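Approach 4 can be sketched with plain hashing: every source row is hashed and compared against the hashes the warehouse stored on the previous load. The rows and keys below are illustrative assumptions:

```python
# Sketch of approach 4: full scan, hash each row, compare with stored hashes.
import hashlib

def row_hash(row):
    """Stable hash over all column values of a row."""
    return hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()

source_rows = [
    (1, "Alice", "alice@example.com"),
    (2, "Bob", "bob@example.com"),      # email changed since last load
    (3, "Carol", "carol@example.com"),  # new row
]

# Hashes the warehouse kept from the previous load, keyed by primary key.
warehouse_hashes = {
    1: row_hash((1, "Alice", "alice@example.com")),
    2: row_hash((2, "Bob", "old@example.com")),
}

# Every source row must be scanned and hashed on every run --
# this is exactly why the approach does not scale with table size.
changed = [
    row for row in source_rows
    if warehouse_hashes.get(row[0]) != row_hash(row)
]
print(changed)  # Bob (changed) and Carol (new)
```

The comparison finds both updates and new rows, but the cost is always proportional to the full source table, matching the contra column above.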
Origin: Load incremental and Idempotency
References: Postgres
Created 2024-01-25