Incremental Loading and Idempotency
In contrast to the nightly full loads of traditional data warehouses, we need to load incrementally. This makes your data more modular and manageable, especially when you have a star schema: fact tables are append-only, and dimensions only need to scan the newest transactions instead of the entire fact table.
With the incremental approach, you switch from batch to event-driven processing. Your updates and inserts become independent of each other, and you get autonomous batches. If you succeed in switching, you get a near real-time analytics solution in which you can scale and parallelise those batches.
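To make the incremental idea concrete, here is a minimal sketch in plain Python. It assumes a toy in-memory warehouse and a monotonically increasing transaction id used as a watermark; the names `Warehouse`, `load_increment`, `fact_sales` and `dim_customer` are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Warehouse:
    """Toy in-memory warehouse: an append-only fact table plus one dimension."""
    fact_sales: list = field(default_factory=list)
    dim_customer: dict = field(default_factory=dict)
    watermark: int = 0  # highest transaction id already loaded (assumption)


def load_increment(wh: Warehouse, source_rows: list) -> int:
    """Load only rows newer than the watermark; return how many were loaded."""
    new_rows = [r for r in source_rows if r["tx_id"] > wh.watermark]
    # Facts are append-only: existing rows are never touched.
    wh.fact_sales.extend(new_rows)
    # The dimension is updated from the new rows only, not from the whole fact table.
    for r in new_rows:
        wh.dim_customer.setdefault(r["customer"], {"first_tx": r["tx_id"]})
    if new_rows:
        wh.watermark = max(r["tx_id"] for r in new_rows)
    return len(new_rows)


source = [
    {"tx_id": 1, "customer": "alice", "amount": 10.0},
    {"tx_id": 2, "customer": "bob", "amount": 5.0},
]
wh = Warehouse()
first = load_increment(wh, source)   # initial load picks up both rows
source.append({"tx_id": 3, "customer": "alice", "amount": 7.5})
second = load_increment(wh, source)  # next run only picks up the new row
```

Each call is an autonomous batch: it depends only on the watermark and the rows that arrived since the last run, which is what makes such loads easy to schedule more often or to split up.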
Another approach is enforcing idempotency, which is vital for the operability of pipelines and helps mainly in two ways. First, it guarantees that you can re-run your pipeline and get the same result every time. Second, data scientists and analysts rely on point-in-time snapshots to perform historical analyses, which means your data should not mutate as time progresses. Otherwise, the same analysis would return different results depending on when it is run. That’s why pipelines should be built to reproduce the same output when run with the same (business) logic and time interval. This property is called idempotency, a concept borrowed from functional programming, which served as its role model.
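One common way to get this behaviour is to overwrite the output for a given time interval instead of appending to it. The following is only a sketch under that assumption, using a plain dictionary as the target table; `run_pipeline` and the event layout are hypothetical.

```python
from datetime import date


def run_pipeline(events: list, target: dict, day: date) -> None:
    """Recompute the daily revenue partition for `day` and overwrite it."""
    total = sum(e["amount"] for e in events if e["day"] == day)
    # Overwrite the partition instead of appending to it, so a re-run
    # replaces the previous result rather than duplicating it.
    target[day] = {"revenue": total}


events = [
    {"day": date(2020, 6, 13), "amount": 10.0},
    {"day": date(2020, 6, 14), "amount": 5.0},
    {"day": date(2020, 6, 14), "amount": 7.5},
]
target = {}
run_pipeline(events, target, date(2020, 6, 14))
snapshot = dict(target)
run_pipeline(events, target, date(2020, 6, 14))  # re-run, e.g. after a failure
```

After the second run, `target` is identical to `snapshot`. Had the function appended a row per run, the re-run would have doubled the revenue for that day.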
A nice side effect of event-driven and incremental loading is that you can eliminate the Lambda Architecture. What remains is a single data flow for both batch and streaming, which is perfectly in line with Delta Lake. Spark Structured Streaming with its integrated micro-batching option is well suited for that purpose. That way you benefit from both worlds: you can stream and set the latency as low as possible, but you can also slow processing down, for example to one batch per hour. This mitigates the overhead of dimension lookups or aggregations, which you would otherwise need to perform for each streamed event (which might take time as well).
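The trade-off behind micro-batching can be illustrated without Spark. The sketch below is plain Python, not the Structured Streaming API: it groups events by count as a stand-in for a trigger interval, and `load_dimension` is a hypothetical placeholder for an expensive dimension lookup.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield lists of up to `batch_size` events from an iterable."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def load_dimension():
    """Stand-in for an expensive dimension lookup (hypothetical data)."""
    return {"alice": "CH", "bob": "DE"}


def process(stream, batch_size):
    """Enrich events batch by batch; return the rows and the lookup count."""
    enriched, lookups = [], 0
    for batch in micro_batches(stream, batch_size):
        dim = load_dimension()  # once per batch, not once per event
        lookups += 1
        enriched.extend({**e, "country": dim[e["customer"]]} for e in batch)
    return enriched, lookups


events = [{"customer": c} for c in ["alice", "bob", "alice", "bob", "alice"]]
rows, lookups_batched = process(events, batch_size=2)
_, lookups_per_event = process(events, batch_size=1)
```

With a batch size of 2, the five events need three lookups instead of five. Larger batches reduce that overhead further at the cost of higher latency, which is the dial the paragraph above describes.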
# Further Reads
Origin:
Business Intelligence meets Data Engineering with Emerging Technologies | ssp.sh
References: idempotency
Created 2020-06-14