Data Engineering Approaches
Because I had run into these bottlenecks myself, and more frequently of late, I asked myself: how can we
- Make BI easier?
- Make it transparent for everyone?
- Make it effortless to change or add data and transformations while still having some data governance and testing?
- Make ad-hoc queries fast for exploring and slicing and dicing your data?
- Load data more frequently?
- Make the transformations easy to follow for all data-savvy people, not just the BI engineers involved?
- Extend it smoothly with additional machine-learning capabilities?
I’m aware that a lot is going on nowadays, especially around open-source tools and frameworks, DataOps, and deployments with container-orchestration systems and the like.
Nevertheless, I have collected some approaches that helped me make this complex construct more open and ease the overall experience. Some will seem more complicated in the short term, but prove significantly leaner and less complex over time. You can apply each of them separately, yet the more of them you use, the more apparent the flow as a whole becomes.
*Business Intelligence meets Data Engineering with Emerging Technologies*
References:
- Use a data lake
- Use transactional processing
- Use fewer surrogate keys; go back to business keys that everyone understands
- Notebooks
- Use Python (and SQL if possible)
- Load incrementally and idempotently
- Use open-source tools
- Don’t do structure changes (ALTER) in the traditional DDL manner
- Use a container-orchestration system
- Use declarative pipelining instead of imperative
- Use data catalogs to have a central metadata store
- Use closed-source if you don’t have the developers or the time
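To make the "load incrementally and idempotently" idea above concrete, here is a minimal sketch of one common pattern: delete-then-insert by partition key, so re-running the same load never duplicates rows. The `sales` table, its columns, and the `load_date` partition key are hypothetical names chosen for illustration; SQLite stands in for whatever warehouse or lakehouse engine you actually use.

```python
import sqlite3

def incremental_load(conn, batch, load_date):
    """Idempotently load one partition's batch of rows.

    First delete any rows previously loaded for this partition, then
    insert the batch. Running the same load twice yields the same
    table state as running it once.
    """
    cur = conn.cursor()
    # Clear the target partition: this is what makes the load idempotent.
    cur.execute("DELETE FROM sales WHERE load_date = ?", (load_date,))
    # Insert only the incremental batch for this partition.
    cur.executemany(
        "INSERT INTO sales (order_id, amount, load_date) VALUES (?, ?, ?)",
        [(order_id, amount, load_date) for order_id, amount in batch],
    )
    conn.commit()

# Demo with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, amount REAL, load_date TEXT)")

batch = [(1, 9.99), (2, 19.99)]
incremental_load(conn, batch, "2021-06-01")
incremental_load(conn, batch, "2021-06-01")  # re-run of the same day

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]  # still 2, not 4
```

Engines that support `MERGE`/upserts let you express the same idea in one statement, but delete-then-insert per partition works almost everywhere and is easy to reason about when a pipeline is retried.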