<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
    <channel>
        <title>All Posts - Data Engineering Blog</title>
        <link>https://www.ssp.sh/posts/</link>
        <description>All Posts | Data Engineering Blog</description>
        <generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>hello@sspaeti.com (Simon Späti)</managingEditor>
            <webMaster>hello@sspaeti.com (Simon Späti)</webMaster><copyright>All rights reserved. Sharing of excerpts with proper attribution is encouraged for non-commercial purposes. For commercial use or republication, please contact hello@sspaeti.com.</copyright><lastBuildDate>Wed, 08 Apr 2026 00:08:08 &#43;0200</lastBuildDate><atom:link href="https://www.ssp.sh/posts/" rel="self" type="application/rss+xml" /><item>
    <title>Rust for Data Engineering</title>
    <link>https://www.ssp.sh/blog/rust-for-data-engineering/</link>
    <pubDate>Wed, 19 Oct 2022 09:31:09 &#43;0100</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/rust-for-data-engineering/</guid><enclosure url="https://www.ssp.sh/blog/rust-for-data-engineering/feature-rust-vs-python.jpg" type="image/jpeg" length="0" /><description><![CDATA[<p>Will Rust kill Python for Data Engineers? If you only came here to know this, my answer is no. Betteridge&rsquo;s Law strikes again!</p>
<p>But then again, you have to ask: was <em>Python</em> made for Data Engineering in the first place?</p>
<p>Rust may not replace Python outright, but it has taken over more and more of the JavaScript tooling ecosystem, and a growing number of projects are trying to do the same for Python and data engineering. Let&rsquo;s explore why Rust has potential for data engineers, what it does well, and why it has been the most loved programming language for seven years running.</p>
<h2 id="what-is-rust">What is Rust?</h2>
<p>Former Mozilla employee Graydon Hoare initially created <a href="https://glossary.airbyte.com/term/rust" target="_blank" rel="noopener noreffer">Rust</a> as a personal project. The first stable release, Rust 1.0, came out on May 15, 2015. Rust is a <strong><a href="https://en.wikipedia.org/wiki/Comparison_of_multi-paradigm_programming_languages" target="_blank" rel="noopener noreffer">multi-paradigm programming language</a></strong> that supports imperative procedural, concurrent actor, object-oriented and pure <a href="https://glossary.airbyte.com/term/functional-programming" target="_blank" rel="noopener noreffer">functional</a> styles, and it supports generic programming and metaprogramming, both statically and dynamically.</p>
<blockquote>The goal of Rust is to be a good programming language for creating highly <strong>concurrent, safe, and performant systems</strong>.</blockquote>
<h2 id="what-is-unique-about-rust">What Is Unique about Rust?</h2>
<p>Rust solves pain points of other programming languages with minimal downsides. With Rust being a compiled programming language, strong type and system checks are enforced during compile-time—meaning pre-runtime! Unlike <a href="https://glossary.airbyte.com/term/python" target="_blank" rel="noopener noreffer">Python</a>&rsquo;s interpreted approach, where most errors only surface at run-time, in Rust most errors surface during the coding phase. It can be frustrating to fight every single mistake before being able to test or run a quick script, but at the same time, the compiler is much faster at finding bugs than I am. Additionally, the Rust community puts a lot of effort into making the error messages super informative.</p>
<figure>
<a target="_blank" href="/blog/rust-for-data-engineering/images/rust-compiler-in-action.jpg" title="/blog/rust-for-data-engineering/images/rust-compiler-in-action.jpg">

</a><figcaption class="image-caption">An <a href="https://doc.rust-lang.org/book/ch02-00-guessing-game-tutorial.html" target="_blank" rel="noopener noreffer">example</a> of how Rust surfaces an error during development and suggests changes</figcaption>
</figure>
<div class="details admonition note open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-note"></i>Ownership, Memory Saftey, Reference Borrowing<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">There are much more specifics that differentiate Rust from other programming languages. Concepts such as <a href="https://doc.rust-lang.org/book/ch04-00-understanding-ownership.html" target="_blank" rel="noopener noreffer">Ownership</a> for memory safety, <a href="https://doc.rust-lang.org/book/ch04-02-references-and-borrowing.html" target="_blank" rel="noopener noreffer">Reference Borrowing</a>, and many more, but this article is not meant to be a deep dive into Rust, but rather map it to the field of data engineering.</div>
        </div>
    </div>
<h2 id="why-rust-for-data-engineers">Why Rust for Data Engineers?</h2>
<p>When you write any code, the goal is that it doesn&rsquo;t break over the weekend or at night while you sleep. Rust shows you errors and improvements while coding and fails as early as possible at compile-time, which is less costly than failing later in production at run-time.</p>
<p>The go-to language for data engineers is Python, which isn&rsquo;t the most robust or safe language, as many engineers working with data will agree. Rust&rsquo;s developer experience goes much further than just offering a language specification and a compiler; many aspects of creating and maintaining production-quality software are treated as first-class citizens in the Rust ecosystem.</p>
<h3 id="what-rust-does-well">What Rust Does Well</h3>
<p>Python is dynamically typed (with only recent support for <a href="https://docs.python.org/3/library/typing.html" target="_blank" rel="noopener noreffer">type hints</a>) and requires writing extensive tests to catch costly type errors. But that takes a lot of time, and you must foresee every potential error to write a test for it.</p>
<p>Rust is the opposite; it forces you to <strong>define types</strong> (or does it implicitly with <a href="https://doc.rust-lang.org/rust-by-example/types/inference.html" target="_blank" rel="noopener noreffer">type inference</a>) and enforces them. This does not make testing obsolete, of course, but the Rust compiler, for example, performs <a href="https://rustc-dev-guide.rust-lang.org/borrow_check.html" target="_blank" rel="noopener noreffer">borrow checking</a> and does <a href="https://rustc-dev-guide.rust-lang.org/overview.html" target="_blank" rel="noopener noreffer">things</a> to your code that other compilers don&rsquo;t do—check out <a href="https://rust-analyzer.github.io/" target="_blank" rel="noopener noreffer">rust-analyzer</a> to bring these checks into your IDE of choice. This makes Rust very good for data engineers, as we have many moving parts, such as incoming data sets that we do not control. <strong>Defining expectations</strong> with data types and having rigorous checks at coding and compile time will prevent many errors.</p>
<p>Less relevant for data engineers, but super helpful: <strong>speed</strong>. Rust, as a compiled language, is very fast at run-time. To many, Rust is primarily an alternative to other systems programming languages like C or C++. But you don&rsquo;t need a systems use case to use a systems language, as both <a href="https://leerob.io/blog/rust" target="_blank" rel="noopener noreffer">Vercel</a> and <a href="https://www.crowdstrike.com/blog/data-science-test-drive-of-rust-programming-language/" target="_blank" rel="noopener noreffer">CrowdStrike</a> have noticed.</p>
<p>Another one is <strong>integrations</strong>. With data pipelines being the glue code in most cases, connecting otherwise foreign systems, Rust runs almost platform-agnostically. Rust makes it easy to integrate and communicate with other languages through a so-called <a href="https://en.wikipedia.org/wiki/Foreign_function_interface" target="_blank" rel="noopener noreffer"><em>foreign function interface (FFI)</em></a>. The FFI provides a <strong>zero-cost abstraction</strong>, where function calls between Rust and C have identical performance to C function calls. Rust can easily be called from C, Python, Ruby, and vice versa. Find more in <a href="https://blog.rust-lang.org/2015/04/24/Rust-Once-Run-Everywhere.html" target="_blank" rel="noopener noreffer">Rust Once, Run Everywhere</a>.</p>
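<p>For data engineers, this usually shows up as a Rust library exposed to Python, for example via <a href="https://github.com/PyO3/pyo3" target="_blank" rel="noopener noreffer">PyO3</a>. A minimal sketch of the Python side could look like the following; the module name <code>rustlib</code> and its function are purely hypothetical stand-ins for a compiled Rust extension:</p>
<pre><code class="language-python"># Hypothetical: a Rust extension module built with PyO3/maturin and installed
# into the Python environment as "rustlib" (name and function are illustrative).
import rustlib

# From Python it looks like a normal function call; the heavy lifting
# (parsing, aggregation) happens in compiled Rust code.
prices = [499_000, 750_000, 1_200_000]
total = rustlib.sum_prices(prices)
print(total)
</code></pre>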
<p>A less technical but still important element is to <strong>love</strong>, or have <strong>fun</strong> with, your programming language. Rust is a harder language to learn, but it has been the most loved language for seven years in a row (<a href="https://survey.stackoverflow.co/2022/#section-most-loved-dreaded-and-wanted-programming-scripting-and-markup-languages" target="_blank" rel="noopener noreffer">2022</a>, <a href="https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted" target="_blank" rel="noopener noreffer">2021</a>, <a href="https://insights.stackoverflow.com/survey/2020#most-loved-dreaded-and-wanted" target="_blank" rel="noopener noreffer">2020</a>, <a href="https://insights.stackoverflow.com/survey/2019#technology-_-most-loved-dreaded-and-wanted-languages" target="_blank" rel="noopener noreffer">2019</a>, <a href="https://insights.stackoverflow.com/survey/2018#technology-_-most-loved-dreaded-and-wanted-languages" target="_blank" rel="noopener noreffer">2018</a>, <a href="https://insights.stackoverflow.com/survey/2017#technology-_-most-loved-dreaded-and-wanted-languages" target="_blank" rel="noopener noreffer">2017</a>, <a href="https://insights.stackoverflow.com/survey/2016#technology-most-loved-dreaded-and-wanted" target="_blank" rel="noopener noreffer">2016</a>) in the Stack Overflow developer survey:</p>
<figure>
<a target="_blank" href="/blog/rust-for-data-engineering/images/love-vs-dreaded-wanted-programming-languages.jpg" title="/blog/rust-for-data-engineering/images/love-vs-dreaded-wanted-programming-languages.jpg">

</a><figcaption class="image-caption">Loved vs. Dreaded and most Wanted Programming Language on StackOverflow Survey 2022</figcaption>
</figure>
<p>Besides the love, awareness of Rust is also rising across different trend metrics, such as <a href="https://trends.google.com/trends/explore?date=today%205-y&amp;q=%2Fm%2F0dsbpg6" target="_blank" rel="noopener noreffer">Google Trends</a>, a 2019 <a href="http://www.benfrederickson.com/ranking-programming-languages-by-github-users/" target="_blank" rel="noopener noreffer">ranking on GitHub</a>, or the Stack Overflow trends below:</p>
<figure>
<a target="_blank" href="/blog/rust-for-data-engineering/images/recent-programming-language-trends-stackoverflow.jpg" title="/blog/rust-for-data-engineering/images/recent-programming-language-trends-stackoverflow.jpg">

</a><figcaption class="image-caption"><a href="https://insights.stackoverflow.com/trends" target="_blank" rel="noopener noreffer">StckOverflow Trends</a></figcaption>
</figure>
<div class="details admonition info open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-info"></i>Why Rust is Popular<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">For software engineers, many issues around systems programming are memory errors. Rust&rsquo;s goal is to design a project with quality code management, readability, and quality performance at runtime.</div>
        </div>
    </div>
<h2 id="interesting-open-source-rust-projects">Interesting Open-Source Rust Projects</h2>
<p>A language is only ever as good as its community. Let&rsquo;s look at some of the existing open-source tools and frameworks built in and around Rust:</p>
<ul>
<li><a href="https://github.com/apache/arrow-datafusion" target="_blank" rel="noopener noreffer">DataFusion</a> based on <a href="https://glossary.airbyte.com/term/apache-arrow" target="_blank" rel="noopener noreffer">Apache Arrow</a>: Apache Arrow DataFusion SQL Query Engine similar to <a href="https://glossary.airbyte.com/term/apache-spark" target="_blank" rel="noopener noreffer">Spark</a></li>
<li><a href="https://github.com/pola-rs/polars" target="_blank" rel="noopener noreffer">Polars</a>: It&rsquo;s a faster <a href="https://glossary.airbyte.com/term/pandas" target="_blank" rel="noopener noreffer">Pandas</a>. Probably going to compete with <a href="https://glossary.airbyte.com/term/duckdb" target="_blank" rel="noopener noreffer">DuckDB</a> (?)</li>
<li><a href="https://github.com/delta-io/delta-rs" target="_blank" rel="noopener noreffer">Delta Lake Rust</a>: A native Rust library for <a href="https://glossary.airbyte.com/term/delta-lake" target="_blank" rel="noopener noreffer">Delta Lake</a>, with bindings into Python and Ruby</li>
<li><a href="https://github.com/cube-js/cube.js" target="_blank" rel="noopener noreffer">Cube</a>: Headless BI for Building Data Applications
<ul>
<li>Written <a href="https://cube.dev/blog/open-source-looker-alternative" target="_blank" rel="noopener noreffer">mostly in Rust</a>, Cube’s data processing and storage are based on the Arrow DataFusion query execution framework, which uses Apache Arrow as its in-memory format. In particular, the core of Cube, the cache layer called <a href="https://cube.dev/blog/introducing-cubestore" target="_blank" rel="noopener noreffer">Cube Store</a>, is built 100% in Rust</li>
</ul>
</li>
<li><a href="https://github.com/vectordotdev/vector" target="_blank" rel="noopener noreffer">Vector.dev</a>: A high-performance observability data pipeline for pulling system data (logs, metadata)</li>
<li><a href="https://github.com/roapi/roapi" target="_blank" rel="noopener noreffer">ROAPI</a>: Create full-fledged APIs for slowly moving datasets without writing a single line of code</li>
<li><a href="https://github.com/meilisearch/meilisearch" target="_blank" rel="noopener noreffer">Meilisearch</a>: Lightning Fast, Ultra Relevant, and Typo-Tolerant search engine</li>
<li><a href="https://github.com/quickwit-oss/tantivy" target="_blank" rel="noopener noreffer">Tantivy</a>: A full-text search engine library</li>
<li><a href="https://github.com/prql/prql" target="_blank" rel="noopener noreffer">PRQL</a>: Pipelined Relational Query Language for transforming data</li>
<li>Many more; please let me know of any</li>
</ul>
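<p>To give a feel for how these Rust-based tools surface in a day-to-day Python workflow, here is a minimal Polars sketch. The file and column names are made up, and depending on your Polars version the grouping method may be spelled <code>groupby</code> instead of <code>group_by</code>:</p>
<pre><code class="language-python">import polars as pl

# Read a (hypothetical) CSV of property listings into a DataFrame;
# Polars executes this with its Rust/Apache Arrow engine under the hood.
df = pl.read_csv("properties.csv")

summary = (
    df.filter(pl.col("price") &gt; 0)
      .group_by("city")
      .agg(pl.col("price").mean().alias("avg_price"))
)
print(summary)
</code></pre>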
<p>Less relevant to data engineering, but still cool:</p>
<ul>
<li><a href="https://github.com/denoland/deno" target="_blank" rel="noopener noreffer">Deno</a>: This is a fast Node.js version</li>
<li><a href="https://github.com/tauri-apps/tauri" target="_blank" rel="noopener noreffer">Tauri</a>: Tauri is a framework for building tiny, blazingly fast binaries for all major desktop platforms</li>
<li><a href="https://github.com/yewstack/yew" target="_blank" rel="noopener noreffer">Yew</a>: A modern Rust framework for creating multi-threaded front-end web apps with WebAssembly.</li>
</ul>
<p>Read more in a curated <a href="https://www.ssp.sh/brain/great-open-source-tools-in-rust/" target="_blank" rel="noopener noreffer">List of great Open-Source Rust Projects</a>.</p>
<h2 id="rust-vs-python">Rust vs. Python</h2>
<p>The downside of Rust: the learning curve is much steeper than for other languages such as Python. That&rsquo;s why, for a long time to come, most Rust programs in data engineering will have a Python <a href="https://github.com/PyO3/pyo3" target="_blank" rel="noopener noreffer">wrapper</a> for integrating them into Python data pipelines. It&rsquo;s also a shift from an interpreted language such as Python to a more <a href="https://glossary.airbyte.com/term/functional-programming" target="_blank" rel="noopener noreffer">Functional Programming (FP)</a> style, which Rust certainly supports.</p>
<div class="details admonition note open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-note"></i>The upside and downside of the Python language<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content"><p>What makes Python popular right now:</p>
<ul>
<li>It’s old</li>
<li>It’s beginner-friendly</li>
<li>It’s versatile</li>
</ul>
<p>The downsides of Python:</p>
<ul>
<li>Speed / Multithreading</li>
<li>Scope</li>
<li>Mobile Development</li>
<li>Runtime Errors</li>
</ul>
<p>Read more in <a href="https://towardsdatascience.com/why-python-is-not-the-programming-language-of-the-future-30ddc5339b66" target="_blank" rel="noopener noreffer">Why Python is not the programming language of the future</a> or in a <a href="https://twitter.com/sspaeti/status/1580551324281999360" target="_blank" rel="noopener noreffer">small Twitter poll</a> on whether Rust is suited for data engineering.</p>
</div>
        </div>
    </div>
<h3 id="other-recent-programming-languages">Other Recent Programming Languages</h3>
<p>Newer programming languages lean toward the functional programming approach. New functional languages emerged, such as <a href="https://github.com/scala/scala" target="_blank" rel="noopener noreffer">Scala</a> with <a href="https://github.com/akka/akka" target="_blank" rel="noopener noreffer">Akka</a> and <a href="https://github.com/elixir-lang/elixir" target="_blank" rel="noopener noreffer">Elixir</a>, as well as multi-paradigm languages such as <a href="https://github.com/JuliaLang/julia" target="_blank" rel="noopener noreffer">Julia</a>, <a href="https://github.com/JetBrains/kotlin" target="_blank" rel="noopener noreffer">Kotlin</a> (one of the <a href="https://insights.stackoverflow.com/trends?tags=rust%2Cscala%2Celixir%2Cclojure%2Cgo%2Chaskell%2Ckotlin" target="_blank" rel="noopener noreffer">fastest-growing</a> languages since Google made it the default for Android development), and <a href="https://github.com/rust-lang/rust" target="_blank" rel="noopener noreffer">Rust</a>.</p>
<p><a href="https://github.com/golang/go" target="_blank" rel="noopener noreffer">GoLang</a> seems to be a good compiled programming language usedin <a href="https://glossary.airbyte.com/term/dev-ops" target="_blank" rel="noopener noreffer">DevOps</a>.</p>
<p><a href="https://github.com/elixir-lang/elixir" target="_blank" rel="noopener noreffer">Elixir</a> has servers monitoring data pipelines and re-tries included in the language; no framework is needed. It makes an excellent fit for data engineering and would replace parts of the <a href="https://glossary.airbyte.com/term/data-orchestrator" target="_blank" rel="noopener noreffer">Data Orchestrators</a>.</p>
<h3 id="rust-as-a-primary-language">Rust as a Primary Language?</h3>
<p>Let&rsquo;s see an example of a modern data pipeline integrating with <a href="http://airbyte.com/" target="_blank" rel="noopener noreffer">Airbyte</a>, <a href="http://getdbt.com/" target="_blank" rel="noopener noreffer">dbt</a>, and some ML models in Python.</p>
<p>Each step can have errors and data mismatches. That&rsquo;s why we have orchestrator frameworks such as <a href="http://dagster.io/" target="_blank" rel="noopener noreffer">Dagster</a>, which force you to write functional code or the concept of <a href="https://glossary.airbyte.com/term/functional-data-engineering/" target="_blank" rel="noopener noreffer">Functional Data Engineering</a>. There is also lots of adoption in Python with the type hint or writing more <a href="https://towardsdatascience.com/how-to-make-your-python-code-more-functional-b82dad274707" target="_blank" rel="noopener noreffer">Python and Functional Programming</a> style. Or to bring up an example of another language, JavaScript, the rise of <a href="https://github.com/microsoft/TypeScript" target="_blank" rel="noopener noreffer">TypeScript</a>.</p>
<div class="details admonition question open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-question"></i>Will Rust be adapted?<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">The exciting question to me is whether Rust will be adapted as a <strong>primary language</strong> and can do data orchestration work?</div>
        </div>
    </div>
<p>Within our data pipelines, we typically load data into a data frame and transform it or add some business logic. This could be done efficiently with Rust, Apache Arrow, and DataFusion, which are type-safe and have a good ecosystem. Time will tell.</p>
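<p>As a rough sketch of what that could look like from Python today, here is a minimal example using the DataFusion Python bindings; the file and column names are made up, and the exact API may differ between releases of the <code>datafusion</code> package:</p>
<pre><code class="language-python">from datafusion import SessionContext

ctx = SessionContext()
# Register a (hypothetical) CSV file as a table; DataFusion infers the schema.
ctx.register_csv("properties", "properties.csv")

# The SQL is planned and executed by the Rust/Arrow engine.
df = ctx.sql("SELECT city, AVG(price) AS avg_price FROM properties GROUP BY city")
print(df.to_pandas())
</code></pre>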
<h2 id="will-rust-be-the-programming-language-for-data-engineers">Will Rust Be the Programming Language for Data Engineers?</h2>
<p>Rust is a multi-use language and gets the job done for many problems of a data engineer. But the data engineering space is dominated by Python (and <a href="https://glossary.airbyte.com/term/sql" target="_blank" rel="noopener noreffer">SQL</a>) and will stay that way for the foreseeable future. There is no &ldquo;until people fully move into Rust&rdquo;. It&rsquo;s hard to express how many <a href="https://en.wikipedia.org/wiki/List_of_Python_software" target="_blank" rel="noopener noreffer">tools and frameworks</a> are written in Python to interoperate with other Python tools. It&rsquo;s pretty hard to imagine that inertia changing substantially in the next decade.</p>
<p>The Rust projects we have seen above are excellent and will continue to grow for vital and core components, but to be helpful for the average data engineer, they will need Python wrappers. What was once supposed to be Scala may now be Rust—a backend tooling language for tasks that need fast and well-maintained code, including a Python wrapper on top.</p>
<p>Writing libraries in Rust feels more like building long-term infrastructure than writing in higher-level languages such as Python, Java, or other JVM languages.</p>
<p>What do you think? What is your take on Rust for data engineers?</p>
<h3 id="resources-to-learn-more-on-the-topic">Resources to Learn More on the Topic</h3>
<p>If you want to be up and running within minutes, Karim Jedda has an <a href="https://karimjedda.com/carefully-exploring-rust/" target="_blank" rel="noopener noreffer">article</a> carefully exploring the Rust programming ecosystem as a 10+ year Python developer, checking how to do everyday programming tasks and what the tooling looks like. Shared Services of Canada did a <a href="https://www.statcan.gc.ca/en/data-science/network/engineering-rust" target="_blank" rel="noopener noreffer">hands-on example</a> with Rust, converting raw archive files into JSON for data analysis. Or read Mehdi Ouazza&rsquo;s article where he debates the <a href="https://betterprogramming.pub/the-battle-for-data-engineers-favorite-programming-language-is-not-over-yet-bb3cd07b14a0" target="_blank" rel="noopener noreffer">Battle for Data Engineer&rsquo;s Favorite Programming Language</a>.</p>
<p>There are many excellent resources for learning Rust: <a href="https://fasterthanli.me/articles/a-half-hour-to-learn-rust" target="_blank" rel="noopener noreffer">A half-hour to learn Rust</a>, <a href="https://doc.rust-lang.org/book/" target="_blank" rel="noopener noreffer">The Rust Book</a>, <a href="https://doc.rust-lang.org/stable/rust-by-example/" target="_blank" rel="noopener noreffer">Rust By Example</a>, <a href="https://readrust.net/" target="_blank" rel="noopener noreffer">Read Rust</a>, or <a href="https://this-week-in-rust.org/" target="_blank" rel="noopener noreffer">This Week In Rust</a>.</p>
<p>Or <a href="https://learning-rust.github.io/" target="_blank" rel="noopener noreffer">Learning Rust</a> with different kinds of formats:</p>
<ul>
<li><a href="http://www.arewewebyet.org/" target="_blank" rel="noopener noreffer">Are we web yet?</a></li>
<li><a href="http://arewegameyet.com/" target="_blank" rel="noopener noreffer">Are we game yet?</a></li>
<li><a href="http://www.arewelearningyet.com/" target="_blank" rel="noopener noreffer">Are we learning yet?</a></li>
<li><a href="https://areweguiyet.com/" target="_blank" rel="noopener noreffer">Are we GUI yet?</a></li>
<li><a href="https://areweaudioyet.com/" target="_blank" rel="noopener noreffer">Are we audio yet?</a></li>
</ul>
<p>Or do you want to get hands-on and search for an example project? How about building an <a href="https://github.com/airbytehq/airbyte/issues/16322" target="_blank" rel="noopener noreffer">Airbyte Delta Lake Destination</a> (Python interface) with <a href="https://github.com/delta-io/delta-rs" target="_blank" rel="noopener noreffer">delta-rs</a>?</p>
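<p>To get a feel for delta-rs from the Python side before building such a destination, a minimal sketch could look like this (paths and data are made up; see the <a href="https://github.com/delta-io/delta-rs" target="_blank" rel="noopener noreffer">delta-rs</a> docs for the current Python API):</p>
<pre><code class="language-python">import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Write a small DataFrame to a local Delta table (created if it doesn't exist).
df = pd.DataFrame({"id": [1, 2], "price": [499000, 750000]})
write_deltalake("./delta/properties", df, mode="append")

# Read it back -- no Spark cluster involved.
print(DeltaTable("./delta/properties").to_pandas())
</code></pre>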
<hr>
<pre class=""><em>Originally published at <a href="https://airbyte.com/blog/rust-for-data-engineering/" target="_blank" rel="noopener noreferrer">Airbyte.com</a></em></pre>
]]></description>
</item>
<item>
    <title>Building a Data Engineering Project in 20 Minutes</title>
    <link>https://www.ssp.sh/blog/data-engineering-project-in-twenty-minutes/</link>
    <pubDate>Tue, 09 Mar 2021 18:54:25 &#43;0000</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/data-engineering-project-in-twenty-minutes/</guid><enclosure url="https://www.ssp.sh/blog/data-engineering-project-in-twenty-minutes/images/open-source-logos.png" type="image/png" length="0" /><description><![CDATA[<div class="details admonition tip open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-tip"></i>Project Updates on 2024-03-17<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content"><p>After three years, recognizing the evolving landscape of data engineering tools and the ongoing relevance of this project, I&rsquo;ve made several updates to <a href="https://github.com/ssp-data/practical-data-engineering/" target="_blank" rel="noopener noreffer">Practical Data Engineering</a>. Key changes include updating components like Dagster and Delta Lake, removing Spark in favor of using Delta-rs directly, to streamline local development and simplify the architecture. For those interested in the original architecture, the <a href="https://github.com/ssp-data/practical-data-engineering/tree/v1" target="_blank" rel="noopener noreffer">v1 branch</a> preserves the initial setup.</p>
<p>I also added a quick <a href="https://youtu.be/FfDOsgg2EEQ" target="_blank" rel="noopener noreffer">YouTube video</a> that shows you how to install and run it.</p>
</div>
        </div>
    </div>
<p>This post focuses on practical data pipelines with examples from web-scraping real-estate listings, uploading them to S3 with <a href="https://min.io/" target="_blank" rel="noopener noreffer">MinIO</a>, <a href="https://spark.apache.org/" target="_blank" rel="noopener noreffer">Spark</a> and <a href="https://delta.io/" target="_blank" rel="noopener noreffer">Delta Lake</a>, adding some data science magic with <a href="https://jupyter.org/" target="_blank" rel="noopener noreffer">Jupyter Notebooks</a>, ingesting into the data warehouse <a href="https://druid.apache.org/" target="_blank" rel="noopener noreffer">Apache Druid</a>, visualising dashboards with <a href="https://superset.apache.org/" target="_blank" rel="noopener noreffer">Superset</a>, and managing everything with <a href="https://dagster.io/" target="_blank" rel="noopener noreffer">Dagster</a>.</p>
<p>The goal is to touch on common data engineering challenges and to use promising new technologies, tools and frameworks, most of which I wrote about in <a href="/blog/business-intelligence-meets-data-engineering/" rel="">Business Intelligence meets Data Engineering with Emerging Technologies</a>. Everything runs on Kubernetes in a scalable fashion, but also locally with Kubernetes on <a href="https://www.docker.com/products/docker-desktop" target="_blank" rel="noopener noreffer">Docker Desktop</a>.</p>
<p>You can find the source code in <a href="https://github.com/ssp-data/practical-data-engineering" target="_blank" rel="noopener noreffer">practical-data-engineering</a> for the data pipeline or in <a href="https://github.com/ssp-data/data-engineering-devops" target="_blank" rel="noopener noreffer">data-engineering-devops</a> with all the details to set things up. Although not everything is finished, you can follow the current status of the project on <a href="https://github.com/orgs/ssp-data/projects/1" target="_blank" rel="noopener noreffer">real-estate-project</a>.</p>
<h2 id="what-are-we-building-and-why">What are we building, and why?</h2>
<p>A data application that collects real-estate listings, coupled with Google Maps route calculations and potentially other macro- or microeconomic factors such as taxes, city population, schools, and public transportation. Enriched with machine-learning correlations to learn which factors influence the price the most.</p>
<p>Why this project? When I started in 2018, I wanted to find the next best flat to rent, but there was no sophisticated real-estate portal out there in Switzerland. I found this very entertaining post about <a href="https://funnybretzel.com/datamining-a-flat-in-munich/" target="_blank" rel="noopener noreffer">Datamining a Flat in Munich</a> and wanted to code my own. Presently there are tons of services like <a href="https://www.pricehubble.com/en/" target="_blank" rel="noopener noreffer">PriceHubble</a> in Switzerland or <a href="https://www.zillow.com/" target="_blank" rel="noopener noreffer">Zillow</a> in the US, but it is still worthwhile to optimise for finding your dream apartment or house. On top of that, it is a genuine example that includes enough data engineering challenges.</p>
<p>Starting with web scraping gives you the power to treat every website as a database. We download the latest properties, or those that have changed, with a change data capture (CDC) mechanism, zipping and uploading them to S3. With a Delta Lake table, we merge new changes with <a href="https://databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html" target="_blank" rel="noopener noreffer">schema evolution</a>. From there we add some machine learning and data science magic with Jupyter notebooks, ingest the data into the Druid data warehouse, and present it in a business-intelligence dashboard. All of it is tied together with a well-suited orchestrator. And of course, everything runs cloud-agnostic anywhere, with <a href="https://kubernetes.io/" target="_blank" rel="noopener noreffer">Kubernetes</a>. You can see a gist of how the pipeline looks as of today below.</p>
<figure>
<a target="_blank" href="/blog/data-engineering-project-in-twenty-minutes/images/Dagster-Practical-Data-Engineering-Pipeline.png" title="/blog/data-engineering-project-in-twenty-minutes/images/Dagster-Practical-Data-Engineering-Pipeline.png">

</a><figcaption class="image-caption">Dagster UI – Practical Data Engineering Pipeline</figcaption>
</figure>
<h2 id="u-what-will-you-learn">&#x1f4a1; What will you learn?</h2>
<p>Below I noted the key learnings, integrated into a full-fledged data engineering project, to illustrate the &ldquo;how&rdquo; in the most hands-on way. Hopefully, you&rsquo;ll find something interesting for you!</p>
<ul>
<li><strong>Scraping with Beautiful Soup</strong>: How to get value out from a website with basic Python skills.</li>
<li><strong>Change Data Capture (CDC) with Scraping</strong>: Using a fingerprint to verify against the data lake if a property needs to be downloaded or not.</li>
<li><strong>How to use an S3-Gateway / Object Storage</strong>: Placing an S3 API in front of your object storage in the so-called “gateway-mode” to stay cloud-agnostic. This allows you to change the object-store from Amazon S3 to Azure blob storage or Google cloud storage with ease.</li>
<li><strong>UPSERTs and ACID Transactions</strong>: Besides schema evolution mentioned above, Delta Lake also provides merge, update and delete directly on your distributed files.</li>
<li><strong>Automatic Schema Evolution</strong>: With the growing popularity of data lakes and <a href="/blog/data-warehouse-vs-data-lake-etl-vs-elt/#ETL_vs_ELT" rel="">ELT</a>, data engineers are left with lots of data but no schemas. To integrate schema and especially schema changes, automatic schema evolution is important.</li>
<li><strong>Integrating Jupyter Notebooks - the right way</strong>: Notebooks hold important data transformations, calculations or machine learning models yet it&rsquo;s always hard to copy the living code in your data pipelines. We will integrate notebooks as a step of our pipeline with Dagster.</li>
<li><strong>Learning about Apache Druid</strong>: Druid is one of the fastest data warehouse / OLAP solutions. It&rsquo;s optimized for fast real-time ingestion and immutable data. On the downside, it&rsquo;s hard to set up; luckily, below you&rsquo;ll see how to do exactly that.</li>
<li><strong>Open-Source dashboarding with Apache Superset</strong>: How to use Superset with its many out-of-box connections. On top, it&rsquo;s free of charge compared to Tableau, Looker and others.</li>
<li><strong>DevOps with Kubernetes</strong>: How to run Kubernetes locally and install all of the tools here. If you haven&rsquo;t used Kubernetes, don&rsquo;t worry, examples and local set-up with Kubernetes for Docker are included.</li>
<li><strong>Introduction to features of Dagster</strong>: Showing how all of the data engineering parts can be tied together with one open-source tool called Dagster (an <a href="https://qr.ae/pNrIPi" target="_blank" rel="noopener noreffer">alternative to Airflow</a>); a minimal sketch of a Dagster job follows right after this list.</li>
<li>And many more which I won&rsquo;t mention but you&rsquo;ll hopefully see along the way.</li>
</ul>
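<p>As a taste of how Dagster ties the steps together, here is a minimal sketch using the current op/job API (the original project used solids and pipelines; the op names and return values here are illustrative only):</p>
<pre><code class="language-python">from dagster import job, op

@op
def scrape_property_ids():
    # In the real project this scrapes the search result pages.
    return ["6331937", "6330580"]

@op
def upload_to_s3(ids):
    # In the real project this zips the scraped JSON and uploads it to MinIO.
    return f"s3a://real-estate/{len(ids)}-properties.zip"

@job
def real_estate_pipeline():
    upload_to_s3(scrape_property_ids())
</code></pre>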
<h2 id="hands-on-with-tech-tools-and-frameworks">Hands-On with Tech, Tools and Frameworks</h2>
<p>In an earlier post about <a href="/blog/open-source-data-warehousing-druid-airflow-superset/" rel="">Open-Source Data Warehousing</a>, I focused explicitly on Apache Druid, Airflow and Superset. This post is all about using data engineering in a practical example. To give you an overview of what we use, I mapped the tech, tools and frameworks from that blog post onto the newer Databricks <a href="http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf" target="_blank" rel="noopener noreffer">Lakehouse Paradigm</a>.</p>
<figure>
<a target="_blank" href="/blog/data-engineering-project-in-twenty-minutes/images/lakehouse-open-sourced.png" title="/blog/data-engineering-project-in-twenty-minutes/images/lakehouse-open-sourced.png">

</a><figcaption class="image-caption">Databricks Lakehouse Paradigm with used Open-Source Technologies added</figcaption>
</figure>
<p>Below you&rsquo;ll find different chapters for different topics. I included at least one practical example with some hands-on code, but kept it minimalistic as the source code is all open in the above-mentioned <a href="http://code.sspaeti.com" target="_blank" rel="noopener noreffer">repositories</a>. Some chapters include extra information or architectural reasoning on why I believe a certain tool or method is especially suited for the use case. But let&rsquo;s get started with scraping data, implemented in Python.</p>
<h3 id="getting-the-data--scraping">Getting the Data – Scraping</h3>
<div class="details admonition warning open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-warning"></i>Disclaimer<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">Everything shown here is demonstrated for learning purposes only. Before you begin, make sure you don’t violate the copyright of any website and always <a href="https://www.zyte.com/learn/web-scraping-best-practices" target="_blank" rel="noopener noreffer">be friendly</a> when scraping.</div>
        </div>
    </div>
<p>The internet has an infinite amount of information; that&rsquo;s why scraping is valuable to know, even though it is less known among data engineers. As a first step, we get the properties from a real-estate portal. In my case, I chose a Swiss portal, but you can choose any from your country. There are two main Python libraries to achieve this, <a href="https://scrapy.org/" target="_blank" rel="noopener">Scrapy</a> and <a href="https://www.crummy.com/software/BeautifulSoup/" target="_blank" rel="noopener">BeautifulSoup</a>. I used the latter for its simplicity.</p>
<p class="graf graf--p">
  My initial goal was to scrape some properties from the web-page by determining how many search page result I get and scrape through each property from each page. While testing around with BeautifulSoup and <a class="markup--anchor markup--p-anchor" href="https://ipython.readthedocs.io/en/stable/" target="_blank" rel="noopener" data-href="https://ipython.readthedocs.io/en/stable/">IPython</a> — IPython is an excellent way to initially test your code — and asking my way through StackOverflow. I found that certain websites provide open APIs which you find in the documentation of their website or with the interactive developer tools (F12) explained below. This will save you from scraping everything manually and therefore also producing less traffic on the site of the provider.
</p>
<div class="details admonition info">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-info"></i>How to check open APIs<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content"><p>If you want to check if another website has an open API, you can search for an HTTP request by simply clicking F12 and switching to the network tab to check requests that your browser send when clicking on a property.</p>
<figure><a target="_blank" href="images/web-scraping-api_anomised.png" title="/blog/data-engineering-project-in-twenty-minutes/images/web-scraping-api_anomised.png" >
        
    </a><figcaption class="image-caption">An example with Webbrowser Brave (Chrome like)</figcaption>
    </figure></div>
        </div>
    </div>
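<p>As a rough illustration, fetching such an open endpoint with <code>requests</code> could look like the following; the URL and field names are purely hypothetical, so check the portal&rsquo;s documentation and terms first:</p>
<pre><code class="language-python">import requests

# Hypothetical JSON endpoint discovered via the browser's network tab.
url = "https://www.example-portal.ch/api/properties/6331937"
resp = requests.get(url, timeout=10)
resp.raise_for_status()

prop = resp.json()
# Field names are illustrative; inspect the real response to find the right keys.
print(prop.get("id"), prop.get("price"))
</code></pre>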
<p>To get started with web-scraping, it helps when you know some <a href="https://www.w3schools.com/html/html_basic.asp" target="_blank" rel="noopener">basic HTML</a>. To get an overview of the site you would like to scrape your data from, use the interactive developer tools mentioned above. You can then inspect in which <code>&lt;table&gt;</code>, <code>&lt;div&gt;</code>, <code>id</code>, <code>class</code> or <code>href</code> your information is found. Most websites with valuable data have ever-changing IDs or classes, which makes it a bit harder to just grab the specific data you need.</p>
<p>Let&rsquo;s say we want to buy a house in Bern, the capital of Switzerland. We can, for example, use this URL with this search term: <a href="https://www.immoscout24.ch/en/house/buy/city-bern?r=7&amp;map=1" target="_blank" rel="noopener noreffer">https://www.immoscout24.ch/en/house/buy/city-bern?r=7&amp;map=1</a>. Here, r is the radius around Bern and map=1 means we only want properties with a price tag. As mentioned, we need to find out how many pages of results we have; this information sits at the very bottom of the page. A hacky approach that worked for me: search all buttons on the page and keep the ones whose text is at most three characters long and not empty; the last of those is the highest page number, which is two as of today. An example of code to scrape how many pages of search results we have:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">requests</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">url</span> <span class="o">=</span> <span class="s1">&#39;https://www.immoscout24.ch/en/house/buy/city-bern?r=7&amp;map=1&#39;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">html</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="s2">&#34;html.parser&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">buttons</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">findAll</span><span class="p">(</span><span class="s1">&#39;button&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">p</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">buttons</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">item</span><span class="o">.</span><span class="n">text</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="mi">3</span> <span class="o">&amp;</span> <span class="nb">len</span><span class="p">(</span><span class="n">item</span><span class="o">.</span><span class="n">text</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="n">item</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">p</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">item</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">p</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">lastPage</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">pop</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">lastPage</span> <span class="o">=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">lastPage</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">## -- Output --</span>
</span></span><span class="line"><span class="cl"><span class="o">&lt;</span><span class="n">button</span> <span class="n">aria</span><span class="o">-</span><span class="n">disabled</span><span class="o">=</span><span class="s2">&#34;true&#34;</span> <span class="n">aria</span><span class="o">-</span><span class="n">pressed</span><span class="o">=</span><span class="s2">&#34;true&#34;</span> <span class="n">class</span><span class="o">=</span><span class="s2">&#34;bkivry-0 as6woy-0 c2ol4x-0 hXMMbP&#34;</span> <span class="n">disabled</span><span class="o">=</span><span class="s2">&#34;&#34;</span> <span class="nb">type</span><span class="o">=</span><span class="s2">&#34;button&#34;</span><span class="o">&gt;</span><span class="mi">1</span><span class="o">&lt;/</span><span class="n">button</span><span class="o">&gt;</span>
</span></span><span class="line"><span class="cl"><span class="o">&lt;</span><span class="n">button</span> <span class="n">aria</span><span class="o">-</span><span class="n">disabled</span><span class="o">=</span><span class="s2">&#34;true&#34;</span> <span class="n">class</span><span class="o">=</span><span class="s2">&#34;bkivry-0 as6woy-0 c2ol4x-0 hXMMbP&#34;</span> <span class="n">disabled</span><span class="o">=</span><span class="s2">&#34;&#34;</span> <span class="nb">type</span><span class="o">=</span><span class="s2">&#34;button&#34;</span><span class="o">&gt;</span><span class="mi">2</span><span class="o">&lt;/</span><span class="n">button</span><span class="o">&gt;</span>
</span></span><span class="line"><span class="cl"><span class="mi">2</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>To get a list of property IDs, I assembled a search link for each search and grabbed the links that started with &ldquo;/en/d&rdquo; and had a number in them. Some sample code below:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">re</span>
</span></span><span class="line"><span class="cl"><span class="n">url</span> <span class="o">=</span> <span class="s1">&#39;https://www.immoscout24.ch/en/house/buy/city-bern?pn=1&amp;r=7&amp;se=16&amp;map=1&#39;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">ids</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl"><span class="n">html</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="s2">&#34;html.parser&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">links</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">findAll</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="n">href</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">hrefs</span> <span class="o">=</span> <span class="p">[</span><span class="n">item</span><span class="p">[</span><span class="s1">&#39;href&#39;</span><span class="p">]</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">links</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="n">hrefs_filtered</span> <span class="o">=</span> <span class="p">[</span><span class="n">href</span> <span class="k">for</span> <span class="n">href</span> <span class="ow">in</span> <span class="n">hrefs</span> <span class="k">if</span> <span class="n">href</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">&#39;/en/d&#39;</span><span class="p">)]</span>
</span></span><span class="line"><span class="cl"><span class="n">ids</span> <span class="o">+=</span> <span class="p">[</span><span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">&#39;\d+&#39;</span><span class="p">,</span> <span class="n">item</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">hrefs_filtered</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">ids</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">## -- Output --</span>
</span></span><span class="line"><span class="cl"><span class="p">[</span><span class="s1">&#39;6331937&#39;</span><span class="p">,</span> <span class="s1">&#39;6330580&#39;</span><span class="p">,</span> <span class="s1">&#39;6329423&#39;</span><span class="p">,</span> <span class="s1">&#39;6298722&#39;</span><span class="p">,</span> <span class="s1">&#39;6261621&#39;</span><span class="p">,</span> <span class="s1">&#39;6311343&#39;</span><span class="p">,</span> <span class="s1">&#39;6318070&#39;</span><span class="p">,</span> <span class="s1">&#39;6313553&#39;</span><span class="p">,</span> <span class="s1">&#39;6317089&#39;</span><span class="p">,</span> <span class="s1">&#39;6306531&#39;</span><span class="p">,</span> <span class="s1">&#39;6305793&#39;</span><span class="p">,</span> <span class="s1">&#39;6296041&#39;</span><span class="p">,</span> <span class="s1">&#39;6294327&#39;</span><span class="p">,</span> <span class="s1">&#39;6284892&#39;</span><span class="p">,</span> <span class="s1">&#39;6283242&#39;</span><span class="p">,</span> <span class="s1">&#39;6282624&#39;</span><span class="p">,</span> <span class="s1">&#39;6274328&#39;</span><span class="p">,</span> <span class="s1">&#39;6251376&#39;</span><span class="p">,</span> <span class="s1">&#39;6237199&#39;</span><span class="p">,</span> <span class="s1">&#39;6237144&#39;</span><span class="p">,</span> <span class="s1">&#39;6231495&#39;</span><span class="p">,</span> <span class="s1">&#39;6224144&#39;</span><span class="p">,</span> <span class="s1">&#39;6223578&#39;</span><span class="p">,</span> <span class="s1">&#39;6209944&#39;</span><span class="p">]</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>You can find the complete code above on GitHub in <a href="https://github.com/ssp-data/practical-data-engineering/blob/v1/src/pipelines/real-estate/realestate/common/solids_scraping.py">solids_scraping.py</a>, in the functions <code>list_props_immo24</code> and <code>cache_properies_from_rest_api</code>.</p>
<h3 id="storing-on-s3-minio">Storing on S3-MinIO</h3>
<p>With object storage, you get one single API without being locked into a cloud vendor, and you can always access the same URL/API within your applications or pipelines. I use <a href="http://min.io">MinIO</a>, but there are several <a href="https://en.wikipedia.org/wiki/Amazon_S3#Notable_users">others</a>. They normally run on Kubernetes, are open-source and may also boost performance. Plus, if you don't have access to your own S3, e.g. locally or on your own servers, you can simply create one with three lines of code:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">wget https://dl.min.io/server/minio/release/linux-amd64/minio
</span></span><span class="line"><span class="cl">chmod +x minio
</span></span><span class="line"><span class="cl">./minio server /data
</span></span><span class="line"><span class="cl"><span class="c1"># — Output —</span>
</span></span><span class="line"><span class="cl">Endpoint: http://192.168.2.128:9000 http://127.0.0.1:9000
</span></span><span class="line"><span class="cl">AccessKey: your-key
</span></span><span class="line"><span class="cl">SecretKey: your-secret
</span></span><span class="line"><span class="cl">Browser Access:
</span></span><span class="line"><span class="cl">http://192.168.2.128:9000 http://127.0.0.1:9000
</span></span></code></pre></td></tr></table>
</div>
</div><p>You can access the endpoint <code>127.0.0.1:9000</code> programmatically with its key/secret. On top, you get a full-blown UI, as you can see below.</p>
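<p>Accessing it programmatically is just the standard S3 client pointed at the local endpoint; a minimal sketch with boto3, using the placeholder credentials printed above (bucket and file names are made up):</p>
<pre><code class="language-python">import boto3

# Point the regular S3 client at the local MinIO endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",
    aws_access_key_id="your-key",
    aws_secret_access_key="your-secret",
)

s3.create_bucket(Bucket="real-estate")
s3.upload_file("properties.zip", "real-estate", "raw/properties.zip")
</code></pre>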
<figure>
<a target="_blank" href="/blog/data-engineering-project-in-twenty-minutes/images/2021-01-14_22-14-24.png" title="/blog/data-engineering-project-in-twenty-minutes/images/2021-01-14_22-14-24.png">

</a><figcaption class="image-caption">Local Minio UI</figcaption>
</figure>
<h3 id="change-data-capture-cdc">Change Data Capture (CDC)</h3>
  <p>
    CDC is a powerful tool, especially in cloud environments with event-driven architectures. I used it to minimize the downloads of already-downloaded properties. Besides existing open-source CDC solutions like <a href="https://debezium.io/">Debezium</a>, I implemented my own simple logic to detect changes, also because I have no access to the source <a href="https://en.wikipedia.org/wiki/Online_transaction_processing">OLTP</a> database where the properties are stored, which you'd need for those tools.
  </p>
  <p>
    I accomplish the CDC by creating two functions. The first one lists all properties to certain search criteria and the second one compares these properties with existing once. How am I doing that? Primarily, I create a <a href="https://en.wikipedia.org/wiki/Fingerprint_(computing)">fingerprint</a> from each property that will tell me if the one is new or already exstinging. You might ask why I'm not using the unique property-ID? The reasons are I didn't just want to check if I have the property or not. As mentioned in the intro I also wanted to check if the seller lowered the price over time to be able to notify when the seller can't get rid of his house or flat. My fingerprint combines the property-ID and the selling price (called <code>normalized_price</code>in my data). One more benefit if more columns getting relevant, I could just add them to the fingerprint and my CDC mechanism would be extended without changing any other code.
  </p>
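<p>To make the fingerprint idea concrete, here is a tiny sketch of how such a fingerprint could be computed. The helper is my own illustration and not the exact code from the repository; only the column names match the data described here:</p>
<pre><code class="language-python"># Illustrative sketch: hash the columns that should trigger an update when
# they change. Adding a column to the list extends the CDC mechanism without
# touching the comparison logic.
import hashlib

def fingerprint(row, columns=("propertyDetails_id", "normalized_price")):
    raw = "|".join(str(row.get(col)) for col in columns)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# The same property with a lowered price yields a different fingerprint
old = {"propertyDetails_id": 123, "normalized_price": 950000}
new = {"propertyDetails_id": 123, "normalized_price": 890000}
print(fingerprint(old) != fingerprint(new))  # True, so treat it as changed
</code></pre>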
  <p>
    To have the relevant selling price for each property-ID, I scrape both the prices and the IDs from the website. You can check that code in <a href="https://github.com/ssp-data/practical-data-engineering/blob/v1/src/pipelines/real-estate/realestate/common/solids_scraping.py#L121">solid_scraping.py</a>. The function is called <code>list_props_immo24</code> and returns all properties for my search criteria as a data frame.
  </p>
  <p>
    The logic for CDC happens in <code>get_changed_or_new_properties</code> in <a href="https://github.com/ssp-data/practical-data-engineering/blob/v1/src/pipelines/real-estate/realestate/common/solids_spark_delta.py">solids_spark_delta.py</a>, where I compare the existing ones in my Delta table with the new ones coming from the list function above. As Delta Lake supports a SQL API, I can use plain SQL to compare the two with this simple SELECT statement:
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">p</span><span class="p">.</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">.</span><span class="n">fingerprint</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">.</span><span class="n">is_prefix</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">.</span><span class="n">rentOrBuy</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">.</span><span class="n">city</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">.</span><span class="n">propertyType</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">.</span><span class="n">radius</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">.</span><span class="n">last_normalized_price</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">FROM</span><span class="w"> </span><span class="n">pd_properties</span><span class="w"> </span><span class="n">p</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">LEFT</span><span class="w"> </span><span class="k">OUTER</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">pd_existing_props</span><span class="w"> </span><span class="n">e</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">ON</span><span class="w"> </span><span class="n">p</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="n">propertyDetails_id</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">p</span><span class="p">.</span><span class="n">fingerprint</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="n">fingerprint</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">OR</span><span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="n">fingerprint</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><h3 id="adding-database-features-to-s3--delta-lake--spark">Adding Database features to S3 – Delta Lake &amp; Spark</h3>
  <p>
    <strong>To get database-like features on top of your S3 files, you simply need to create a <a href="https://delta.io/">Delta Lake</a> table</strong>. For example, adding a dynamic schema without breaking ingestion into the data lake or downstream data pipelines is quite a challenge. Delta does that and automatically adds new columns incrementally in an <a href="https://docs.databricks.com/delta/concurrency-control.html#optimistic-concurrency-control">optimistic concurrent way</a>. As data in a data lake mostly lives in distributed files, this is quite hard if you were to do it yourself. But as Delta already enforces the schema and stores this information in the <a href="https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html">transaction log</a>, it makes sense to handle this with Delta. In my data sets with 60+ dynamic and changing columns, I made extensive use of this feature.
  </p>
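<p>As a rough sketch of what that looks like on write (assuming a SparkSession with Delta Lake configured; the sample columns are mine and the path matches the example below), schema evolution is switched on with the <code>mergeSchema</code> option:</p>
<pre><code class="language-python"># Sketch: append a DataFrame that carries new columns. With mergeSchema
# enabled, Delta evolves the table schema instead of failing the write.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
new_properties_df = spark.createDataFrame(
    [(123, 950000, "Basel")],
    ["propertyDetails_id", "normalized_price", "city"],  # sample columns
)

(
    new_properties_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/delta-table")
)
</code></pre>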
  <p>
    How do you create or read a Delta table then? It can easily be done by providing the format <code>delta</code> as opposed to <code>parquet</code> or other formats you already know:
  </p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1">#create delta table</span>
</span></span><span class="line"><span class="cl"><span class="n">data</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">data</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">&#34;delta&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s2">&#34;/tmp/delta-table&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1">#reading it</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">&#34;delta&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">&#34;/tmp/delta-table&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div>  <p>
    Another feature of Delta is its automatic snapshotting mechanism with <a href="https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html">time travel</a>, which lets you check older versions of a table. This can be very handy for dimension tables to track history, e.g. addresses, and lets you skip a rather complex <a href="https://www.kimballgroup.com/2008/09/slowly-changing-dimensions-part-2/">SCD2</a> logic. Just make sure to set your <a href="https://docs.delta.io/latest/delta-utility.html#vacuum">retention threshold</a> high enough before you use <a href="https://docs.delta.io/latest/delta-utility.html#remove-files-no-longer-referenced-by-a-delta-table">vacuum</a> (deletion of older data).
  </p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1">#Read older versions of data using time travel</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">&#34;delta&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">&#34;versionAsOf&#34;</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">&#34;/tmp/delta-table&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div>  <p>
    Also handy: it does not matter whether you're reading from a stream or a batch; Delta supports both in a single API and as a target sink. This is well explained in <a href="https://youtu.be/FePv0lro0z8">Beyond Lambda: Introducing Delta Architecture</a> or with some <a href="https://docs.delta.io/latest/delta-streaming.html#delta-table-as-a-sink&language-python">code examples</a>. The commonly used MERGE statement in SQL can be applied to your distributed files with Delta as well, including schema evolution and ACID transactions:
  </p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">--A simple example:
</span></span></span><span class="line"><span class="cl"><span class="n">MERGE</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">USING</span><span class="w"> </span><span class="n">updates</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">   </span><span class="k">ON</span><span class="w"> </span><span class="n">events</span><span class="p">.</span><span class="n">eventId</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">updates</span><span class="p">.</span><span class="n">eventId</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="k">WHEN</span><span class="w"> </span><span class="n">MATCHED</span><span class="w"> </span><span class="k">THEN</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	  </span><span class="k">UPDATE</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">events</span><span class="p">.</span><span class="k">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">updates</span><span class="p">.</span><span class="k">data</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="k">WHEN</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="n">MATCHED</span><span class="w"> </span><span class="k">THEN</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"> 	  </span><span class="k">INSERT</span><span class="w"> </span><span class="p">(</span><span class="nb">date</span><span class="p">,</span><span class="w"> </span><span class="n">eventId</span><span class="p">,</span><span class="w"> </span><span class="k">data</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="nb">date</span><span class="p">,</span><span class="w"> </span><span class="n">eventId</span><span class="p">,</span><span class="w"> </span><span class="k">data</span><span class="p">)</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div>  <p>
    Further motivation why I'm using Delta for my project:
  </p>
  <ul>
    <li>
      using SQL on top of my distributed files
    </li>
    <li>
      simply merge my new properties into my data lake, no need to manually identify data changes
    </li>
    <li>
      working with JSONs where each one has a totally different schema, I don't need to worry thanks to schema evolution
    </li>
    <li>
      I get a full-blown transaction log to see what went on
    </li>
    <li>
      everything is well compressed and stored in columnar format, ready for analytical queries, as open-source <a href="https://parquet.apache.org/">Apache Parquet</a> files
    </li>
    <li>
      I have rich APIs in different languages with Scala, Java, Python and SQL
    </li>
    <li>
      with deletes integrated I'm prepared for <a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation">GDPR</a> requirements
    </li>
    <li>
      I can always travel back in time to see how the selling price of my properties has risen over time
    </li>
    <li>
      no need to worry about size and speed as everything is scalable with Spark, even the metadata.
    </li>
    <li>
      future-proof with a unified batch and streaming source and sink (see the sketch after this list): no need for a <a href="https://en.wikipedia.org/wiki/Lambda_architecture">lambda architecture</a> where batch and streaming are handled separately
    </li>
    <li>
      everything is open-source, the data format with Apache Parquet and <a href="https://github.com/delta-io/delta">Delta Lake</a> itself
    </li>
  </ul>
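<p>And since the unified batch and streaming API came up above, here is a minimal sketch of reading a Delta table as a stream and writing to a Delta sink, again assuming a SparkSession with Delta Lake configured; the paths and checkpoint location are placeholders:</p>
<pre><code class="language-python"># Sketch: the same Delta table can act as a streaming source and sink.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read every new version of the source table as a stream ...
stream_df = spark.readStream.format("delta").load("/tmp/delta-table")

# ... and continuously append it to another Delta table.
query = (
    stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/properties")  # placeholder
    .start("/tmp/delta-table-copy")  # placeholder target
)
</code></pre>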
  <p>
    To add to the popularity of SQL, I added a generic Dagster solid that passes any SQL statement along and uses Spark to run it on top of my Delta Lake tables. The solid is called <code>_sql_solid</code> (originally coming from the Dagster <a href="https://docs.dagster.io/examples/airline_demo">airline-demo</a>). The fully integrated example is in <a href="https://github.com/ssp-data/practical-data-engineering/blob/v1/src/pipelines/real-estate/realestate/common/solids_spark_delta.py">solids_spark_delta.py</a>; below is an extract of how I pass the merge along within Dagster.
  </p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">merge_property_delta</span> <span class="o">=</span> <span class="n">sql_solid</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">name</span><span class="o">=</span><span class="s2">&#34;merge_property_delta&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">sql_statement</span><span class="o">=</span><span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">    MERGE INTO {{ target_delta_table }} trg
</span></span></span><span class="line"><span class="cl"><span class="s2">    USING input_dataframe AS src
</span></span></span><span class="line"><span class="cl"><span class="s2">       ON trg.propertyDetails_id = src.propertyDetails_id
</span></span></span><span class="line"><span class="cl"><span class="s2">     WHEN MATCHED THEN
</span></span></span><span class="line"><span class="cl"><span class="s2">          UPDATE SET *
</span></span></span><span class="line"><span class="cl"><span class="s2">     WHEN NOT MATCHED THEN
</span></span></span><span class="line"><span class="cl"><span class="s2">          INSERT *
</span></span></span><span class="line"><span class="cl"><span class="s2">    &#34;&#34;&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">materialization_strategy</span><span class="o">=</span><span class="s2">&#34;delta_table&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">input_defs</span><span class="o">=</span><span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="n">InputDefinition</span><span class="p">(</span><span class="s2">&#34;target_delta_table&#34;</span><span class="p">,</span> <span class="n">DeltaCoordinate</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">InputDefinition</span><span class="p">(</span><span class="s2">&#34;input_dataframe&#34;</span><span class="p">,</span> <span class="n">DataFrame</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="p">],</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="machine-learning-part--jupyter-notebook">Machine Learning part – Jupyter Notebook</h3>
  <p>
    I'm not a data scientist; still, I wanted to have some insights and fun with my data as well. That's why I initially copied one or two notebooks from <a href="https://www.kaggle.com/">Kaggle</a> to play around with. In my project, I wanted to integrate them into my pipeline.
  </p>
<figure>
<a target="_blank" href="/blog/data-engineering-project-in-twenty-minutes/images/scatterplot_fun.png" title="/blog/data-engineering-project-in-twenty-minutes/images/scatterplot_fun.png">

</a><figcaption class="image-caption">Scatterplots from different attributes of real-estates associated with the selling Price</figcaption>
</figure>
  <p>
    Why bother with Jupyter notebooks? Because you most probably have skilled people who create advanced notebooks with real insights from your data. But unfortunately, these notebooks have to be run manually and are not integrated into the data pipelines. There are two options in my opinion. Either you test and approve the notebooks and integrate them into your pipeline, which basically means copying the Python code over into your pipelines. That is obviously a lot of work, and it does not support changes from the data scientists in the notebooks, as these would need to be copied over again. So what else could we do?
  </p>
  <p>
    Good thing there is <a href="https://github.com/nteract/papermill">Papermill</a>, which lets you run Jupyter notebooks directly. And even better, Dagster integrated Papermill into <a href="https://docs.dagster.io/overview/packages/dagstermill">dagstermill</a>, which lets you place a notebook as part of your existing data pipeline. On top of that, you have visibility within Dagster's UI, which lets you open the notebook directly. You can also interact with the input and output of the notebook or use the output further downstream in your pipeline.
  </p>
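<p>As a rough sketch of what that integration looks like in the Dagster 0.x API used throughout this post (the solid name and notebook path are placeholders, not the repo code):</p>
<pre><code class="language-python"># Sketch: wrap a Jupyter notebook as a Dagster solid with dagstermill,
# so it runs like any other step of the pipeline and shows up in Dagit.
from dagstermill import define_dagstermill_solid

data_exploration = define_dagstermill_solid(
    name="data_exploration",
    notebook_path="notebooks/data_exploration.ipynb",  # placeholder path
)
</code></pre>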


<div class="x-embed-wrapper" style="display: flex; justify-content: center; margin: 1.5em 0;">
          <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">16/ By integrating it into Dagster, it is accessible and understandable with our tools: <a href="https://t.co/qUuGFDDktZ">pic.twitter.com/qUuGFDDktZ</a></p>&mdash; Nick Schrock (@schrockn) <a href="https://twitter.com/schrockn/status/1293240737027375105?ref_src=twsrc%5Etfw">August 11, 2020</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>


        </div>
  <p>
    My part of the integration can be found in <code>data_exploration</code> in <a href="https://github.com/ssp-data/practical-data-engineering/blob/v1/src/pipelines/real-estate/realestate/common/solids_jupyter.py">solid_jupyter.py</a>.
  </p>
<h3 id="ingesting-data-warehouse-for-low-latency--apache-druid">Ingesting Data Warehouse for low latency – Apache Druid</h3>
  <p>
    Most business intelligence solutions include a fast, responsive <a href="/blog/olap-whats-coming-next/#What_is_OLAP">OLAP</a> layer, often implemented with cubes. For example, in Microsoft SQL Server you have <a href="https://en.wikipedia.org/wiki/Microsoft_Analysis_Services">Analysis Services</a>. But what should you use if you want an open-source product that can handle big data with no problems? One excellent choice is <a href="https://druid.apache.org/">Apache Druid</a>, but if you want more details or want to find other options, check out my blog post about <a href="/blog/olap-whats-coming-next/">OLAP, and what's next</a>.
  </p>
  <p>
    Druid is a beast to set up, but luckily, in my <a href="https://github.com/ssp-data/data-engineering-devops/tree/main/src/druid">data-engineering-devops</a> infrastructure project you can find how to set it up on Kubernetes, or locally on your laptop with <a href="https://www.docker.com/products/docker-desktop">Docker Desktop</a>, which provides native Kubernetes. Also, check out the original <a href="https://github.com/helm/charts/tree/master/incubator/druid">helm chart</a> from Druid.
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1">#create namespace</span>
</span></span><span class="line"><span class="cl">kubectl create namespace druid
</span></span><span class="line"><span class="cl">kubectl get namespaces
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">#PersistentVolumes (pv) and PersistentVolumeClaims (pvc)</span>
</span></span><span class="line"><span class="cl"><span class="c1">#get context:</span>
</span></span><span class="line"><span class="cl">kubectl config current-context
</span></span><span class="line"><span class="cl"><span class="c1">#use above context and set namespace to druid:</span>
</span></span><span class="line"><span class="cl">kubectl config set-context docker-desktop --namespace<span class="o">=</span>druid
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">#create persistent volumes</span>
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> <span class="nv">$git</span>/data-engineering-devops/src/druid
</span></span><span class="line"><span class="cl">kubectl apply -f  manifests/base/persistentVolume/volumes.yaml
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">#druid deployment</span>
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> <span class="nv">$git</span>/data-engineering-devops/src/druid
</span></span><span class="line"><span class="cl">kubectl apply -k manifests/overlays/dev/localhost/sspaeti
</span></span><span class="line"><span class="cl">kubectl delete -k manifests/overlays/dev/localhost/sspaeti
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">#Port-forwarding to access druid-UI</span>
</span></span><span class="line"><span class="cl">kubectl get pod
</span></span><span class="line"><span class="cl">kubectl port-forward druid-router-86798c8b4c-vjvxj 8888:8888
</span></span></code></pre></td></tr></table>
</div>
</div>  <p>
    For the project part, I set it up and ingested some properties, but that was more to test the set-up locally. As speed is not a major requirement for me right now, and Druid eats a lot of resources and is hard to run locally, I'm focusing on the Delta Lake as the single source of truth for my queries.
  </p>
<h3 id="the-ui-with-dashboards-and-more--apache-superset">The UI with Dashboards and more – Apache Superset</h3>
<figure>
<a target="_blank" href="/blog/data-engineering-project-in-twenty-minutes/images/apache_superset_scale.png" title="/blog/data-engineering-project-in-twenty-minutes/images/apache_superset_scale.png">

</a><figcaption class="image-caption">Scale data access across any data architecture</figcaption>
</figure>
  <p>
    No project is complete without a nice UI that visualises your insights. For its <a href="https://preset.io/blog/future-of-business-intelligence/">open-source purpose</a> and features, I have been using Apache Superset for some time now. Lately, Superset announced version 1.0, and it is among the <a href="https://gitstar-ranking.com/repositories?page=2">top 200 projects</a> on GitHub. The founder <a href="https://medium.com/@maximebeauchemin">Maxime Beauchemin</a> and his company <a href="https://preset.io/">Preset</a> are building more and more amazing features, one being the ability to <a href="https://preset.io/blog/2020-07-02-hello-world/">create your own plugins</a> easily.
  </p>
  <p>
    Superset can connect to Druid natively, it can query a data lake with Delta Lake tables, and it can handle almost any kind of SQL-based database. The <a href="https://github.com/ssp-data/data-engineering-devops/tree/main/src/superset">dockerfile</a> I use is the original one with <code>pydruid</code> added for querying Druid. Functionalities such as exploring, dashboard views and how to investigate your data are shown below:
  </p>
<figure>
<a target="_blank" href="/blog/data-engineering-project-in-twenty-minutes/images/superset-view.png" title="/blog/data-engineering-project-in-twenty-minutes/images/superset-view.png">

</a><figcaption class="image-caption">Apache Superset Dashboard Functionality</figcaption>
</figure>
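<p>The <code>pydruid</code> package added to the Dockerfile is not only a Superset dependency; you can also use it to query Druid from Python directly. A small sketch using its DB-API, assuming the Druid router is port-forwarded to 8888 as in the kubectl commands above (host, table and query are placeholders):</p>
<pre><code class="language-python"># Sketch: query Druid SQL through pydruid's DB-API (placeholders throughout).
from pydruid.db import connect

conn = connect(host="localhost", port=8888, path="/druid/v2/sql/", scheme="http")
curs = conn.cursor()
curs.execute("SELECT __time, propertyType FROM properties LIMIT 5")  # placeholder table
for row in curs:
    print(row)
</code></pre>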
<h3 id="orchestrating-everything-together--dagster">Orchestrating everything together – Dagster</h3>
  <p>
    Ultimately, the part that glues everything together: the orchestrator. Today there is quite an <a href="https://github.com/pditommaso/awesome-pipeline#pipeline-frameworks--libraries">extended list</a> of orchestrators out there. I tried to highlight the most suitable <a href="https://qr.ae/pNrIPi">alternatives to Apache Airflow</a> and went with <a href="https://dagster.io/">Dagster</a> for the reasons below.
  </p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/ytifPclmaKQ?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

  <p>
    Data pipelines start simple and straightforward, <strong>but often they end up vastly heterogeneous, with various APIs, Spark, cloud data warehouses, and multiple cloud providers</strong>. Above is a real-life example from <a href="https://www.goodeggs.com/">GoodEggs</a> which includes <a href="https://mode.com/">mode</a>, <a href="https://networkx.org/">networkx</a>, <a href="https://www.stitchdata.com/">stitch</a>, SQL, Jupyter notebooks, a Slack connector, <a href="https://cronitor.io/">cronitor</a>, and many more. This is a complex data pipeline, but it is still fairly common to have such a diverse set of technologies.
  </p>
  <p>
    Why am I saying all that? Because this is one place where Dagster shines. <strong>It's built with a high-level abstraction in mind</strong>, not just as an executor. Even more, you can use different executors, e.g. Airflow, Celery, Dask or Dagster itself: no lock-in here. Dagster lets you focus on building your data pipelines. It is made for data flows and for passing data between the <a href="https://docs.dagster.io/overview/solids-pipelines/solids">solids</a> (their name for tasks). The integration of <a href="https://docs.dagster.io/overview/modes-resources-presets/modes-resources">modes</a> lets you switch from dev to test and production with one click and with different resources for each mode. Let's say you don't have a <a href="https://www.snowflake.com/">snowflake-db</a> available locally: you could just mock it or use a simple Postgres for local testing, without changing your data-pipeline code.
  </p>
  <p>
    <strong>You have an elegant way of separating business logic in <a href="https://docs.dagster.io/overview/solids-pipelines/solids">solids</a> from technical code in <a href="https://docs.dagster.io/overview/modes-resources-presets/modes-resources">resources</a></strong>. Resources are written once and are available to all solids. Meaning: your Spark connection, your Snowflake create-table logic, your REST call to a certain service can each be written once in a resource, and every user has it available in every solid.
  </p>
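<p>A minimal sketch of this separation in the Dagster 0.x API used here; the resource, solid and pipeline names are mine, and the local resource is just a mock:</p>
<pre><code class="language-python"># Sketch: business logic lives in a solid, technical code in a resource,
# and modes swap the resource between local testing and production.
from dagster import ModeDefinition, execute_pipeline, pipeline, resource, solid

@resource
def local_warehouse(_):
    return {"executed": []}  # stand-in for a real Snowflake/Postgres client

@resource
def prod_warehouse(_):
    raise NotImplementedError("wire up the real warehouse client here")

@solid(required_resource_keys={"warehouse"})
def merge_properties(context):
    # business logic only talks to the resource abstraction
    context.resources.warehouse["executed"].append("MERGE INTO properties ...")
    context.log.info("merged properties")

@pipeline(
    mode_defs=[
        ModeDefinition(name="local", resource_defs={"warehouse": local_warehouse}),
        ModeDefinition(name="prod", resource_defs={"warehouse": prod_warehouse}),
    ]
)
def property_pipeline():
    merge_properties()

if __name__ == "__main__":
    execute_pipeline(property_pipeline, mode="local")
</code></pre>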
  <p>
    <strong>Dagster provides a beautiful, feature-rich UI called Dagit. It includes state-of-the-art <a href="https://graphql.org/">GraphQL</a> interfaces for fetching status and for starting and stopping pipelines</strong>, and much more. As shown in the machine learning part, it closes the gap to the machine learning team with the integration of Jupyter notebooks. It's all free and <a href="https://github.com/dagster-io/dagster">open-source</a>, and the team is extremely responsive on both <a href="https://dagster-slackin.herokuapp.com/">Slack</a> and <a href="https://github.com/dagster-io/dagster/">GitHub</a>.
  </p>
  <p>
    What about testing? Testing data is very hard and not comparable to software testing: the data, and even the tools and frameworks, are dynamic and can change every output of your transformations, and the size of the data differs between dev, test, and production. Dagster's abstractions support testing profoundly. <a href="https://docs.dagster.io/tutorial/types#dagster-types">Type checks</a> and <a href="https://docs.dagster.io/tutorial/types#expectations">assertions</a> about your data are included. But I'd suggest using the first-class <a href="https://dagster.io/blog/great-expectations-for-dagster">integration</a> of <a href="https://greatexpectations.io/">Great Expectations</a>.
  </p>
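<p>To illustrate how testable these abstractions are, here is a tiny sketch of a solid unit test, again in the 0.x API; the solid and its toy logic are made up for illustration:</p>
<pre><code class="language-python"># Sketch: unit-test a single solid in isolation with execute_solid.
from dagster import execute_solid, solid

@solid
def normalize_price(_, price_chf: float) -> float:
    # toy business logic: round to full thousands
    return round(price_chf, -3)

def test_normalize_price():
    result = execute_solid(normalize_price, input_values={"price_chf": 949950.0})
    assert result.success
    assert result.output_value() == 950000.0

if __name__ == "__main__":
    test_normalize_price()
</code></pre>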


<div class="x-embed-wrapper" style="display: flex; justify-content: center; margin: 1.5em 0;">
          <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Happy to announce @dagsterio&#39;s newest integration with  <a href="https://twitter.com/expectgreatdata?ref_src=twsrc%5Etfw">@expectgreatdata</a>, the open source data quality framework. See here how richly display the test results right in our tools. We deeply integrate with tools and don&#39;t just call them opaquely. Fun to work with <a href="https://twitter.com/AbeGong?ref_src=twsrc%5Etfw">@AbeGong</a> and team! <a href="https://t.co/NKRcUMY1yX">https://t.co/NKRcUMY1yX</a> <a href="https://t.co/tQ6qQ9D45F">pic.twitter.com/tQ6qQ9D45F</a></p>&mdash; Nick Schrock (@schrockn) <a href="https://twitter.com/schrockn/status/1304094805153083392?ref_src=twsrc%5Etfw">September 10, 2020</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>


        </div>
<p>On top of that, <strong>Dagster embraces the <a href="https://en.wikipedia.org/wiki/Functional_programming">functional programming paradigm</a></strong>. By simply writing Dagster pipelines, you are writing <strong>functional solids that are declarative, abstracted, <a href="/blog/business-intelligence-meets-data-engineering/#%E2%80%9CLoad_incremental_and_Idempotency_%E2%80%9D">idempotent</a>, and type-checked to catch errors early</strong>. Dagster also includes simple <a href="https://docs.dagster.io/examples/pipeline_unittesting">unit-testing</a> and handy features to <a href="https://docs.dagster.io/tutorial/testable#testing-solids-and-pipelines">make pipelines and solids testable and maintainable</a>.</p>
<p>All of my examples are implemented with Dagster. Just clone my repo, install Dagster and start Dagit from <a href="https://github.com/ssp-data/practical-data-engineering/tree/v1/src/pipelines/real-estate">src/pipelines/real-estate</a>. I&rsquo;m trying to build an <a href="https://github.com/ssp-data/awesome-dagster">awesome-dagster</a> list with common code blocks such as solids, resources and more to be re-used by everyone. Feel free to contribute if you have nice components to add.</p>
<h3 id="devops-engine--kubernetes">DevOps engine – Kubernetes</h3>
<p>And finally, the engine everything runs on, both locally and <a href="https://looker.com/definitions/cloud-agnostic#:~:text=Cloud%2Dagnostic%20platforms%20are%20environments,different%20features%20and%20price%20structures.">cloud-agnostically</a> in the cloud, is <a href="https://kubernetes.io/">Kubernetes</a>. Quoted from an earlier <a href="/blog/business-intelligence-meets-data-engineering">post</a>, chapter <a href="/blog/business-intelligence-meets-data-engineering/#%E2%80%9CUse_a_containerorchestration_system_%E2%80%9D">Use a container-orchestration system</a>:</p>
<blockquote><strong><a href="https://stackoverflow.blog/2020/05/29/why-kubernetes-getting-so-popular/">Kubernetes</a> has become the de-facto standard</strong> for your cloud-native apps to (auto-) <a href="https://stackoverflow.com/a/11715598/5246670">scale-out</a> and to deploy your open-source zoo fast, cloud-provider-independent. No lock-in here. You could use <a href="https://www.openshift.com/">open-shift</a> or <a href="https://www.okd.io/">OKD</a>. With the latest version, they added the <a href="https://operatorhub.io/">OperatorHub</a> where you can install as of today 182 items with just a few clicks. […] Some more reasons for Kubernetes are the <strong>move from infrastructure as code</strong> towards <strong>infrastructure as data</strong>, specifically as <a href="https://en.wikipedia.org/wiki/YAML">YAML</a>. […] Developers quickly write applications that run across multiple operating environments. Costs can be reduced by scaling down […]</blockquote>
<p>To get hands-on with Kubernetes, you can install <a href="https://www.docker.com/products/docker-desktop">Docker Desktop</a> with Kubernetes included. All of <a href="http://code.sspaeti.com">my examples</a> are built on top of it and run on any cloud as well as locally. For a more sophisticated set-up in terms of Apache Spark, I suggest reading the blog post from <a href="https://www.datamechanics.co/">Data Mechanics</a> about <a href="https://www.datamechanics.co/blog-post/setting-up-managing-monitoring-spark-on-kubernetes">Setting up, Managing &amp; Monitoring Spark on Kubernetes</a>. If you're more of a video person, <a href="https://youtu.be/qcvNZvFZIP4?t=31">An introduction to Apache Spark on Kubernetes</a> covers the same content and adds even more on top.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We have seen that in order to apply hands-on data engineering methodologies to a real-estate project, you need to know a good amount of the latest big data tools and frameworks, as well as data architecture, to assess how these fit together and can be utilised for specific use cases. I hope I could give you some inspiration and ways to create your own data engineering project: from scraping the web to storing the data in an S3 object store, adding database features on top of it, using machine learning capabilities with Jupyter notebooks, ingesting it into a data warehouse, visualising the data with a nice dashboard, connecting everything together with an orchestrator, and running it cloud-agnostically.</p>
<p>If you want to test your knowledge, try the <a href="https://pixelastic.github.io/pokemonorbigdata/">Pokemon or Big Data</a> quiz; you will see it&rsquo;s not that easy 😉. If you&rsquo;d like more <a href="/brain/open-source-data-engineering-projects/" rel="">Open-Source Data Engineering Projects</a>, I curate a list that I constantly update.</p>
<p>That’s it for now. If you like the content and want to follow along, make sure you subscribe to my <a href="https://subscribe.ssp.sh/">newsletter</a>, check my <a href="http://code.sspaeti.com">code</a> on GitHub or visit me on <a href="https://www.linkedin.com/in/sspaeti/">LinkedIn</a>, or <a href="https://twitter.com/sspaeti/">Twitter</a> for genuine news about the data ecosystem.</p>
<hr>
<pre class=""><em>Republished on <a href="https://sspaeti.medium.com/building-a-data-engineering-project-in-20-minutes-85c37cad4d87">Medium</a>.</em></pre>
]]></description>
</item>
<item>
    <title>A Diary of a Data Engineer</title>
    <link>https://www.ssp.sh/blog/diary-of-a-data-engineer/</link>
    <pubDate>Tue, 13 Jan 2026 10:36:39 &#43;0100</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/diary-of-a-data-engineer/</guid><enclosure url="https://www.ssp.sh/blog/diary-of-a-data-engineer/featured-image.jpg" type="image/jpeg" length="0" /><description><![CDATA[<p>You ingest data. You model it. You transform it. You serve it. Someone asks for a change. Everything breaks. You rebuild. This is the loop. It was the loop in 2005 with SSIS and star schemas. It&rsquo;s the loop in 2025 with dbt and Iceberg, or 2026 with prompting AI agents.</p>
<p>The tools change. The loop doesn&rsquo;t.</p>
<h2 id="the-invisible-plumbers">The Invisible Plumbers</h2>
<p>When I started my career in 2003, there was no &ldquo;data engineering&rdquo;. There was no big data, no data science. We called it Business Intelligence. Data Warehouse Developer. ETL Developer.</p>
<p>We were the plumbers of the organization. And like plumbers, nobody noticed us until something broke.</p>
<p>Being a data engineer means: you&rsquo;re building the foundation that everyone stands on, but when the presentation goes well, the data scientist, the app developer, anyone who presents gets the applause. When the executive makes the right decision, the analyst gets the credit. When the dashboard loads in 1 second instead of 20, nobody says anything at all.</p>
<p>But when one number is wrong? When a pipeline is 10 minutes late? When someone asks for &ldquo;a small change&rdquo; and you explain it&rsquo;ll take a day, or a week to fix it?</p>
<p>That&rsquo;s when everyone notices you. And shares their opinion on how to make it better.</p>
<p>&ldquo;Why does this take so long? It&rsquo;s just one column. Why isn&rsquo;t it real-time?&rdquo;</p>
<p>They don&rsquo;t see the 147 downstream dependencies. The three systems that need a fuzzy-logic join. Or the security measures that go through three different subnetworks. The backfill that&rsquo;ll take 6 hours to run. The schema that hasn&rsquo;t been touched since 2021 because the last person who understood it left the company long ago.</p>
<p>This is the paradox of data engineering: when you do your job, you&rsquo;re invisible. When anything goes wrong, you&rsquo;re under a microscope.</p>
<h2 id="the-epochs-a-50-year-journey">The Epochs: A 50-Year Journey</h2>
<p>To understand where we are today, you need to understand where we came from.</p>
<h3 id="1970s-the-beginning">1970s: The Beginning</h3>
<p>Edgar F. Codd proposed the relational model in 1970, the foundation of [[SQL]]: a way to abstract the complexities of data storage. By the 1980s, SQL became the standard. IBM built System R. Oracle launched their RDBMS in 1979.</p>
<p>The foundation was laid. But nobody called it &ldquo;data engineering&rdquo; yet.</p>
<h3 id="1980s-1990s-the-warehouse-era">1980s-1990s: The Warehouse Era</h3>
<p>[[Bill Inmon]] formalized data warehousing principles in the 1980s. Many call him the father of data warehousing. Then in 1996, Ralph Kimball published &ldquo;[[The Data Warehouse Toolkit (Ralph Kimball)|The Data Warehouse Toolkit]]&rdquo; and gifted us with [[dimensional modeling]]—star schemas, fact tables, slowly changing dimensions.</p>
<p>These concepts? They&rsquo;re still relevant today.</p>
<h3 id="2000s-when-big-changed-everything">2000s: When &ldquo;Big&rdquo; Changed Everything</h3>
<p>The dot-com bubble burst. Tech titans such as Google, Amazon, and Yahoo were born, hitting walls their databases couldn&rsquo;t scale past.</p>
<p>So Google released two [[Data Engineering Whitepapers|groundbreaking papers]]: the Google File System in 2003, MapReduce in 2004. Yahoo responded with Hadoop in 2006. Hardware prices plummeted.</p>
<p>Suddenly, we weren&rsquo;t just BI engineers anymore. We were &ldquo;<strong>Big Data Engineers</strong>&rdquo;. We had to know traditional relational databases AND the new open-source filesystems. The skillset kept expanding—from data modeling to software development to mastering Hive and Spark, all coordinated with R and Python.</p>
<p>The term &ldquo;big&rdquo; was everywhere. But how big is &ldquo;big&rdquo;? Nobody really knew. We just knew the old ways weren&rsquo;t working anymore. And Facebook and co showed us the way.</p>
<h3 id="2010s-the-cloud-changes-the-game">2010s: The Cloud Changes the Game</h3>
<p>Amazon announced AWS. Google Cloud and Azure followed. Companies no longer needed to own hardware. The flexibility was unprecedented, and we could get any DWH on demand.</p>
<p>Redshift. Snowflake. And then the open-source wave hit:</p>
<ul>
<li>Airflow for orchestration (2014)</li>
<li>Superset for visualization (2015)</li>
<li>dbt for transformation (2016)</li>
</ul>
<p>And in 2017, Maxime Beauchemin—after creating both Airflow and Superset—published &ldquo;<a href="https://medium.com/free-code-camp/the-rise-of-the-data-engineer-91be18f1e603" target="_blank" rel="noopener noreffer">The Rise of the Data Engineer</a>&rdquo;. He defined, for the first time, what data engineering actually meant. He explained the shift from business intelligence to data engineering.</p>
<p>I remember releasing my first viral article in March 2018: &ldquo;<a href="https://www.ssp.sh/blog/data-engineering-the-future-of-data-warehousing/" target="_blank" rel="noopener noreffer">Data Engineering, the future of Data Warehousing?</a>&rdquo; It got 200 likes. Back then, that was a lot 😉.</p>
<p>Since then? New technologies appeared weekly. The [[Modern Data Stack]] was born.</p>
<h3 id="2020s-devops-meets-data-engineering">2020s: DevOps Meets Data Engineering</h3>
<p>This is where it gets interesting.</p>
<p>Data engineering isn&rsquo;t just about moving data anymore. It&rsquo;s about <strong>infrastructure as code</strong>, version control for data, CI/CD pipelines, Kubernetes, Docker, and Terraform.</p>
<p>The skills needed have exploded. You need to know:</p>
<ul>
<li>SQL (still the foundation)</li>
<li>Python or Scala</li>
<li>Cloud infrastructure (AWS/GCP/Azure)</li>
<li>Linux and bash scripting</li>
<li>Git for version control</li>
<li>Data modeling (the lost art)</li>
<li>Business logic (the most important)</li>
</ul>
<p>DevOps principles are now [[The State of DevOps in Data Engineering|table stakes]]. You&rsquo;re not just building pipelines. You&rsquo;re building systems that need to self-heal, auto-scale, and deploy without downtime on any environment.</p>
<p>And today? AI agents? They&rsquo;re the latest chapter. But under all the hype is the same eternal truth: <strong>you need fresh, organized, clean data.</strong></p>
<h2 id="the-eternal-loop-same-problems-new-tools">The Eternal Loop: Same Problems, New Tools</h2>
<p>Here&rsquo;s the uncomfortable truth: we&rsquo;ve been solving the same problems for 50 years.</p>
<p>In 2005, we had SSIS and star schemas. &ldquo;The cube is rebuilding&rdquo; was the pain point.</p>
<p>In 2015, we had Hadoop and Spark. &ldquo;The cluster is full&rdquo; was the nightmare.</p>
<p>In 2025, we have dbt and Snowflake. &ldquo;The bill is how much?&rdquo; is the new horror story.</p>
<p>The tools change. The problems don&rsquo;t.</p>
<p>Last month I analyzed a 200-line dbt model as part of a larger GitHub repository. You know what it was doing? Exactly what we did in 2005 with stored procedures. Same business logic. Different syntax. I laughed. Then I cried a little. (just kidding, I didn&rsquo;t 😆)</p>
<p>An old data warehouse architect from 2003 once drew a star schema on a whiteboard in 40 seconds. It would take my team three sprints to model in Oracle Warehouse Builder (OWB). He said, &ldquo;We called it just another day at the office&rdquo;.</p>
<p>We&rsquo;re not really any smarter than the people before us. We just have better marketing 😉.</p>
<h2 id="what-actually-matters-and-what-doesnt">What Actually Matters (And What Doesn&rsquo;t)</h2>
<p>Here&rsquo;s what I&rsquo;ve learned after 20+ years.</p>
<h3 id="the-excel-file-that-saved-me">The Excel File That Saved Me</h3>
<p>I was in a coffee meeting with a finance analyst—call her Maria. Fifteen years at the company. She opened her laptop and showed me an Excel file (sometimes it was Microsoft Access DB with a custom UI!).</p>
<p>Forty-seven tabs. Formulas referencing other files on a shared drive. VBA macros from 2012. VLOOKUP nested inside SUMIF.</p>
<p>&ldquo;This is how we calculate the quarterly forecast&rdquo;.</p>
<p>From my perspective of making it available to everyone and needing to understand it, I was a little shocked. I&rsquo;d spent three weeks reverse-engineering the business logic, trying to understand it, trying to recreate it in SQL Server, and adding it to our data warehouse. But the numbers were never the same, close most of the time.</p>
<p>After having multiple such experiences, sometimes Microsoft Access databases with custom UI built in (!!!), I learned something. Though my initial reaction was shock and horror, I learned that <strong>Excel isn&rsquo;t the enemy</strong>. Excel is the <strong>business telling you what they actually need</strong>.</p>
<p>When someone asks to export to Excel, they&rsquo;re not rejecting your work. They&rsquo;re telling you something. Maybe your dashboard is too slow. Maybe they need to add a column you didn&rsquo;t think of. Maybe they just need to feel in control of their analysis.</p>
<p>Power users will overengineer everything, but ask them for the Excel file and you might get validated business logic and ETL code for free. Win-win.</p>
<h3 id="the-real-time-lie">The Real-Time Lie</h3>
<p>Everyone wants real-time. &ldquo;We need to see this data instantly&rdquo;, they say. &ldquo;For decision-making&rdquo;.</p>
<p>I always ask: &ldquo;What decision will you make differently if you see it 10 minutes sooner?&rdquo;</p>
<p>Most of the time, they can&rsquo;t answer. There&rsquo;s a small percentage that needs it: air traffic control, fraud detection, Black Friday e-commerce. But the rest? They just think real-time serves them better.</p>
<p>Real-time adds much more complexity. Harder debugging. Harder backfills. Harder historization. The question is: for what? So someone can watch a number update every 30 seconds instead of every hour?</p>
<p>Push back on &ldquo;real-time&rdquo;. Start with hourly refreshes. It&rsquo;s almost always enough.</p>
<h3 id="the-schema-change-a-people-problem">The Schema Change: A People Problem</h3>
<p>They said it was small. Just renaming <code>user_id</code> to <code>customer_id</code>.</p>
<p>You trace the lineage. 147 downstream dependencies. Three teams. One undocumented view from 2019 that somehow powers the CEO&rsquo;s dashboard.</p>
<p>That&rsquo;s when you realize: <strong>schema changes are usually people problems</strong>, not technology problems. The reason things break is when upstream producers don&rsquo;t own responsibility for downstream analytics and don&rsquo;t communicate the changes. There&rsquo;s no process in place. Just assumptions.</p>
<p>Fix the people process first, then update the code.</p>
<h2 id="the-lost-art-of-data-modeling">The Lost Art of Data Modeling</h2>
<p>Max Beauchemin once <a href="https://www.heavybit.com/library/podcasts/data-renegades/ep-3-building-tools-that-shape-data-with-maxime-beauchemin" target="_blank" rel="noopener noreffer">said</a> in an interview: &ldquo;I like the analysis side. I think I&rsquo;m a good data modeler. It&rsquo;s kind of a lost art, so I still do a lot of our data pipelines&rdquo;.</p>
<p>He&rsquo;s right.</p>
<p>After years of &ldquo;just dump it in the data lake&rdquo;, people are rediscovering that structure matters. Data modeling forces you to think about:</p>
<ul>
<li><strong>[[Granularity|Grain]]</strong>: What&rsquo;s the lowest level of detail we need for this data?</li>
<li><strong>[[Entity Relationship Diagram (ERD)|Relationships]]</strong>: How do these entities connect?</li>
<li><strong>[[The Goal of Business Intelligence|What the business needs]]</strong>: Which insights users cannot get today, but which lie in the provided source system, or in combining it with other data sources.</li>
</ul>
<p>It&rsquo;s the difference between a data warehouse and a data dump.</p>
<p>But here&rsquo;s the thing: I believe AI will bring us back to the fundamentals. When AI-generated code breaks and you&rsquo;re out of context, what then? That&rsquo;s where the fundamentals save you. Data modeling. Understanding the grain. Knowing SQL deeply, not superficially<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>.</p>
<p>Someone needs to understand and refactor generated code. Someone needs to simplify. That someone is you.</p>
<h2 id="the-lost-code-we-inherit">The Lost Code We Inherit</h2>
<p>You&rsquo;ll inherit code from someone who left. Everyone does.</p>
<p>I once found a DAG called <code>final_v3_FIXED_REAL_FINAL.py</code>. Inside was a comment:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Mike: I don&#39;t know why this works. Just leave it</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Mike was right. I left it.</p>
<p><strong>The biggest pitfall?</strong> Trying to recreate everything to your taste. Accept the legacy. Adapt or improve one thing at a time. The motto &ldquo;Don&rsquo;t touch what works today&rdquo; really applies to legacy code most often.</p>
<p>Usually, the previous engineer wasn&rsquo;t naive or stupid. They were solving different problems with different constraints. Your job isn&rsquo;t to make it beautiful (sometimes it helps!). Your job is to keep it running while slowly making it better over time.</p>
<h2 id="the-books-that-actually-matter">The Books That Actually Matter</h2>
<p>As the cycles come and go, these books helped me throughout the cycle<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>, and can be applied to this day.</p>
<p><strong><a href="https://unidel.edu.ng/focelibrary/books/Designing%20Data-Intensive%20Applications%20The%20Big%20Ideas%20Behind%20Reliable,%20Scalable,%20and%20Maintainable%20Systems%20by%20Martin%20Kleppmann%20%28z-lib.org%29.pdf" target="_blank" rel="noopener noreffer">Designing Data-Intensive Applications</a></strong> by Martin Kleppmann about distributed systems and how to build them. Wait a little, version two is just around the corner.</p>
<p><strong>[[The Data Warehouse Toolkit (Ralph Kimball)|The Data Warehouse Toolkit]]</strong> by Ralph Kimball. Someone in 2045 will still need to understand fact tables and dimensional tables.</p>
<p><strong><a href="https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302" target="_blank" rel="noopener noreffer">Fundamentals of Data Engineering</a></strong> by Joe Reis and Matt Housley. A great start to know about all the concepts and principles you hear everywhere, including in this article.</p>
<p><strong><a href="https://dedp.online/" target="_blank" rel="noopener noreffer">Patterns of Data Engineering (PoDE)</a></strong> by me. If you want, you can also read my unfinished online book, which starts with the state of the art, the history and key <a href="https://dedp.online/part-1/2-overview-dedp/understanding-convergent-evolution.html" target="_blank" rel="noopener noreffer">convergent evolution in data engineering</a>, about the ever-returning cycle we talk about here, and explains them with higher-level patterns.</p>
<p>The first two books don&rsquo;t mention Snowflake, Lakehouse or dbt. They mention problems that existed in 1995 and will exist in 2045. That&rsquo;s the Lindy Effect, and how you know they&rsquo;re worth reading.</p>
<h2 id="what-i-know-now-that-i-wish-i-knew-then">What I Know Now That I Wish I Knew Then</h2>
<p>If I could go back to 2003 and talk to my younger self, here&rsquo;s what I&rsquo;d say. Boy:</p>
<p><strong>1. The tools will change. The fundamentals won&rsquo;t.</strong></p>
<p>Stop chasing every new framework. Learn [[data modeling]]. Learn how data flows. Learn SQL deeply, not superficially. Learn how humans make decisions. Everything else is syntax.</p>
<p>In 2026, AI helps us write code faster. But someone still needs to understand the fundamentals, the [[Data Engineering Lifecycle]]. That someone can be you.</p>
<p><strong>2. Talk to the business people.</strong></p>
<p>This is a crucial lesson in my journey. What you&rsquo;ll learn from them will make you inevitably a better data engineer. Technical skills can be learned, outsourced, automated. Knowledge about the business is much harder.</p>
<p>The best data engineers aren&rsquo;t the ones who know every new tool. They&rsquo;re the ones who know <em>why</em> the data matters.</p>
<p><strong>3. You&rsquo;re building the foundation, not the showcase.</strong></p>
<p>When the presentation goes well, the data scientist or the AI engineer gets the credit. When the executive makes the right decision, the analyst gets credit. When the dashboard loads fast, nobody says anything.</p>
<p>But when one number is wrong? Everyone sees you.</p>
<p>Accept this. You&rsquo;re a plumber. Be the <strong>best plumber in the world</strong> and make sure nobody ever thinks about the pipes.</p>
<p><strong>4. Data quality is learned through pain.</strong></p>
<p>You can&rsquo;t understand data quality from a textbook. You need to see bad data. If you start looking, it won&rsquo;t take long, and you&rsquo;ll see really bad, production-breaking data. That will teach you what good data looks like.</p>
<p>And you&rsquo;ll only get faster by talking to the people who use it.</p>
<p><strong>5. Presentation matters more than you think.</strong></p>
<p>No matter how fancy your pipeline, how elegant your code, how profound your insights—if the presentation isn&rsquo;t right or the data quality is terrible, no one cares.</p>
<p>Throughout my career, presenting data understandably has been as important as building the pipeline. That&rsquo;s why these days, I focus on the <a href="https://craft.ssp.sh/" target="_blank" rel="noopener noreffer">storyline and craft</a> extensively.</p>
<p><strong>6. Set boundaries early.</strong></p>
<p>This job will take everything you give it. The people who succeed aren&rsquo;t the ones who work 80-hour weeks. Sure, in the beginning you need it here and there too. But over time, you need to learn to [[Hell Yeah or No|say no]]. Document things so you can take vacation. Build systems that don&rsquo;t require you to be online at 3 AM.</p>
<p>Future you will thank you.</p>
<p><strong>7. Don&rsquo;t chase every trend.</strong></p>
<p>Data engineering is still going strong. Stronger than ever. AI won&rsquo;t take our jobs any time soon. The opposite is true. There will be more chaos, and people who know how to model data, understand business requirements, and deliver high-quality insights will always be needed.</p>
<p>Plus, every AI solution out there needs data, a lot of data, and probably a plumber to fix the pipeline too. Use the knowledge of past years building data pipelines. We don&rsquo;t need to rebuild everything every 5 years.</p>
<h2 id="the-loop-continues">The Loop Continues</h2>
<p>It&rsquo;s 2026. I&rsquo;m building a pipeline with DuckDB and Rill. The business wants faster dashboards and better insights. They want to edit data themselves. They want to use an AI chatbot. Or sometimes they just rename a column in the source system without telling anyone.</p>
<p>Here we go again.</p>
<p>But here&rsquo;s the thing: I still love it. Especially when I can write about the learnings.</p>
<p>I don&rsquo;t miss the late nights or the schema changes or the never-ending rewrites. I love the moment when you finally get the data right and someone in finance sees something they&rsquo;ve never seen before. When a dashboard actually changes a decision. When the CEO asks a question and you can answer it with data.</p>
<p>That&rsquo;s the job. Not the tools. Not the frameworks. Not the buzzwords.</p>
<p>The moment when data helps a human make a better decision.</p>
<h2 id="the-final-truth">The Final Truth</h2>
<p>The tools will change. The vendors will rise and fall. Snowflake will be replaced by something else. The latest new shiny tool will become the legacy tomorrow. AI agents will be the next big thing, and then something after that.</p>
<p>But someone, somewhere, will always need to:</p>
<ul>
<li>Understand the grain of a business</li>
<li>Know why the numbers don&rsquo;t match</li>
<li>Explain to the CEO that the data they want doesn&rsquo;t exist yet</li>
<li>Debug why a pipeline broke at 2 AM</li>
<li>Figure out why production data looks different from dev data</li>
</ul>
<p>That someone is you.</p>
<p>You&rsquo;re the invisible plumber. The unsung engineer. The person who makes sure the foundation doesn&rsquo;t crumble while everyone else builds on <a href="https://xkcd.com/2347/" target="_blank" rel="noopener noreffer">top of it</a>.</p>
<p>And honestly? It&rsquo;s a pretty damn good job if you like to work quietly, helping a large part of the business.</p>
<p>Because 50 years from now, when we&rsquo;re using tools we can&rsquo;t even imagine today, someone will still be ingesting data, modeling it, transforming it, serving it. Someone will ask for a change. Something will break.</p>
<p>The loop continues. The problems remain. Only the tools change.</p>
<p>And that&rsquo;s okay. Isn&rsquo;t that somehow beautiful? Because beneath all the hype, all the new frameworks, all the promises of &ldquo;this time it&rsquo;s different&rdquo;—there&rsquo;s you, the data engineer 😉. Understanding the data. Knowing the business. Building the foundation.</p>
<p>That&rsquo;s <em>[[Data Engineering]]</em>.</p>
<blockquote>
<p>[!tip] Inspiration</p>
<p><em>This piece was inspired by the confessional storytelling style of <a href="https://www.youtube.com/@TheDiaryOfACEO" target="_blank" rel="noopener noreffer">Diary of a CEO</a>. If you enjoyed this format applied to data engineering, let me know—I&rsquo;d love to hear your own stories from the field.</em></p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>I wrote more at [[Will AI replace Humans|Will AI Replace Human Thinking]]&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>I collect [[Books of Data Engineering]] in my data engineering brain; you&rsquo;ll find more interesting ones there too.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
</item>
<item>
    <title>Summer Data Engineering Roadmap</title>
    <link>https://www.ssp.sh/blog/data-engineering-roadmap/</link>
    <pubDate>Wed, 06 Aug 2025 14:04:08 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/data-engineering-roadmap/</guid><enclosure url="https://www.ssp.sh/blog/data-engineering-roadmap/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>With this summer edition, you&rsquo;ll have a roadmap for your vacation time to learn the basics of being a full-stack data engineer. Fill your knowledge gaps, refresh the basics, or learn with a curated list and path towards a full-time data engineer.</p>
<p>After covering the essential toolkit in <a href="/blog/data-engineering-toolkit/" rel="">Part 1</a> (essential tools for your machine) and <a href="/blog/data-engineering-toolkit-devops-iac/" rel="">Part 2</a> (infrastructure and DevOps), this article teaches you <strong>how</strong> and in <strong>what order</strong> to learn these skills. The roadmap provides a structured path to level up during the slower summer months.</p>
<p>The roadmap is organized into 3 weeks that you can learn at your own pace and time availability:</p>
<ul>
<li><strong>Week 1</strong>: Foundation (SQL, Git, Linux basics)</li>
<li><strong>Week 2</strong>: Core Engineering (Python, Cloud, Data Modeling)</li>
<li><strong>Week 3</strong>: Advanced Topics (Streaming, Data Quality, DevOps)</li>
</ul>
<p>![[de-roadmap.webp]]</p>
<p><strong>How to use this guide</strong>: Each section contains curated resources (articles, videos, tutorials) for that topic. Click on the links that interest you most. It&rsquo;s meant as a guided roadmap to learn the fundamentals of a &ldquo;full stack&rdquo; data engineer.</p>
<blockquote>
<p>[!tip] Learning at Your Own Pace</p>
<p>While structured as a three-week program, everyone learns differently. Pick what&rsquo;s most relevant to your goals and skip sections you won&rsquo;t need immediately or in the near-term future. Consistency matters more than speed. Sometimes we forget how far 30 minutes a day can take us. And no, after three weeks, you won&rsquo;t know everything you need to know, but you&rsquo;ll be able to understand the problems and identify potential angles to solve them.</p>
</blockquote>
<h2 id="week-1-foundation-and-core-skills">Week 1: Foundation and Core Skills</h2>
<p>Let&rsquo;s get started with building your technical foundation skills for data engineering.</p>
<p>You can learn the foundational skills in many ways: there are bootcamps, courses, blogs, YouTube videos, hands-on projects, and many more ways to learn them (free and paid ones), including the more advanced skills.</p>
<h3 id="sql-foundations">SQL Foundations</h3>
<p>Probably the most important skill of any data engineer, at any level, whether they are closer to the business or more technical, is SQL, the language of data. With SQL you can describe what you want from your data far more precisely than with natural language in an LLM workflow. That&rsquo;s why it will always be a core skill. In plain English, for example, you rarely specify the partitions or the exact date range (does it include or exclude the current month?). Many such details have to be defined in your WHERE clause or your SELECT list, and they get lost otherwise.</p>
<p>To get started with SQL until you master it, you can follow this roadmap below:</p>
<ul>
<li>Start with <a href="https://www.w3schools.com/sql/sql_intro.asp" target="_blank" rel="noopener noreffer">understanding SQL</a>.</li>
<li>Database design principles, from <a href="https://www.freecodecamp.org/news/learn-relational-database-basics-key-concepts-for-beginners/" target="_blank" rel="noopener noreffer">relational database basics to key concepts for beginners</a>. Learn DDL (<code>ALTER</code>, <code>CREATE</code>), DML (<code>INSERT</code>, <code>UPDATE</code>, <code>DELETE</code>), and <a href="https://www.geeksforgeeks.org/dbms/introduction-of-relational-model-and-codd-rules-in-dbms/" target="_blank" rel="noopener noreffer">relational theory by Edgar F. Codd</a>, who invented the theoretical basis for relational databases.</li>
<li>Advanced SQL queries, such as <a href="https://mode.com/sql-tutorial/sql-window-functions/" target="_blank" rel="noopener noreffer">Window functions</a> for advanced aggregations without extra subqueries, and <a href="https://www.sqltutorial.org/sql-cte/" target="_blank" rel="noopener noreffer">CTEs</a>, a powerful syntax for better readability, naming sub-queries, and even recursion (a small sketch follows this list).</li>
<li><a href="https://www.geeksforgeeks.org/dbms/acid-properties-in-dbms/" target="_blank" rel="noopener noreffer">ACID properties and transactions</a> within databases such as Postgres, MySQL, and DuckDB.</li>
<li>Learn the differences between OLTP vs. OLAP with a <a href="https://www.datacamp.com/blog/oltp-vs-olap" target="_blank" rel="noopener noreffer">beginner&rsquo;s guide</a>. Also, check out an explainer of <a href="https://motherduck.com/learn-more/what-is-OLAP/" target="_blank" rel="noopener noreffer">What is OLAP?</a></li>
<li><a href="https://medium.com/@suffyan.asad1/getting-started-with-dbt-data-build-tool-a-beginners-guide-to-building-data-transformations-28e335be5f7e" target="_blank" rel="noopener noreffer">dbt core</a> and <a href="https://thedatatoolbox.substack.com/p/getting-started-with-sqlmesh-a-comprehensive" target="_blank" rel="noopener noreffer">SQLMesh</a>: frameworks to encapsulate SQL into a structure that can be versioned, tested, and run in order, including well-documented lineage as a web page.</li>
</ul>
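<p>To make window functions and CTEs concrete, here is a minimal, hedged sketch using DuckDB from Python; the <code>orders</code> table is made up purely for illustration:</p>
<pre><code class="language-python" data-lang="python">import duckdb

# Hypothetical example data; in practice this would be one of your own tables.
con = duckdb.connect()
con.sql("""
    CREATE TABLE orders AS
    SELECT * FROM (VALUES
        (1, 'alice', DATE '2025-01-03', 120.0),
        (2, 'alice', DATE '2025-02-10',  80.0),
        (3, 'bob',   DATE '2025-01-15', 200.0)
    ) AS t(order_id, customer, order_date, amount)
""")

# A CTE plus a window function: running revenue per customer over time.
print(con.sql("""
    WITH monthly AS (
        SELECT customer, date_trunc('month', order_date) AS month, sum(amount) AS revenue
        FROM orders
        GROUP BY ALL
    )
    SELECT customer, month, revenue,
           sum(revenue) OVER (PARTITION BY customer ORDER BY month) AS running_revenue
    FROM monthly
    ORDER BY customer, month
"""))
</code></pre>
<p>The CTE (<code>monthly</code>) keeps the aggregation readable, and the window function adds the running total without another subquery.</p>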
<h3 id="version-control">Version Control</h3>
<p>If you use SQL, you&rsquo;ll very quickly want to collaborate with coworkers and version your work, so you don&rsquo;t lose essential changes and can roll back bugs you introduced.</p>
<p>Therefore, you need version control. This short chapter gives you some starting points for the most common tool, Git.</p>
<ul>
<li>What is version control - <a href="https://betterexplained.com/articles/a-visual-guide-to-version-control/" target="_blank" rel="noopener noreffer">a visual guide to version control</a>.</li>
<li>The tool, <a href="https://www.coursera.org/learn/version-control-with-git" target="_blank" rel="noopener noreffer">Git fundamentals</a>.</li>
<li>GitHub/GitLab Collaboration: Learn about platforms like GitHub and GitLab for hosting Git repositories and for sharing and collaborating with others. Main features include Pull Requests and Issues for <a href="https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests" target="_blank" rel="noopener noreffer">communicating your changes in a structured</a> way.</li>
<li>Learn the different <a href="https://www.atlassian.com/git/tutorials/comparing-workflows" target="_blank" rel="noopener noreffer">git workflows</a>. Also, check out <a href="https://dev.to/yankee/practical-guide-to-git-worktree-58o0" target="_blank" rel="noopener noreffer">git worktree</a>. Although it&rsquo;s a bit advanced, it&rsquo;s good to know it&rsquo;s there, especially if you need to <strong>work on different branches simultaneously</strong> without constantly stashing or committing your unfinished changes before switching to another branch.</li>
</ul>
<p>There are many more helpful topics, such as GitHub Actions/Pipelines for CI/CD or basic automation (uploading documents to a website, checking grammar automatically before publishing, etc.). However, for the first week, let&rsquo;s keep it simple and move on to the next chapter: Linux and scripting.</p>
<h3 id="environment-setup-linux-fundamentals--basic-scripting">Environment Setup, Linux Fundamentals &amp; Basic Scripting</h3>
<p>Set up your development environment and master essential Linux skills for data engineering. This depends on your operating system of choice, too, but most data engineering tasks run on servers, and in almost all cases those are Unix-based systems. That&rsquo;s why Linux fundamentals are key to elevating your data engineering skills.</p>
<p>Below are the resources and roadmap to learn about these topics:</p>
<ul>
<li><a href="https://www.freecodecamp.org/news/bash-scripting-tutorial-linux-shell-script-and-command-line-for-beginners/" target="_blank" rel="noopener noreffer">Bash scripting essentials</a>, starting with the basics of bash scripting, including variables, commands, inputs/outputs, and debugging. Alternatively, use this course with an interactive command line in the browser: <a href="https://www.codecademy.com/learn/learn-the-command-line" target="_blank" rel="noopener noreffer">Linux command line basics</a> (Paid).</li>
<li>Package managers (Apt, yum, Homebrew, Wget): <a href="https://www.geeksforgeeks.org/techtips/apt-and-yum-package-managers-in-linux/" target="_blank" rel="noopener noreffer">How to Use Package Managers in Linux? (APT and YUM)</a> and <a href="https://brew.sh/" target="_blank" rel="noopener noreffer">Homebrew for macOS</a></li>
<li><a href="https://www.hostinger.com/tutorials/ssh-tutorial-how-does-ssh-work" target="_blank" rel="noopener noreffer">SSH and remote connections</a>: Connecting to a remote server and fixing a DAG or updating a script on the fly.</li>
<li>Development environment setup: Simple yet powerful dev setups:  <a href="https://ghostinthedata.info/posts/2025/2025-02-02-setting-up-your-data-engineering-environment-on-macos/" target="_blank" rel="noopener noreffer">MacOS setup</a> with pyenv, docker, uv, VSCode, Linux (<a href="https://github.com/basecamp/omakub" target="_blank" rel="noopener noreffer">Omakub</a>, <a href="https://github.com/basecamp/omarchy" target="_blank" rel="noopener noreffer">Omarchy</a>) and <a href="https://medium.com/bitgrit-data-science-publication/how-to-setup-a-windows-laptop-for-data-science-e56ee3f0dcf0" target="_blank" rel="noopener noreffer">Windows Setup for data scientist</a>.</li>
<li><a href="https://ostechnix.com/a-beginners-guide-to-cron-jobs/" target="_blank" rel="noopener noreffer">Cron jobs and scheduling</a>: Basic automation scripts without the need for a heavy tool.</li>
</ul>
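<p>As a minimal sketch of the cron item above: a tiny Python script you could schedule daily. The paths, filenames, and table are assumptions for illustration only:</p>
<pre><code class="language-python" data-lang="python"># daily_load.py - a small ingestion script you might run on a schedule.
# Hypothetical crontab entry (runs every day at 06:00):
#   0 6 * * * /usr/bin/python3 /opt/scripts/daily_load.py
import duckdb
from datetime import date

con = duckdb.connect("/opt/data/warehouse.duckdb")  # assumed local warehouse file
con.sql("CREATE TABLE IF NOT EXISTS raw_events (event_date DATE, payload VARCHAR)")

# Placeholder for the real extraction step (an API call, a file drop, ...).
con.execute("INSERT INTO raw_events VALUES (?, ?)", [date.today(), "example payload"])
print("loaded", con.sql("SELECT count(*) FROM raw_events").fetchone()[0], "rows total")
</code></pre>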
<p>Congratulations, this wraps up week one. If you have watched, experimented, and taken notes, you now possess the fundamentals of data engineering and, frankly, any engineering or technical job. Give yourself some time to ponder and review, and then proceed to week two below.</p>
<h2 id="week-2-core-data-engineering">Week 2: Core Data Engineering</h2>
<p>Week two is all about the essential data concepts, primarily established principles for manipulating and architecting data flows for data engineering tasks.</p>
<h3 id="data-modeling--warehousing">Data Modeling &amp; Warehousing</h3>
<p>To avoid a sprawl of independent SQL queries and persisted tables that aren&rsquo;t connected to each other, we need to model our data with a more holistic approach.</p>
<p>This is where the concepts of data modeling and the long-standing term data warehousing originate. Their sole purpose is to organize data for consumption, whereas data in Postgres and other operational databases is organized for transactional writes.</p>
<p>This chapter points you to the key knowledge you need to model enterprise workloads.</p>
<ul>
<li><strong><a href="https://www.integrate.io/blog/mastering-data-warehouse-modeling/" target="_blank" rel="noopener noreffer">Data modeling</a></strong> is a significant one, and somewhat underappreciated these days. However, with the rise of AI and automation, it hasn&rsquo;t been more critical to learn.
<ul>
<li><a href="https://www.getdbt.com/blog/guide-to-dimensional-modeling" target="_blank" rel="noopener noreffer">Dimensional modeling</a> with a <a href="https://learn.microsoft.com/en-us/fabric/data-warehouse/dimensional-modeling-overview" target="_blank" rel="noopener noreffer">star schema</a>.</li>
<li><a href="https://www.datacamp.com/blog/star-schema-vs-snowflake-schema" target="_blank" rel="noopener noreffer">Snowflake schema vs star schema</a>: Understanding when to use normalized vs denormalized dimension tables.</li>
<li><a href="https://www.freecodecamp.org/news/database-normalization-1nf-2nf-3nf-table-examples/" target="_blank" rel="noopener noreffer">Data normalization</a>: 1NF, 2NF, 3NF principles for reducing data redundancy</li>
<li><a href="https://www.montecarlodata.com/blog-fact-vs-dimension-tables-in-data-warehousing-explained/" target="_blank" rel="noopener noreffer">Fact tables vs dimension tables</a>: Understanding measures, metrics, and descriptive attributes.</li>
<li><a href="https://www.ssp.sh/brain/granularity/" target="_blank" rel="noopener noreffer">Granularity</a> is a key concept to understand, so your facts will not suffer from too low detail that is slow, or too high-level detail that loses crucial information when drilling down in a dashboard.</li>
</ul>
</li>
<li>Data warehouse design methodologies:
<ul>
<li><a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/" target="_blank" rel="noopener noreffer">Kimball methodology</a>: Bottom-up, business process-focused approach.</li>
<li><a href="https://www.astera.com/type/blog/data-warehouse-concepts/" target="_blank" rel="noopener noreffer">Inmon methodology</a>: Top-down, enterprise data model approach.</li>
<li><a href="https://www.scalefree.com/blog/data-vault/quick-guide-of-a-data-vault-2-0-implementation/" target="_blank" rel="noopener noreffer">Data Vault 2.0</a>: An approach with hubs, links, and satellites for agility and scalability.</li>
</ul>
</li>
<li>Advanced modeling concepts:
<ul>
<li><a href="https://learn.microsoft.com/en-us/fabric/data-factory/slowly-changing-dimension-type-two" target="_blank" rel="noopener noreffer">Slowly changing dimensions</a>: Handling changes in dimension data over time.</li>
<li><a href="https://www.kimballgroup.com/2012/02/design-tip-142-building-bridges/" target="_blank" rel="noopener noreffer">Bridge tables and many-to-many relationships</a>: Managing complex relationships in dimensional models.</li>
</ul>
</li>
</ul>
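<p>As a small, hedged sketch of the fact/dimension and grain ideas above (all tables and columns invented for illustration), here is a star-schema-style join in DuckDB from Python:</p>
<pre><code class="language-python" data-lang="python">import duckdb

con = duckdb.connect()
# Dimension: one row per customer, descriptive attributes only.
con.sql("""
    CREATE TABLE dim_customer AS
    SELECT * FROM (VALUES (1, 'alice', 'DE'), (2, 'bob', 'CH'))
        AS t(customer_key, customer_name, country)
""")
# Fact: one row per order line (the grain), with measures and foreign keys.
con.sql("""
    CREATE TABLE fact_order_line AS
    SELECT * FROM (VALUES
        (100, 1, DATE '2025-03-01', 2, 19.90),
        (101, 1, DATE '2025-03-05', 1,  9.90),
        (102, 2, DATE '2025-03-07', 5,  4.50)
    ) AS t(order_line_id, customer_key, order_date, quantity, unit_price)
""")
# Typical star-schema query: aggregate the measures, slice by dimension attributes.
print(con.sql("""
    SELECT d.country, sum(f.quantity * f.unit_price) AS revenue
    FROM fact_order_line f
    JOIN dim_customer d USING (customer_key)
    GROUP BY ALL
    ORDER BY revenue DESC
"""))
</code></pre>
<p>Keeping the fact table at a single, explicit grain (one row per order line) is what makes later aggregations and drill-downs predictable.</p>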
<h3 id="python-for-data-engineering--workflow-orchestration">Python for Data Engineering &amp; Workflow Orchestration</h3>
<p>After SQL, Python is the next most important language to learn. While it&rsquo;s beneficial to have deep knowledge about SQL, and you only need preliminary Linux skills to get around a server and run some commands from the command line, Python is the utility language of data. It&rsquo;s the <strong>glue code that connects everything</strong> you can&rsquo;t achieve with SQL, most notably working with external systems and orchestrating your data workflows with Python libraries and frameworks.</p>
<p>Orchestration and other more modern tools help you automate and organize, as well as version your data tasks and pipelines.</p>
<ul>
<li>Starting with a <a href="https://realpython.com/python-beginner-tips/" target="_blank" rel="noopener noreffer">Python general introduction</a>.</li>
<li><a href="https://motherduck.com/blog/duckdb-python-e2e-data-engineering-project-part-1/" target="_blank" rel="noopener noreffer">DataFrame and data manipulation</a> with Pandas, Polars and <a href="https://www.youtube.com/watch?v=ZX5FdqzGT1E" target="_blank" rel="noopener noreffer">DuckDB</a>. <a href="https://motherduck.com/learn-more/dataframes/" target="_blank" rel="noopener noreffer">Navigating the Dataframe Landscape</a> and <a href="https://motherduck.com/blog/duckdb-versus-pandas-versus-polars/" target="_blank" rel="noopener noreffer">DuckDB vs Pandas vs Polars for Python Developers</a>, <a href="https://www.youtube.com/watch?v=4DIoACFItec" target="_blank" rel="noopener noreffer">Video Format</a></li>
<li>Python libraries for <a href="https://realpython.com/python-pydantic/" target="_blank" rel="noopener noreffer">Data validation with Pydantic</a> or <a href="https://docs.pytest.org/en/stable/getting-started.html" target="_blank" rel="noopener noreffer">Data Testing with pytest</a>.</li>
<li>Utilitarian Python knowledge: building your own API quickly with <a href="https://fastapi.tiangolo.com/tutorial/" target="_blank" rel="noopener noreffer">FastAPI</a>.</li>
<li>Workflow orchestration is almost as important as the Python language itself. <a href="https://airflow.apache.org/docs/apache-airflow/stable/index.html" target="_blank" rel="noopener noreffer">Apache Airflow</a> is the biggest name. You learn about task dependencies and scheduling, as well as how orchestration ties data tools and stacks together through workflow management (a minimal DAG sketch follows this list). Also, check out related <a href="https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html" target="_blank" rel="noopener noreffer">DAG design patterns</a> for guidance on designing pipelines that are easy to maintain and that separate business logic from technical logic in an organized, conventional manner.</li>
</ul>
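<p>Here is a minimal Airflow DAG sketch using the TaskFlow API, assuming a recent Airflow 2.x installation; the task bodies are placeholders, not a real pipeline:</p>
<pre><code class="language-python" data-lang="python">from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def daily_pipeline():
    @task
    def extract():
        # Placeholder: call an API or read a file here.
        return [{"id": 1, "amount": 42.0}]

    @task
    def load(rows):
        # Placeholder: write the rows into your warehouse here.
        print(f"loading {len(rows)} rows")

    load(extract())  # defines the dependency: extract runs before load

daily_pipeline()
</code></pre>
<p>Even this tiny example shows the core ideas: tasks, dependencies between them, and a schedule the orchestrator owns.</p>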
<blockquote>
<p>[!example] Example Data Sets to Test for Yourself</p>
<p>To manipulate data or create an example project, you can use the provided datasets out of the box with DuckDB: <a href="https://motherduck.com/docs/getting-started/sample-data-queries/datasets/" target="_blank" rel="noopener noreffer">Example Datasets</a>, containing interesting datasets such as HackerNews, Foursquare, PyData, StackOverflow, and many more.</p>
</blockquote>
<h3 id="cloud-platforms-introduction">Cloud Platforms Introduction</h3>
<p>Getting to know major cloud platform providers can save you a significant amount of time and enhance your employability because you know how to work around permissions, the services provided, and how to automate specific tasks. Ensure you select the right provider based on your location and primary use, or the company you prefer to work for.</p>
<ul>
<li>Introduction to <a href="https://aws.amazon.com/getting-started/" target="_blank" rel="noopener noreffer">AWS</a>, <a href="https://azure.microsoft.com/en-us/get-started" target="_blank" rel="noopener noreffer">Azure</a>, or <a href="https://cloud.google.com/docs/get-started/" target="_blank" rel="noopener noreffer">Google Cloud</a>. Vital is permission management, such as security and IAM basics, on all platforms.</li>
<li>Dedicated data services: <a href="https://motherduck.com/" target="_blank" rel="noopener noreffer">MotherDuck</a>, <a href="https://cloud.google.com/bigquery/docs/quickstarts" target="_blank" rel="noopener noreffer">BigQuery</a>, <a href="https://learn.microsoft.com/en-us/fabric/" target="_blank" rel="noopener noreffer">Fabric</a>, <a href="https://cloud.google.com/composer/docs" target="_blank" rel="noopener noreffer">hosted Airflow</a> (Azure &amp; AWS).</li>
<li><a href="https://lakefs.io/blog/object-storage/" target="_blank" rel="noopener noreffer">Object Storage or blob storage setup</a> on all platforms.</li>
</ul>
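<p>As a small sketch of working with object storage (the bucket, path, and region are assumptions, and you also need valid credentials), DuckDB can query Parquet files in S3 directly from Python:</p>
<pre><code class="language-python" data-lang="python">import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
# Credentials would normally come from environment variables or an IAM role.
con.sql("SET s3_region='eu-central-1';")

# Hypothetical bucket and path; the Parquet files are queried in place.
df = con.sql("""
    SELECT order_date, sum(amount) AS revenue
    FROM read_parquet('s3://my-example-bucket/orders/*.parquet')
    GROUP BY ALL
    ORDER BY order_date
""").df()
print(df.head())
</code></pre>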
<p>Depending on the role your resume lands you, you&rsquo;ll do different work. But some form of analytics through business intelligence (BI) is always involved. Visualizing your data and presenting it in a way that makes sense immediately is hard; that&rsquo;s where BI tools and data visualization come into play.</p>
<ul>
<li>Introduction to BI tools and using <a href="https://motherduck.com/docs/getting-started/interfaces/motherduck-quick-tour/" target="_blank" rel="noopener noreffer">notebooks</a>. Others are Jupyter Notebooks, Hex, DeepNote, and many more. Check <a href="https://www.geeksforgeeks.org/data-analysis-and-visualization-with-jupyter-notebook/" target="_blank" rel="noopener noreffer">Jupyter notebooks for analytics</a>, which is a super helpful toolkit for data analysis and iteration.</li>
<li><a href="https://atlan.com/metrics-layer/" target="_blank" rel="noopener noreffer">Metrics and KPI design</a> with metrics layers and semantics.</li>
<li><a href="https://www.toptal.com/designers/data-visualization/data-visualization-best-practices" target="_blank" rel="noopener noreffer">Data visualization best practices</a>. Tools like <a href="https://www.datawrapper.de/blog/10-ways-to-use-fewer-colors-in-your-data-visualizations" target="_blank" rel="noopener noreffer">color management</a> and a <a href="https://vega.github.io/vega-lite/" target="_blank" rel="noopener noreffer">high-level grammar of interactive graphics</a> help understand data presentation. <a href="https://www.controlling-strategy.com/hichert-success-regeln.html" target="_blank" rel="noopener noreffer">Hichert SUCCESS Rules</a> is another great option, although it is only available in German. Check also <a href="https://www.youtube.com/watch?v=F9yHuAO50PQ&amp;t=2s" target="_blank" rel="noopener noreffer">Data Visualization with Hex/Preset and DuckDB/MotherDuck</a>.</li>
<li><a href="https://www.rilldata.com/blog/has-self-serve-bi-finally-arrived-thanks-to-ai" target="_blank" rel="noopener noreffer">Self-service analytics</a> enables business people to serve themselves.</li>
</ul>
<p>This concludes Week 2. You&rsquo;re ready to tackle the advanced topics in Week 3.</p>
<h2 id="week-3-advanced-topics">Week 3: Advanced Topics</h2>
<p>This final week focuses on advanced topics: stream processing and event-driven approaches, data quality and observability, cost optimization, and DevOps.</p>
<p>Some of these are rarer approaches that you can skip initially, but there will come a time when you need each of them.</p>
<h3 id="stream-processing--event-driven-data">Stream Processing &amp; Event-Driven Data</h3>
<p>Event-driven approaches, where you integrate your data as a stream end-to-end from source to analytics, are sometimes a must and business-critical, especially in ad-tech or sports, where you need live results that are as up-to-date as possible.</p>
<p>Understanding stream processing fundamentals also helps you validate requests for real-time insights: users will often ask for them, but they&rsquo;re not always necessary.</p>
<ul>
<li><a href="https://codeopinion.com/change-data-capture-event-driven-architecture/" target="_blank" rel="noopener noreffer">Event-driven architecture</a> and design practices: How do they differ from batch loads? Key players in this category are <a href="https://howtodoinjava.com/kafka/apache-kafka-tutorial/" target="_blank" rel="noopener noreffer">Apache Kafka</a> and <a href="https://dev.to/mage_ai/getting-started-with-apache-flink-a-guide-to-stream-processing-e19" target="_blank" rel="noopener noreffer">Flink</a>.</li>
<li>Real-time analytics patterns: <a href="https://www.datacamp.com/blog/change-data-capture" target="_blank" rel="noopener noreffer">Change Data Capture (CDC)</a> and the difference in propagating that stream compared to batch. See <a href="https://bryteflow.com/postgres-cdc-6-easy-methods-capture-data-changes/" target="_blank" rel="noopener noreffer">Postgres change data capture possibilities</a>.</li>
</ul>
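<p>To make the event-driven idea a bit more tangible, here is a minimal consumer sketch with the <code>kafka-python</code> library; the topic, broker, and message shape are assumptions, loosely modeled on a CDC-style change feed:</p>
<pre><code class="language-python" data-lang="python">import json

from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker; in a CDC setup this might carry row-change events.
consumer = KafkaConsumer(
    "orders.changes",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Placeholder: upsert the changed row into your analytical store here.
    print(event.get("op"), event.get("after"))
</code></pre>
<p>The key difference from batch is visible even here: the loop never ends, and every event is handled as it arrives instead of in a scheduled window.</p>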
<h3 id="data-quality--testing">Data Quality &amp; Testing</h3>
<p>Implementing robust data quality frameworks and testing strategies is crucial for maintaining a stable data platform. Most often, it&rsquo;s quick to set up a data platform, or a stack to extract analytics from your data, but doing it stably and with high data quality is an entirely different job. The tools in this chapter will help you with that.</p>
<ul>
<li>Great Expectations and other <a href="https://www.startdataengineering.com/post/implement_data_quality_with_great_expectations/" target="_blank" rel="noopener noreffer">data quality frameworks</a>.</li>
<li><a href="https://docs.dagster.io/guides/test/unit-testing-assets-and-ops" target="_blank" rel="noopener noreffer">Unit testing for data pipelines</a>: How to test your data and pipelines in an automated fashion.</li>
<li><a href="https://www.montecarlodata.com/blog-data-lineage/" target="_blank" rel="noopener noreffer">Data lineage and governance</a>: How to get the lineage of your data flow.</li>
<li><a href="https://sixthsense.rakuten.com/blog/Demystifying-Data-Observability-A-Beginners-Guide-for-2025" target="_blank" rel="noopener noreffer">A Beginner’s Guide for Observability</a>. Be sure to learn about <a href="https://atlan.com/data-contracts/" target="_blank" rel="noopener noreffer">Data Contracts</a>, a concept for defining data interfaces between data and business teams.</li>
<li><a href="https://www.informatica.com/resources/articles/what-is-metadata-management.html" target="_blank" rel="noopener noreffer">Metadata Management</a>: Data discovery with data catalogs, ratings of datasets to know which ones are actively used and of good quality. Check also the <a href="https://docs.confluent.io/platform/current/schema-registry/fundamentals/data-contracts.html" target="_blank" rel="noopener noreffer">Schema registry management</a> to handle metadata.</li>
</ul>
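<p>A small pytest sketch of unit testing a transformation; the <code>orders</code> data and the deduplication rule are invented for illustration:</p>
<pre><code class="language-python" data-lang="python"># test_transform.py - run with: pytest test_transform.py
import duckdb

def deduplicate_orders(con):
    """Keep only the latest row per order_id, a typical cleanup step."""
    return con.sql("""
        SELECT * FROM orders
        QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1
    """)

def test_deduplicate_orders_keeps_latest_row():
    con = duckdb.connect()
    con.sql("""
        CREATE TABLE orders AS
        SELECT * FROM (VALUES
            (1, TIMESTAMP '2025-01-01 10:00:00', 'new'),
            (1, TIMESTAMP '2025-01-02 10:00:00', 'paid'),
            (2, TIMESTAMP '2025-01-03 10:00:00', 'new')
        ) AS t(order_id, updated_at, status)
    """)
    result = deduplicate_orders(con).df()
    assert len(result) == 2
    assert set(result["status"]) == {"paid", "new"}
</code></pre>
<p>Testing transformation logic against tiny, hand-written inputs like this catches regressions long before a framework such as Great Expectations checks production data.</p>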
<h3 id="cost-optimization--resource-management">Cost Optimization &amp; Resource Management</h3>
<p>Most of the time, especially if you use cloud solutions, the price you pay for these services is relatively high. Stopping an hourly rebuild of a heavy temp table can therefore save significant costs. Consequently, it&rsquo;s crucial to debug heavy SQL queries and wasted orchestration tasks, including orphaned ones that aren&rsquo;t connected to any upstream datasets or aren&rsquo;t in use anymore.</p>
<p>Stacks that don&rsquo;t run in the cloud are optimized differently: you don&rsquo;t pay for cloud services, you pay to run your own infrastructure. That&rsquo;s why you optimize for your team&rsquo;s time and tasks instead. As data engineering tasks are elaborate, <strong>spending time on the right tasks</strong> can <strong>save a lot of money</strong>, too.</p>
<p>In the past, this was referred to as performance tuning. Back then, we optimized for speed, and that remains the case today. If you maximize performance, you also improve cost efficiency, because workloads run for shorter periods. Over time, this can result in significant savings.</p>
<ul>
<li><a href="https://spot.io/resources/cloud-cost/cloud-cost-optimization-15-ways-to-optimize-your-cloud/" target="_blank" rel="noopener noreffer">Cloud cost monitoring and optimization</a>: Tools to monitor the cost and usage of data engineering tasks.</li>
<li><a href="https://mode.com/sql-tutorial/sql-performance-tuning" target="_blank" rel="noopener noreffer">Performance Tuning</a>: Indexing, partitioning strategies, and caching mechanisms are important components, as is <a href="https://turbo360.com/blog/significance-of-sql-query-consumption-analysis" target="_blank" rel="noopener noreffer">query optimization for better efficiency</a> and lower cost.</li>
<li><a href="https://min.io/product/automated-data-tiering-lifecycle-management" target="_blank" rel="noopener noreffer">Storage tiering and lifecycle management</a></li>
</ul>
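<p>A quick, hedged sketch of how you might inspect a query plan before letting a heavy query run on a schedule; the <code>events</code> table is generated only for the example:</p>
<pre><code class="language-python" data-lang="python">import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE events AS SELECT range AS id, range % 10 AS category FROM range(1000000)")

# EXPLAIN shows the query plan; EXPLAIN ANALYZE also runs the query and reports timings,
# which helps you spot the scans and joins that eat up time (and, in the cloud, money).
print(con.sql("EXPLAIN ANALYZE SELECT category, count(*) FROM events GROUP BY category"))
</code></pre>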
<h3 id="infrastructure-as-code--devops">Infrastructure as Code &amp; DevOps</h3>
<p>Managing infrastructure and deploying new software in an automated fashion typically happens through Infrastructure as Code (IaC) on Kubernetes or a similar platform. That&rsquo;s why it&rsquo;s good to have preliminary knowledge of these tools and when to use them.</p>
<ul>
<li>Docker containerization is a good start; here&rsquo;s <a href="https://www.datacamp.com/tutorial/docker-tutorial" target="_blank" rel="noopener noreffer">a beginner&rsquo;s guide</a>.</li>
<li><a href="https://kubernetes.io/docs/tutorials/" target="_blank" rel="noopener noreffer">Kubernetes</a> and <a href="https://developer.hashicorp.com/terraform/tutorials" target="_blank" rel="noopener noreffer">Terraform</a> basics.</li>
<li><a href="https://medium.com/@mcgeejasond/devops-monitoring-and-logging-explained-939c3b5e17c4" target="_blank" rel="noopener noreffer">Monitoring and logging explained</a>.</li>
<li><a href="https://circleci.com/blog/learn-iac-part02/" target="_blank" rel="noopener noreffer">Advanced CI/CD</a> for deploying entire data stacks and data platforms.</li>
</ul>
<p>That&rsquo;s it. This is a three-week roadmap with numerous courses and links to help you learn data engineering. Let&rsquo;s take a break and dive into the final part, reviewing what we&rsquo;ve learned throughout these three weeks.</p>
<h2 id="congratulations-youve-learned-the-essentials-of-data-engineering">Congratulations, You&rsquo;ve Learned the Essentials of Data Engineering</h2>
<p>This roadmap provides the foundation, but data engineering is a field that requires continuous learning. Stay curious, build projects, and connect with the community. The skills you&rsquo;ve developed here will serve as your starting point into more specialized areas as you grow in your career.</p>
<p>A quick recap of what you have learned. By the end of this 3-week roadmap, you should have covered the key components of data engineering. With a little picking and choosing, it should also have been fun to engage with new, interesting, and previously unfamiliar topics.</p>
<p>By <strong>Week 1</strong>, you learned how to write SQL to query the data you want, and some additional functions that SQL provides that you didn&rsquo;t know before. You know how to safely version control your SQL statements and collaborate with others on them. And you have some basic Linux skills.</p>
<p>After <strong>Week 2</strong>, you can navigate and use a cloud-based data warehouse on one of the major cloud providers of your choice. You learned different ways to model your data and its flow, which Python libraries and helper frameworks are available, and how to write the glue code around SQL and run it with workflow orchestration tools. You also picked up basic analytics skills for presenting data to stakeholders.</p>
<p><strong>Week 3</strong> gives you a rough idea of what real-time data workloads look like and how they differ from batch workloads, and how to keep data quality and costs under control. You should understand how to package production-ready code for deploying scalable data stacks using DevOps tools and methodologies, and you have seen various approaches to architecting an enterprise data platform.</p>
<h3 id="whats-next">What&rsquo;s Next?</h3>
<p>All of it will help you <strong>build your portfolio</strong> and land your dream data engineering role. Each week builds upon the previous, creating a comprehensive learning experience that mirrors real-world data engineering challenges.</p>
<p>Throughout the entire process, it&rsquo;s beneficial to build your online portfolio, where you showcase your data engineering learnings, Git projects, website, and links to hackathons you participated in, among other things that demonstrate your motivation. Above all, sharing is also fun; people will reach out to you after reading your content, especially if they learn from it too.</p>
<p>Remember to take your time learning new concepts. If you give yourself time to digest, you learn more easily, you&rsquo;ll be able to recall specific terms better, and it&rsquo;s easier to connect the knowledge—this is how our brains learn.</p>
<p>Consistency is key. Dedicate 1-2 hours daily for a couple of weeks, and you&rsquo;ll be amazed at what compounding and consistent learning can achieve.</p>
<hr>
<p>I hope you enjoyed this write-up. If so, you may also find the essential toolkit article for data engineers, available in <a href="/blog/data-engineering-toolkit/" rel="">Part 1</a> and <a href="/blog/data-engineering-toolkit-devops-iac/" rel="">Part 2</a>, or check an <a href="https://www.youtube.com/watch?v=3pLKTmdWDXk&amp;t=1s" target="_blank" rel="noopener noreffer">End-To-End Data Engineering Project</a> with Python and DuckDB.</p>
<p>Want more? Check out the <a href="https://motherduck.com/learn-more/" target="_blank" rel="noopener noreffer">Mastering Essentials</a> resources by MotherDuck, or follow their <a href="https://www.youtube.com/@motherduckdb" target="_blank" rel="noopener noreffer">YouTube channel</a> for additional resources. If you like DuckDB and need a cost-efficient data warehouse or data engine, check out <a href="https://app.motherduck.com" target="_blank" rel="noopener noreffer">MotherDuck</a> for free.</p>
<p>Further in-depth content can be found and learned through bootcamps, events, and courses. Please don&rsquo;t give up; it&rsquo;s a lot to take in when you start. Begin with the fundamentals as guided in this roadmap, and also follow your interests. It&rsquo;s better to learn something that might not be suitable right now, but because you are passionate about it, learning comes much more easily. And over time, that knowledge may be put to use at a crucial moment later on.</p>
<hr>
<pre class=""><em>Full article published at <a href="https://motherduck.com/blog/summer-data-engineering-roadmap/" target="_blank" rel="noopener noreferrer">MotherDuck.com</a> - written as part of <a href="/services">my services</a></em></pre>
]]></description>
</item>
<item>
    <title>The Data Engineering Toolkit: Infrastructure, DevOps, and Beyond</title>
    <link>https://www.ssp.sh/blog/data-engineering-toolkit-devops-iac/</link>
    <pubDate>Thu, 10 Jul 2025 10:34:08 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/data-engineering-toolkit-devops-iac/</guid><enclosure url="https://www.ssp.sh/blog/data-engineering-toolkit-devops-iac/featured-image.jpeg" type="image/jpeg" length="0" /><description><![CDATA[<p>Remember when data scientists spent 80% of their time wrestling with data wrangling instead of building models?</p>
<p>I&rsquo;d argue that today&rsquo;s data engineers face similar challenges, but with the added complexity of infrastructure setup. We&rsquo;re architects of entire data ecosystems, orchestrating everything from real-time pipelines to AI workflows. The secret? Infrastructure as Code and DevOps principles that transform scattered server management into elegant, declarative configurations.</p>
<p>The catch is that while abstractions have made complex deployments more accessible, the toolkit has exploded in scope. One day, you&rsquo;re optimizing SQL queries, the next, you&rsquo;re debugging Kubernetes deployments, and by afternoon, you&rsquo;ll be explaining data quality metrics to stakeholders who just want to know why their dashboard is empty.</p>
<p>This is Part 2 of my in-depth exploration of the modern data engineer&rsquo;s toolkit. While <a href="/blog/data-engineering-toolkit" rel="">Part 1</a> covered the fundamentals of your development environment, programming languages, and core productivity tools, this essay addresses the more advanced technologies—such as data processing, infrastructure, data quality, and observability—required to transform data pipelines into production-grade data platforms.</p>
<p>We&rsquo;ll explore everything from SQL engines and workflow orchestration that form your daily toolkit to DevOps practices that make your deployments bulletproof, and the advanced utility tools that help you sleep better at night. Additionally, we&rsquo;ll explore the soft skills that can make the difference between a data engineer and a data engineering leader.</p>
<div class="details admonition warning open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-warning"></i>Disclaimer<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">Not every tool link or tool here belongs in every data engineer&rsquo;s toolkit. Your domain, company size, and tech stack will heavily influence what matters most for you. This is a curated collection from years of working in the field, covering the wide range of what you might use at some point.</div>
        </div>
    </div>
<h2 id="data-processing-and-analytics">Data Processing and Analytics</h2>
<p>Continuing from the developer productivity and data engineering programming languages discussed in Part I, we have data processing and analytics technologies that are at the core of data engineering. SQL, relational databases, and BI tools are the bread and butter of everyday work, and Python is the glue language that ties everything together.</p>
<p>But most of the time, we must also set up a project that connects all the dots through orchestration, whether it&rsquo;s a simple cron job or Python script.</p>
<h3 id="sql-and-databases">SQL and Databases</h3>
<p>SQL is the <strong>language of data</strong>. SQL is a <strong>fundamental skill</strong> for doing any data work. There&rsquo;s almost nothing you do without needing SQL. If you work with a REST API with no direct SQL interface, it&rsquo;s still beneficial to know, as the REST service will most certainly perform a SQL query against the database based on your REST request.</p>
<p>With that said, what SQL engines and databases do data engineers use?</p>
<p>The most common <strong>relational databases</strong> are also called <a href="https://en.wikipedia.org/wiki/Online_transaction_processing" target="_blank" rel="noopener noreffer">OLTP</a> databases:</p>
<ul>
<li><strong><a href="https://github.com/sqlite/sqlite" target="_blank" rel="noopener noreffer">SQLite</a></strong>: A single-file database that is very handy for web development or when you need a database that can go with the code to avoid long latency for network or fetching and pushing data.</li>
<li><strong><a href="https://github.com/postgres/postgres" target="_blank" rel="noopener noreffer">Postgres</a></strong>: Perfect for any transactional and smallish data, but also scales up relatively high.</li>
<li><strong><a href="https://www.mysql.com/" target="_blank" rel="noopener noreffer">MySQL</a></strong> / <strong><a href="https://github.com/MariaDB/server" target="_blank" rel="noopener noreffer">MariaDB</a></strong>: Wide adoption before Postgres, good performance. MariaDB forked from MySQL around the Oracle acquisition of MySQL (acquired through the Sun purchase).</li>
</ul>
<p>Analytical databases that speak SQL - also called <a href="https://en.wikipedia.org/wiki/Online_analytical_processing" target="_blank" rel="noopener noreffer">OLAP</a> - are optimized for fast query responses:</p>
<ul>
<li><strong><a href="https://github.com/duckdb/duckdb" target="_blank" rel="noopener noreffer">DuckDB</a></strong>: A single-file OLAP database, optimized for analytical queries.</li>
<li><strong><a href="https://motherduck.com/" target="_blank" rel="noopener noreffer">MotherDuck</a></strong>: Scaled out DuckDB in the cloud, DWH in minutes.</li>
<li><strong><a href="https://github.com/ClickHouse/ClickHouse" target="_blank" rel="noopener noreffer">ClickHouse</a></strong>: A fast analytical (OLAP) database.</li>
<li><strong><a href="https://github.com/StarRocks/starrocks" target="_blank" rel="noopener noreffer">StarRocks</a></strong>: A newer fast analytical database, focusing on making data-intensive real-time analytics easy.</li>
<li>Cloud Data Warehouses: <strong>Snowflake</strong>, <strong>BigQuery</strong>, <strong>Redshift</strong>, <strong>Azure Fabric</strong></li>
</ul>
<p>Database utilities that help us with both:</p>
<ul>
<li><strong><a href="https://github.com/duckdb/pg_duckdb" target="_blank" rel="noopener noreffer">pg_duckdb</a></strong>: small library and plugin to make Postgres work with DuckDB, mainly extending Postgres with analytical features.</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Java_Database_Connectivity" target="_blank" rel="noopener noreffer">JDBC</a></strong> / <strong><a href="https://en.wikipedia.org/wiki/Open_Database_Connectivity" target="_blank" rel="noopener noreffer">ODBC</a></strong> and newer versions <strong><a href="https://github.com/apache/arrow-adbc" target="_blank" rel="noopener noreffer">Arrow-ADBC</a></strong>.</li>
<li><strong><a href="https://github.com/apache/calcite" target="_blank" rel="noopener noreffer">Apache Calcite</a></strong>: SQL parser and query optimization framework</li>
</ul>
<h3 id="python-processing-tools">Python Processing Tools</h3>
<p>Python, on the other hand, is the ultimate toolkit language. Pulling data from a REST API or the web, cleaning out bad records, and storing the result in Postgres: how would you do that in a safe, ordered fashion? Right, Python. It allows you to easily reach an API and automate tasks that Bash can&rsquo;t handle.</p>
<p>Besides the <a href="https://motherduck.com/blog/data-engineering-toolkit-essential-tools#python-libraries" target="_blank" rel="noopener noreffer">generic Python libraries</a> in Part 1, here are Python data processing libraries, potentially lesser-known, and suitable for advanced use-cases:</p>
<ul>
<li><strong><a href="https://github.com/ibis-project/ibis" target="_blank" rel="noopener noreffer">Ibis</a></strong>: It provides a lightweight, universal interface for data wrangling, helping explore and transform data of any size, stored anywhere.</li>
<li><strong><a href="https://github.com/dask/dask" target="_blank" rel="noopener noreffer">Dask</a></strong>: A flexible library for parallel computing. Dask scales Python code from multi-core local machines to large distributed clusters in the cloud.</li>
<li><strong><a href="https://github.com/sfu-db/connector-x" target="_blank" rel="noopener noreffer">ConnectorX</a></strong>: The fastest library to load data from the database to DataFrames.</li>
<li><strong><a href="https://modal.com/docs" target="_blank" rel="noopener noreffer">Modal</a></strong>: A cloud function platform that lets you run any code remotely within seconds.</li>
<li><strong><a href="https://github.com/erezsh/reladiff" target="_blank" rel="noopener noreffer">reladiff</a></strong> (formerly <a href="https://github.com/datafold/data-diff" target="_blank" rel="noopener noreffer">data-diff</a> by Datafold): Tool to efficiently diff rows across databases</li>
<li><strong><a href="https://marsupialtail.github.io/quokka/" target="_blank" rel="noopener noreffer">Quokka</a></strong>: An open-source push-based vectorized query engine.</li>
<li><strong><a href="https://github.com/vaexio/vaex" target="_blank" rel="noopener noreffer">Vaex</a></strong>: High-performance library for lazy out-of-core DataFrames, to visualize and explore big tabular datasets.</li>
<li><strong><a href="https://github.com/xorq-labs/xorq" target="_blank" rel="noopener noreffer">Xorq</a></strong>: A declarative framework for building multi-engine computations.</li>
<li><strong><a href="https://github.com/burnash/gspread" target="_blank" rel="noopener noreffer">gspread</a></strong>: Work with Google Sheets through Python API, or <a href="https://duckdb.org/community_extensions/extensions/gsheets.html" target="_blank" rel="noopener noreffer">with DuckDB</a>.</li>
</ul>
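<p>A minimal Ibis sketch, assuming a recent Ibis version with its default DuckDB backend; the data is invented, and the same expression compiles to SQL for other backends too:</p>
<pre><code class="language-python" data-lang="python">import ibis

# Hypothetical in-memory data; the expression below would work unchanged
# against DuckDB, Postgres, BigQuery, and other supported backends.
t = ibis.memtable(
    {"customer": ["alice", "alice", "bob"], "amount": [120.0, 80.0, 200.0]}
)

expr = t.group_by("customer").aggregate(revenue=t.amount.sum())

print(ibis.to_sql(expr))  # inspect the SQL Ibis would generate
print(expr.to_pandas())   # execute via the default DuckDB backend
</code></pre>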
<p>Want more? Check out the <a href="https://github.com/vinta/awesome-python" target="_blank" rel="noopener noreffer">Awesome Python List</a> with thousands of more frameworks, libraries, software, and resources.</p>
<h3 id="workflow-orchestration-platforms">Workflow Orchestration Platforms</h3>
<p>A key tool, often used within Python, are data orchestrators. These orchestrate the workflow of data processes in certain needed steps.</p>
<p>These are typically in Python, such as <strong><a href="https://github.com/apache/airflow" target="_blank" rel="noopener noreffer">Apache Airflow</a></strong>, <strong><a href="https://github.com/dagster-io/dagster" target="_blank" rel="noopener noreffer">Dagster</a></strong>, <strong><a href="https://github.com/PrefectHQ/prefect" target="_blank" rel="noopener noreffer">Prefect</a></strong>. But there are also others, such as <strong><a href="https://github.com/temporalio/temporal" target="_blank" rel="noopener noreffer">Temporal</a></strong>, <strong><a href="https://github.com/kestra-io/kestra" target="_blank" rel="noopener noreffer">Kestra</a></strong>, <strong><a href="https://github.com/mage-ai/mage-ai" target="_blank" rel="noopener noreffer">Mage</a></strong>, <strong><a href="https://github.com/argoproj/argo-workflows" target="_blank" rel="noopener noreffer">Argo Workflows</a></strong>, <strong><a href="https://github.com/flyteorg/flyte" target="_blank" rel="noopener noreffer">Flyte</a></strong>, and many more.</p>
<h3 id="analytics-and-bi">Analytics and BI</h3>
<p>Besides relational databases, SQL, and Python, in all cases, you want to present the data to your users or stakeholders. This is where BI tools, Notebooks, and data apps for visualization come into play.</p>
<p>There&rsquo;s <a href="https://github.com/thenaturalist/awesome-business-intelligence" target="_blank" rel="noopener noreffer">plenty out there</a>, but here are the major ones and my favorites:</p>
<ul>
<li><strong><a href="https://github.com/apache/superset" target="_blank" rel="noopener noreffer">Apache Superset</a></strong>: Original open-source BI tool.</li>
<li><strong><a href="https://github.com/rilldata/rill" target="_blank" rel="noopener noreffer">Rill</a></strong>: Open-source and BI-as-Code platform.</li>
<li><strong><a href="https://www.microsoft.com/en-us/power-platform/products/power-bi" target="_blank" rel="noopener noreffer">PowerBI</a></strong>: Microsoft&rsquo;s business intelligence platform.</li>
<li><strong><a href="https://omni.co/" target="_blank" rel="noopener noreffer">Omni</a></strong>: Business intelligence platform that helps companies explore data with a point-and-click UI, spreadsheets, AI, or SQL.</li>
<li><strong><a href="https://www.sigmacomputing.com/" target="_blank" rel="noopener noreffer">Sigma Computing</a></strong>: Next-generation analytics and business intelligence platform with SQL in a familiar spreadsheet interface.</li>
<li><strong><a href="https://github.com/lightdash/lightdash" target="_blank" rel="noopener noreffer">Lightdash</a></strong>: Instantly turn your dbt project into a full-stack BI platform.</li>
<li><strong><a href="https://www.tableau.com/business-intelligence" target="_blank" rel="noopener noreffer">Tableau</a></strong>: An Enterprise BI tool that has existed for a long time, with powerful ETL and other features.</li>
<li><strong><a href="https://www.targit.com/" target="_blank" rel="noopener noreffer">TARGIT</a></strong>: Enterprise BI solution specializing in industry-specific implementations in the Nordics.</li>
</ul>
<p>Beyond BI tools, there are also notebooks:</p>
<ul>
<li><strong><a href="https://github.com/jupyter/notebook" target="_blank" rel="noopener noreffer">Jupyter Notebook</a></strong> / <strong><a href="https://zeppelin.apache.org/" target="_blank" rel="noopener noreffer">Zeppelin</a></strong>, <strong><a href="https://github.com/marimo-team/marimo" target="_blank" rel="noopener noreffer">Marimo</a></strong>: Open-Source notebooks</li>
<li><strong><a href="https://hex.tech/" target="_blank" rel="noopener noreffer">Hex</a></strong>, <strong><a href="https://deepnote.com/" target="_blank" rel="noopener noreffer">Deepnote</a></strong>, <strong><a href="https://motherduck.com/docs/getting-started/interfaces/motherduck-quick-tour/" target="_blank" rel="noopener noreffer">MotherDuck Notebook</a></strong>: Closed-source</li>
<li>More exotic ones: <strong><a href="https://count.co/" target="_blank" rel="noopener noreffer">Count</a></strong> (canva style), <strong><a href="https://www.quadratichq.com/" target="_blank" rel="noopener noreffer">Quadratic</a></strong> (spreadsheet style), <strong><a href="https://www.notboring.co/p/excel-never-dies" target="_blank" rel="noopener noreffer">Excel</a></strong> (mother of BI tools)</li>
</ul>
<h2 id="devops-and-infrastructure">DevOps and Infrastructure</h2>
<p>Once you have a setup with integration, orchestration, and visualization, you usually need to scale it or deploy it to internal cloud servers or one of the major cloud providers. You typically use something more than plain Docker Compose or a quick <a href="https://github.com/astral-sh/uv" target="_blank" rel="noopener noreffer"><code>uv init</code></a> for setting up all relevant Python settings. Usually, it involves Kubernetes, <a href="https://github.com/hashicorp/terraform" target="_blank" rel="noopener noreffer">Terraform</a>, or Infrastructure as Code.</p>
<p>Either you pay for a service to do that for you, or if you have chosen a set of open-source tools, you mostly end up doing it yourself.</p>
<p>Popular frameworks such as Terraform, Helm, and Ansible, along with other scripts, let you deploy to any cloud. Typically, a Kubernetes cluster is the deployment target. It&rsquo;s the de facto standard for cloud-agnostic deployment and works well for data engineering projects, as you declaratively define the state you&rsquo;d like your data platform to have. Kubernetes matches that state and provisions the right amount of servers, CPU, memory, and so on to make it runnable and scalable on any cloud.</p>
<p>Most of the time, it includes setting up an automated CI/CD pipeline that handles automated testing, deployment, version control, and all the software engineering best practices for data engineering.</p>
<h3 id="infrastructure-as-code-gitops-and-dataops">Infrastructure as Code, GitOps, and DataOps</h3>
<p>DevOps has become a bigger part of data engineers&rsquo; work in most scenarios in recent years, making deployment of every updated OSS tool straightforward, easy to test, and reproducible.</p>
<p>Another goal is making the data stack <strong>modular</strong>, so that additional tools can be added with a clearly defined path for integration: shared metadata, centralized logging, and security, so user permissions can be granted to existing users without needing to re-create them every single time. This usually involves integration with <a href="https://github.com/keycloak/keycloak" target="_blank" rel="noopener noreffer">Keycloak</a>, <a href="https://www.okta.com/" target="_blank" rel="noopener noreffer">Okta</a>, or <a href="https://auth0.com/" target="_blank" rel="noopener noreffer">Auth0</a>. A good example of such an integrated data stack is <a href="https://github.com/kanton-bern/hellodata-be" target="_blank" rel="noopener noreffer">HelloData</a>, but there are more—see <a href="https://sh.reddit.com/r/dataengineering/comments/1g50jwi/should_we_use_a_declarative_data_stack/" target="_blank" rel="noopener noreffer">declarative data stacks</a>.</p>
<p>But why would you invest all this energy and effort to have something run on Kubernetes? Besides the declarative approach mentioned, which is more robust than <a href="https://www.ssp.sh/brain/imperative/" target="_blank" rel="noopener noreffer">imperative</a> approaches that tend to break down more often, especially for large projects, Kubernetes has significant advantages. The DevOps-style deployment fosters a culture of collaboration and shared responsibility through configuration YAML files checked into a git repo, which is pivotal for how data teams can work with an efficient workflow and increase productivity.</p>
<div class="details admonition tip open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-tip"></i>Why YAML for DevOps: Descriptive configs<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">There are a few advantages to using YAML files. The changes become more structured as we have a <strong>straightforward</strong> interface for each fix, which makes them more <strong>maintainable</strong>. They are easy to read, modify, and incrementally test instead of loose SQL files that are very complex. They are easily portable between programming languages.</div>
        </div>
    </div>
<p>This way of working is called Infrastructure as Code, or <a href="https://kestra.io/blogs/2024-02-06-gitops" target="_blank" rel="noopener noreffer">GitOps</a>, and is strongly related to <a href="https://en.wikipedia.org/wiki/DataOps" target="_blank" rel="noopener noreffer">DataOps</a>. So, what are the toolkits for DevOps, you might ask?</p>
<p><strong>Container &amp; Orchestration</strong>:</p>
<ul>
<li><strong><a href="https://kubernetes.io/" target="_blank" rel="noopener noreffer">Kubernetes</a></strong> (k8s): De facto standard for container orchestration that provides scalable, cloud-agnostic deployment with declarative infrastructure management.
<ul>
<li><strong><a href="https://www.redhat.com/en/technologies/cloud-computing/openshift" target="_blank" rel="noopener noreffer">Red Hat OpenShift</a></strong>: Enterprise Kubernetes platform with integrated developer tools, security features, and multi-cloud capabilities.</li>
<li><strong><a href="https://kubernetes.io/docs/reference/kubectl/" target="_blank" rel="noopener noreffer">kubectl</a></strong>: Command-line tool for managing Kubernetes clusters and debugging containerized data pipelines</li>
<li><strong><a href="https://github.com/kubernetes-sigs/kustomize/" target="_blank" rel="noopener noreffer">Kustomize</a></strong>: Configuration management tool for Kubernetes that allows environment-specific customizations without template complexity</li>
</ul>
</li>
<li><strong><a href="https://helm.sh/" target="_blank" rel="noopener noreffer">Helm</a></strong>: Package manager for Kubernetes that simplifies the deployment of complex data stack applications with reusable charts</li>
<li><strong><a href="https://www.docker.com/" target="_blank" rel="noopener noreffer">Docker</a></strong>: A containerization platform that ensures consistent environments across development, testing, and production for data engineering workloads</li>
</ul>
<p><strong>Infrastructure as Code (IaC)</strong>:</p>
<ul>
<li><strong><a href="https://github.com/hashicorp/terraform" target="_blank" rel="noopener noreffer">Terraform</a></strong>: A multi-cloud infrastructure provisioning tool that enables versioned, reproducible cloud resource management for data platforms</li>
<li><strong><a href="https://github.com/pulumi/pulumi" target="_blank" rel="noopener noreffer">Pulumi</a></strong>: Modern IaC platform supporting multiple programming languages for infrastructure definition with strong typing and testing capabilities</li>
<li><strong><a href="https://www.ansible.com/" target="_blank" rel="noopener noreffer">Ansible</a></strong>: A configuration management and automation tool that handles server provisioning, application deployment, and system administration tasks</li>
<li><strong><a href="https://koreo.dev/" target="_blank" rel="noopener noreffer">Koreo</a></strong>: A new approach to Kubernetes configuration management and resource orchestration, empowering developers through programmable workflows and structured data</li>
</ul>
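<p>As a small illustration of the Terraform workflow mentioned in the list above, the typical loop is init, plan, apply: you review the generated plan before any change hits the cloud account. The directory layout is hypothetical:</p>
<pre><code class="language-sh"># Sketch: typical Terraform loop for a data platform&#39;s infrastructure
cd infra/terraform
terraform init                  # download providers and set up the state backend
terraform plan -out=tf.plan     # preview changes against the current state
terraform apply tf.plan         # apply exactly the plan that was reviewed
</code></pre>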
<p><strong>GitOps &amp; CD Tools</strong>:</p>
<ul>
<li><strong><a href="https://github.com/argoproj/argo-cd" target="_blank" rel="noopener noreffer">ArgoCD</a></strong>: Declarative GitOps continuous delivery tool for Kubernetes that automatically syncs cluster state with Git repositories</li>
<li><strong><a href="https://github.com/fluxcd/flux2" target="_blank" rel="noopener noreffer">Flux</a></strong>: GitOps toolkit for keeping Kubernetes clusters synchronized with Git repository configurations using pull-based deployment</li>
<li><strong><a href="https://octopus.com/" target="_blank" rel="noopener noreffer">Octopus Deploy</a></strong>: Advanced deployment automation platform for complex multi-environment releases with approval workflows</li>
</ul>
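<p>As a rough sketch of the ArgoCD flow referenced above (repository URL, path, and app name are placeholders): you register an application that points at a Git path and let ArgoCD keep the cluster in sync with it:</p>
<pre><code class="language-sh"># Sketch: register and sync an application with the ArgoCD CLI
argocd app create my-data-stack \
  --repo https://github.com/example/data-platform.git \
  --path k8s/overlays/prod \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace data-platform

argocd app sync my-data-stack    # reconcile cluster state with Git
argocd app get my-data-stack     # inspect sync and health status
</code></pre>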
<p><strong>CI/CD Platforms</strong>:</p>
<ul>
<li><strong><a href="https://docs.github.com/en/actions" target="_blank" rel="noopener noreffer">GitHub Actions</a></strong>: Native GitHub CI/CD platform with an extensive marketplace ecosystem for automated testing and deployment workflows</li>
<li><strong><a href="https://docs.gitlab.com/ee/ci/" target="_blank" rel="noopener noreffer">GitLab CI/CD</a></strong>: Integrated DevOps platform providing end-to-end automation from code to deployment with built-in security scanning</li>
<li><strong><a href="https://www.jenkins.io/" target="_blank" rel="noopener noreffer">Jenkins</a></strong>: Open-source automation server with controller/agent architecture ideal for complex, customizable build and deployment pipelines</li>
<li><strong><a href="https://circleci.com/" target="_blank" rel="noopener noreffer">CircleCI</a></strong>: Cloud-native CI/CD platform known for fast build times and Docker-first approach to testing data engineering workflows</li>
<li><strong><a href="https://www.atlassian.com/software/bamboo" target="_blank" rel="noopener noreffer">Bamboo</a></strong>: Atlassian&rsquo;s CI/CD tool with tight integration to Jira and Bitbucket for teams already using the Atlassian ecosystem</li>
</ul>
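<p>For a small taste of the GitHub Actions entry above, the GitHub CLI can trigger and follow a workflow run from the terminal; the workflow file name is a placeholder:</p>
<pre><code class="language-sh"># Sketch: trigger a CI workflow and follow its progress with the GitHub CLI
gh workflow run ci.yml --ref main    # kick off the workflow on the main branch
gh run list --workflow ci.yml        # list recent runs of that workflow
gh run watch                         # interactively pick a run and follow it live
</code></pre>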
<p><strong>Security &amp; Secrets Management</strong>:</p>
<ul>
<li><strong><a href="https://github.com/getsops/sops" target="_blank" rel="noopener noreffer">SOPS</a></strong>: Encrypted secrets management tool that works with PGP/age keys to secure sensitive configuration data in Git repositories</li>
<li><strong><a href="https://github.com/hashicorp/vault" target="_blank" rel="noopener noreffer">HashiCorp Vault</a></strong>: A dynamic secrets management system for secure storage and access to tokens, passwords, and certificates</li>
</ul>
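<p>To illustrate the SOPS entry above: secrets stay encrypted in the Git repository and are only decrypted at deploy time. A hedged sketch with the age backend, where the key and file names are placeholders:</p>
<pre><code class="language-sh"># Sketch: encrypt a secrets file with SOPS and an age public key
sops --encrypt --age age1examplepublickey... secrets.yaml &gt; secrets.enc.yaml

# Decrypt again at deploy time (requires the matching private key)
sops --decrypt secrets.enc.yaml
</code></pre>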
<p>There are many more, but these are some of the first tools you will encounter when you start scaling out your data platform on Kubernetes and adopt modern DevOps practices. They keep a data engineering platform maintainable and scalable, with reproducible deployments and efficient collaboration across development and operations teams.</p>
<div class="details admonition warning open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-warning"></i>Setting up GitOps is hard and shouldn&#39;t be underestimated<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">Setting up GitOps for new data projects is hard and best done by having a central team specializing in deployment and operations with deep knowledge. These teams help you deploy a new data engineering tool in days rather than weeks. Data engineers and other personnel can then focus on their core workload.</div>
        </div>
    </div>
<div class="details admonition tip open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-tip"></i>TUIs for more efficiency (and fun!)<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">If you use git or Docker frequently, please check out <a href="https://github.com/jesseduffield/lazygit" target="_blank" rel="noopener noreffer">Lazygit</a>, <a href="https://github.com/jesseduffield/lazydocker" target="_blank" rel="noopener noreffer">Lazydocker</a>, and <a href="https://github.com/derailed/k9s" target="_blank" rel="noopener noreffer">k9s</a>. These are TUIs that show you all commands within a single command. Instead of remembering or typing long commands, you can just use a graphical user interface in the terminal and navigate with the keyboard.</div>
        </div>
    </div>
<h3 id="devops-abstraction-levels">DevOps Abstraction Levels</h3>
<p>What are the alternatives to DevOps?</p>
<p>DevOps isn&rsquo;t binary; it&rsquo;s about selecting the appropriate level of control and abstraction for your specific needs. You&rsquo;re still practicing DevOps, whether you&rsquo;re managing Kubernetes clusters or deploying serverless functions; you&rsquo;re just operating at different levels of abstraction.</p>
<p><strong>Serverless and Managed Services</strong> represent the highest abstraction level, where you focus purely on your data logic while the platform handles infrastructure concerns. Tools like AWS Lambda, Google Cloud Functions, and managed data warehouses let you deploy code and query data without worrying about servers, scaling, or maintenance. Your application remains portable, with core business logic that can typically be moved between providers, but you trade some customization for operational simplicity.</p>
<p><strong>Container-as-a-Service (CaaS)</strong> platforms, such as Google Cloud Run, AWS Fargate, or Azure Container Instances, offer a middle ground. You containerize your applications (maintaining portability) but delegate orchestration complexity to the platform. You still get the benefits of DevOps practices—version control, automated deployments, Infrastructure as Code—without managing the underlying infrastructure.</p>
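<p>As a hedged example of the CaaS level (service name, image, and region are placeholders), deploying a containerized service to Google Cloud Run is essentially one command; the platform handles scaling and the underlying infrastructure:</p>
<pre><code class="language-sh"># Sketch: deploy a containerized ingestion service to Cloud Run
gcloud run deploy my-ingestion-service \
  --image europe-docker.pkg.dev/my-project/pipelines/ingestion:latest \
  --region europe-west1 \
  --memory 2Gi
</code></pre>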
<p><strong>Managed Kubernetes</strong> services, such as Google GKE, Azure AKS, and AWS EKS, provide another abstraction layer, offering full Kubernetes capabilities without requiring control plane management. This bridges the gap between complete infrastructure control and operational simplicity.</p>
<p>The key is matching your abstraction level to your team&rsquo;s expertise and requirements. Start with higher abstraction levels for faster delivery, then move toward more control only when specific customizations become necessary.</p>
<h2 id="data-quality-and-observability">Data Quality and Observability</h2>
<p>As the data platform becomes more complex and features additional tools, it becomes increasingly sensible to have a data quality or observability stack—tools to have an automated overview of the <strong>health of your data platform</strong>.</p>
<p>Below are some of the standard tools (without getting too lengthy) that we haven&rsquo;t covered and were not mentioned in Part 1:</p>
<ul>
<li><strong><a href="https://www.elastic.co/elastic-stack" target="_blank" rel="noopener noreffer">ELK Stack</a></strong>: Elasticsearch, Kibana, and Logstash. Reliably and securely take data from any source, in any format, then search, analyze, and visualize.</li>
<li><strong><a href="https://github.com/prometheus/prometheus" target="_blank" rel="noopener noreffer">Prometheus</a></strong>: Open-source monitoring system and time series database.</li>
<li><strong><a href="https://www.datadoghq.com/" target="_blank" rel="noopener noreffer">DataDog + Metaplane</a></strong>: Monitoring and security platform for developers, IT operations teams, and business users in the cloud age. DataDog recently acquired <a href="https://www.metaplane.dev/" target="_blank" rel="noopener noreffer">Metaplane</a>, an end-to-end data observability platform that catches silent data quality issues before they impact your business.</li>
<li><strong><a href="https://www.datafold.com/data-quality-monitoring" target="_blank" rel="noopener noreffer">Datafold</a></strong>: Comprehensive data monitoring to prevent downtime and detect data quality issues early.</li>
<li><strong><a href="https://www.soda.io/" target="_blank" rel="noopener noreffer">Soda</a></strong>: Soda is a data quality testing solution, with parts of it <a href="https://github.com/sodadata/soda-core" target="_blank" rel="noopener noreffer">open-source</a>, like data quality testing for the modern data stack (SQL, Spark, and Pandas).</li>
<li><strong><a href="https://www.montecarlodata.com/" target="_blank" rel="noopener noreffer">Monte Carlo</a></strong>: Enterprise-ready with extensive data lake integrations</li>
<li><strong><a href="https://www.bigeye.com/" target="_blank" rel="noopener noreffer">Bigeye</a></strong>: ML-driven automatic threshold tests and alerts</li>
</ul>
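<p>Observability stacks usually expose an HTTP API you can query directly. As a small example for the Prometheus entry above, assuming a local Prometheus on its default port, you can check which scrape targets are up straight from the terminal:</p>
<pre><code class="language-sh"># Sketch: query the Prometheus HTTP API for the &#39;up&#39; metric of all scrape targets
curl -s &#39;http://localhost:9090/api/v1/query?query=up&#39;
</code></pre>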
<h2 id="ai-enhanced-workflow-development">AI-Enhanced Workflow Development</h2>
<p>New AI-enhanced tools built on LLMs and MCP (Model Context Protocol) servers appear constantly, and some are already useful today.</p>
<p>For data engineers, for example, there are dedicated IDEs and MCP integrations, especially for <strong>agentic workflows</strong>:</p>
<ul>
<li><strong><a href="https://getnao.io/" target="_blank" rel="noopener noreffer">nao</a></strong>: An AI-enhanced editor specifically for data engineers. In its early days, it understands dbt and can create and run pipelines.</li>
<li><strong><a href="https://github.com/motherduckdb/mcp-server-motherduck" target="_blank" rel="noopener noreffer">MCP server for DuckDB and MotherDuck</a></strong>: Makes your editor autonomously query the underlying database on the fly.</li>
<li><strong><a href="https://docs.anthropic.com/en/docs/claude-code/overview" target="_blank" rel="noopener noreffer">Claude Code</a></strong>: An agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster through natural language commands.</li>
<li><strong><a href="https://github.com/dbt-labs/dbt-mcp" target="_blank" rel="noopener noreffer">dbt MCP</a></strong>: A MCP server provides tools to interact with dbt autonomously, like running dbt build or docs, etc.</li>
<li><strong><a href="https://docs.rilldata.com/explore/mcp" target="_blank" rel="noopener noreffer">Rill MCP Server</a></strong>: Exposes Rill&rsquo;s most essential APIs to LLMs. It is currently designed primarily for data analysts.</li>
</ul>
<p>Also, check out <a href="https://www.youtube.com/watch?v=yG1mv8ZRxcU&amp;t=1s" target="_blank" rel="noopener noreffer">Faster Data Pipeline Development with MCP and DuckDB</a>, which explains MCP in more detail and directly showcases some of the use cases.</p>
<div class="details admonition info open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-info"></i>Agentic Workflows vs. AI Agents<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">One word on the distinction between an agent and a workflow. Anthropic defines <em>Workflows</em> as systems where Large Language Models and tools are orchestrated through predefined code paths. And <em>Agents</em> are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.</div>
        </div>
    </div>
<h2 id="soft-skill-communication-business-requirements">Soft Skill: Communication, Business Requirements</h2>
<p>As AI workflows reduce the need for coding, business acumen and soft skills become even more crucial. This section focuses on the human aspect of communication within the organization or among team members, and gathering the right <strong>business requirements</strong> before developing a platform or solution that may not be needed in the first place.</p>
<h3 id="essential-soft-skills">Essential Soft Skills</h3>
<p>Business <strong>understanding</strong> is crucial for practical data engineering. This means being genuinely interested in business nuances, actively listening to domain experts, and developing strong <strong>communication</strong> skills for requirements engineering, which significantly overlaps with traditional BI engineering roles.</p>
<p>Cross-functional <strong>collaboration</strong> is equally important. Data engineers must translate technical constraints and possibilities into business terms for stakeholders, while also understanding their pain points and priorities. This includes stakeholder management, <strong>documentation skills</strong>, and the ability to ask the right questions to uncover hidden requirements and assumptions.</p>
<p>While you can be a technical expert without these skills, combining technical expertise with strong business understanding and communication will set you apart. It helps you solve real business problems and deliver measurable value—something we should always keep in mind.</p>
<h2 id="building-your-data-engineering-toolkit">Building Your Data Engineering Toolkit</h2>
<p>Wrapping up these two articles on the in-depth toolkit for data engineers, I hope you&rsquo;ve learned a tool or two that will improve your workflow as a data engineer or in the data field.</p>
<p>Hopefully, you won&rsquo;t be overwhelmed by all the links. Again, it&rsquo;s not meant to be a toolkit for everyone, but instead provides pointers for the direction you&rsquo;d like to explore when starting or when you want to venture into a slightly different area of data engineering.</p>
<p>We&rsquo;ve gone from fundamental to advanced DevOps skills and learned along the way:</p>
<ul>
<li>Developer tools and programming languages in Part 1 and the sophisticated ecosystem of modern data engineering in Part 2.</li>
<li>SQL databases and Python as your foundational toolkit, with analytics and BI platforms for presenting insights.</li>
<li>DevOps and Infrastructure as Code for scalable deployments with Kubernetes.</li>
<li>Data quality and observability solutions for maintaining platform health.</li>
<li>Emerging AI-enhanced workflows that are reshaping how we build data pipelines.</li>
<li>Technical expertise alone isn&rsquo;t always enough; strong communication skills and business understanding transform data engineers into 10x contributors, delivering real value to the business.</li>
</ul>
<p>If you want to learn more tips and tricks about the toolset, please follow the <a href="https://motherduck.com/duckdb-news/" target="_blank" rel="noopener noreffer">MotherDuck newsletter</a> for the latest news about DuckDB, which usually contains great insights and tools for working with data through DuckDB or MotherDuck. You can also <a href="https://app.motherduck.com/" target="_blank" rel="noopener noreffer">try MotherDuck</a>, which allows you to handle many data use cases in a notebook environment with many of the tools mentioned in these articles.</p>
<p>If you have a toolkit you use every day as a data engineer or a unique tool that cannot be found in the two parts, please let me know on social media in the comments. I&rsquo;d be happy to know what you use as your core toolkit for everyday work.</p>
<hr>
<pre class=""><em>Full article published at <a href="https://motherduck.com/blog/data-engineering-toolkit-infrastructure-devops/" target="_blank" rel="noopener noreferrer">MotherDuck.com</a> - written as part of <a href="/services">my services</a></em></pre>
]]></description>
</item>
<item>
    <title>The Data Engineering Toolkit: Essential Tools for Your Machine</title>
    <link>https://www.ssp.sh/blog/data-engineering-toolkit/</link>
    <pubDate>Wed, 22 Jan 2025 17:34:08 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/data-engineering-toolkit/</guid><enclosure url="https://www.ssp.sh/blog/data-engineering-toolkit/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>To be proficient as a data engineer, you need to know various toolkits—from fundamental Linux commands to different virtual environments and optimizing efficiency as a data engineer.</p>
<p>This article focuses on the building blocks of data engineering work, such as operating systems, development environments, and essential tools. We&rsquo;ll start from the ground up—exploring crucial Linux commands, containerization with Docker, and the development environments that make modern data engineering possible. We look at current programming languages and how they influence our work—providing a comprehensive overview of the tools of a modern data engineer.</p>
<hr>
<p>Before we start, you don&rsquo;t need to know everything discussed here, but over time, you may use all of them in various roles as a data engineer at different companies. I hope this article will give you a good overview and guidelines on what is essential and what is not.</p>
<p>Again, each selection might differ slightly depending on the company&rsquo;s setup, preferred vendors, and whether it uses a low-code or a building approach. Let&rsquo;s start with the first choice you must make at any company: the operating system to work on.</p>
<h2 id="operating-systems--environment">Operating Systems &amp; Environment</h2>
<p>Before starting as a data engineer, your laptop, operating system (OS), and environment are your first choices. Here, we discuss the different OSs and virtualization you will encounter, such as Docker and ENV variables, to configure different environments.</p>
<h3 id="operating-system-choices-windowsmaclinux">Operating System Choices (Windows/Mac/Linux)</h3>
<p>Choosing the right operating system might seem significant. Primarily, it&rsquo;s a preference for what you like and know. Still, there is the fact that most <strong>data platforms</strong> that run on a server will run on a Linux-based OS. Working on Linux on the client might give you skills you can reuse, but you can also get that on Windows with <a href="https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux" target="_blank" rel="noopener noreffer">WSL</a><sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>, and macOS is built on Darwin, a Unix-based system.</p>
<p>Your employer also defines it. If you are a Microsoft shop, you use tools such as Power BI, Visual Studio (not Visual Studio Code), and C#. This requires using Windows or at least a VM with Windows.</p>
<p>If you work at a startup and need great hardware that is easy to use, the company will probably provide you with the latest MacBook with MacOS installed. However, if you are a power user or need your <a href="https://www.freecodecamp.org/news/dotfiles-what-is-a-dot-file-and-how-to-create-it-in-mac-and-linux/" target="_blank" rel="noopener noreffer">Dotfiles</a>, you may not use anything other than a Linux-based operating system. We will look later at fundamental Linux commands that make the life of every data engineer easier.</p>
<h3 id="virtual-machine-vm">Virtual Machine (VM)</h3>
<p>As mentioned, you could run MacOS and Windows in a VM with VMware or Parallels. These are not native installations, but close to it, and they allow you to do most things.</p>
<p>The same goes if you are on Windows: instead of using WSL, which can get tricky with company proxies and network routing, you could use a Linux VM, either locally or hosted somewhere, that you simply SSH into, or try an <a href="https://www.youtube.com/live/LA8KF9Fs2sk?si=_nQRGKJIa_NlFHn2&amp;t=1072" target="_blank" rel="noopener noreffer">advanced example with Nix</a>. There are other solutions to explore; e.g., your whole machine could be a VM provided by your company, or you deploy a <a href="https://code.visualstudio.com/docs/remote/vscode-server" target="_blank" rel="noopener noreffer">VS Code server</a> to run VS Code instances inside your company network.</p>
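<p>If you go the remote-VM route, the day-to-day interaction is mostly SSH. A minimal sketch, with host names and ports as placeholders:</p>
<pre><code class="language-sh"># Sketch: work on a remote Linux VM over SSH
ssh -i ~/.ssh/id_ed25519 dev@my-dev-vm     # open a shell on the VM

# Forward a port so a web UI running on the VM is reachable locally,
# e.g. a VS Code server or a notebook listening on port 8080
ssh -L 8080:localhost:8080 dev@my-dev-vm
</code></pre>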
<h3 id="env-variables">ENV variables</h3>
<p>The next layer you commonly use is ENV variables. This is already a little more advanced, but think of reproducible environments shared with your co-workers, or managing different environments (dev/staging/prod), instead of hard-copying all settings, which breaks on machines with a different OS or setup.</p>
<p>If you type <code>env</code> in a Linux-based OS terminal, you can see all your locally set ENV variables. To illustrate, I have set these ENVs:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">❯ env
</span></span><span class="line"><span class="cl"><span class="nv">AIRFLOW_HOME</span><span class="o">=</span>~/.airflow
</span></span><span class="line"><span class="cl"><span class="nv">SPARK_HOME</span><span class="o">=</span>~/Documents/spark/spark-3.5.1-bin-hadoop3.3
</span></span><span class="line"><span class="cl"><span class="nv">MINIO_ENDPOINT</span><span class="o">=</span>http://127.0.0.1:9000
</span></span><span class="line"><span class="cl"><span class="nv">GITHUB_USER</span><span class="o">=</span>sspaeti
</span></span><span class="line"><span class="cl"><span class="nv">AWS_SECRET_ACCESS_KEY</span><span class="o">=</span>my-secure-key
</span></span><span class="line"><span class="cl"><span class="nv">AWS_ACCESS_KEY_ID</span><span class="o">=</span>my-access-key
</span></span></code></pre></td></tr></table>
</div>
</div><p>These can be set in a project repository&rsquo;s folder, usually in a <code>.env</code> file, which will be picked up automatically. However, the recommended approach is using SSO CLI tools (like <code>aws sso login</code> or <code>gcloud auth login</code>), which will automatically populate credentials in the expected locations, or alternatively adding them to your shell config (<code>~/.bashrc</code>, <code>~/.zshrc</code>).</p>
<div class="details admonition info open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-info"></i>Never commit `.env` files to version control<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">They often contain sensitive credentials. Add <code>.env</code> to your <code>.gitignore</code> file. Instead, provide an example file like <code>.env.example</code> with dummy values.</div>
        </div>
    </div>
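<p>A minimal sketch of that pattern: keep a committed <code>.env.example</code> with dummy values, copy it locally, and load it into the current shell session:</p>
<pre><code class="language-sh"># Sketch: bootstrap local ENV variables from a committed example file
cp .env.example .env    # fill in real values locally; .env stays gitignored

# Load every variable from .env into the current shell session
set -a                  # export all variables defined from now on
source .env
set +a
</code></pre>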
<h3 id="docker-and-container-images">Docker and Container Images</h3>
<p>Another virtualized environment is <a href="https://www.docker.com/" target="_blank" rel="noopener noreffer">Docker</a>, and specifically <strong><a href="https://docs.docker.com/build/concepts/dockerfile/" target="_blank" rel="noopener noreffer">Dockerfiles</a></strong>. Docker is the engine that runs your Dockerfile on all platforms and architectures, letting you create a container image and build it for Linux on a Windows machine.</p>
<p>That makes containers so powerful: you can <strong>package and containerize complex data engineering requirements into a single Dockerfile</strong>, and everyone can run it on any machine—whether locally, in CI/CD pipelines, or orchestrated in Kubernetes clusters. Think of container packages on ships that transport goods; the breakthrough was the standardized container size that fits on every boat; every harbor could maneuver them. Similarly, container images have become the standard for packaging data and software ecosystems, with formats originally defined by Docker now being widely supported across <a href="https://kubernetes.io/blog/2020/12/02/dont-panic-kubernetes-and-docker/" target="_blank" rel="noopener noreffer">different container runtimes and platforms</a>.</p>
<p>A simple nginx (webserver) example:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-dockerfile" data-lang="dockerfile"><span class="line"><span class="cl"><span class="c"># Use the official NGINX image from Docker Hub</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="s">nginx:latest</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="c"># Copy your custom NGINX configuration file (if you have one)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="k">COPY</span> nginx.conf /etc/nginx/nginx.conf<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="c"># Copy static website files to the appropriate directory</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="k">COPY</span> . /usr/share/nginx/html<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="c"># Expose the port NGINX listens on</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="k">EXPOSE</span><span class="w"> </span><span class="s">80</span><span class="err">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Docker also supports <a href="https://docs.docker.com/reference/dockerfile/" target="_blank" rel="noopener noreffer">different instructions</a> that you can use in a Dockerfile.</p>
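<p>Building and running the image above takes two commands; the image tag <code>my-nginx</code> is arbitrary:</p>
<pre><code class="language-sh"># Sketch: build the Dockerfile above and run the container locally
docker build -t my-nginx .
docker run --rm -p 8080:80 my-nginx    # map host port 8080 to the container&#39;s port 80

# Quick check from another terminal
curl -s http://localhost:8080
</code></pre>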
<div class="details admonition info open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-info"></i>Different Architectures (amd64, arm64) and Line Feeds<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">When building <code>docker build</code> images, be aware of the <strong>different architectures</strong>. Whether you build Docker images or want to run them on other servers, <strong>line endings</strong> can cause issues in Dockerfiles and scripts—Windows uses CRLF (<code>\r\n</code>). In contrast, Linux/Mac uses LF (<code>\n</code>), which can break shell scripts and Docker builds. Use <code>.gitattributes</code> or configure your editor to use LF consistently.</div>
        </div>
    </div>
<div class="details admonition tip open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-tip"></i>Devcontainers<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">Like Docker, Devcontainers is an extra file in <code>devcontainer.json</code>. It works well with VS Code, allowing you to use Docker containers as full-featured development environments with predefined tools and runtime stacks.</div>
        </div>
    </div>
<h2 id="linux-de-fundamentals">Linux DE Fundamentals</h2>
<p>Even though you might use Windows, Linux is key for a data engineer. You don&rsquo;t need to be an expert, but you shouldn&rsquo;t be afraid of command-line tools, and you should know some basic Linux commands. Also be aware that some of them are quite powerful.</p>
<h3 id="opening-and-editing-a-file-with-nanovim">Opening and Editing a File with Nano/Vim</h3>
<p>Editing or creating a new file might not be as easy as it seems. Command-line text editors such as <a href="https://de.wikipedia.org/wiki/Nano_%28Texteditor%29" target="_blank" rel="noopener noreffer">Nano</a> or <a href="https://en.wikipedia.org/wiki/Vim_%28text_editor%29" target="_blank" rel="noopener noreffer">Vim</a> can be used for this task. Nano is recommended, as it displays the shortcuts to save or exit. Vim can be intimidating at first, but it&rsquo;s a <a href="https://www.ssp.sh/blog/why-using-neovim-data-engineer-and-writer-2023/" target="_blank" rel="noopener noreffer">worthwhile investment</a> when working 8 hours a day in the terminal, even more so <a href="https://www.ssp.sh/brain/vim-language-and-motions/" target="_blank" rel="noopener noreffer">Vim Motions</a>.<br>
<figure>
<a target="_blank" href="/blog/data-engineering-toolkit/nano.png" title="/blog/data-engineering-toolkit/nano.png">
</a><figcaption class="image-caption">Example of editing the above Dockerfile in Nano.</figcaption>
</figure></p>
<h3 id="basic-linux-tools-and-commands">Basic Linux Tools and Commands</h3>
<p>In addition to the basic Linux commands you have probably used or encountered, like <code>cp</code>, <code>mv</code>, and <code>ssh</code> (see the image below), which are also super helpful on a server, we focus here on the data engineering Linux commands you run on your laptop, where you can install things.</p>
<figure><a target="_blank" href="/blog/data-engineering-toolkit/linux-basic.webp" title="">
</a><figcaption class="image-caption">Image from <a href="https://blog.amigoscode.com/p/linux-is-a-must-seriously" target="_blank" rel="noopener noreffer">Linux is a MUST. Seriously&hellip;</a> | Also, check out the book <a href="https://danieljbarrett.com/books/efficient-linux-at-the-command-line/" target="_blank" rel="noopener noreffer">Efficient Linux at the Command Line</a> by Daniel J. Barrett.</figcaption>
</figure>
<p>Most tools are Python-related and cover the core tasks of a data engineer: ingesting data, transforming it, and serving it to the organization or its users. But the additional DE Linux commands I often use to quickly check an API, copy something over, or check processes are:</p>
<ul>
<li><code>curl</code>: Quickly check whether an API is reachable from the command line (see the example after this list).</li>
<li><code>make</code> / <code>cron</code>: Simple orchestration from the command line. More on this in the next section.</li>
<li><code>ssh</code> / <code>rsync</code>: SSH to connect to another machine, and rsync as a fast, versatile synchronization tool to quickly back up or move data from your machine to a server.</li>
<li><code>bat</code>: Shows the contents of a file nicely formatted, with git integration.</li>
<li><code>tail</code>: Displays the last part of a file, which is helpful if the file is big and cat/bat would take too long.</li>
<li><code>which</code>: Locates a program in the user&rsquo;s path to check whether the right tool is running.</li>
<li><code>brew</code>: The macOS package manager; the easiest way to install tools and command-line utilities.</li>
</ul>
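<p>For example, a quick <code>curl</code> check of whether an API answers (the URL is just an example) prints only the HTTP status code:</p>
<pre><code class="language-sh"># Sketch: check an API endpoint and print only the HTTP status code
curl -s -o /dev/null -w &#39;%{http_code}\n&#39; https://api.github.com
</code></pre>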
<p>Related to the above basic Linux commands:</p>
<ul>
<li><code>grep</code>: Used to filter the output of almost any command. E.g., quickly search your env variables for AWS settings:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">❯ env | grep AWS
</span></span><span class="line"><span class="cl">AWS_ACCESS_KEY_ID=my-access-key
</span></span><span class="line"><span class="cl">AWS_BUCKET=my-bucket
</span></span><span class="line"><span class="cl">AWS_SECRET_ACCESS_KEY=my-secret
</span></span></code></pre></td></tr></table>
</div>
</div><ul>
<li><code>ps aux</code> and <code>htop</code>: To check the currently running processes. Ps is also handy in combination with grep (<code>ps aux | grep my-program.py</code>)</li>
<li><code>rg</code> and <code>fzf</code>: Ripgrep (rg) is a recursive, line-oriented search tool that searches through all files, and fzf is a fuzzy finder. In combination, you can interactively fuzzy-find through the contents of Python files in the current folder with <code>rg -t py &quot;def main&quot; . | fzf</code>. (Also check out <a href="https://www.ssp.sh/brain/recursive-search-in-terminal-with-fzf/" target="_blank" rel="noopener noreffer">Recursive Search in Terminal with fzf</a>; reverse search with <code>ctrl+r</code> will change your command-line life).</li>
</ul>
<div class="details admonition tip open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-tip"></i>TUIs for More Efficiency (and Fun!)<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">If you frequently use <strong>git</strong> or <strong>docker</strong>, please check out <a href="https://github.com/jesseduffield/lazygit/" target="_blank" rel="noopener noreffer">Lazygit</a>, <a href="https://github.com/jesseduffield/lazydocker" target="_blank" rel="noopener noreffer">Lazydocker</a>, and <a href="https://github.com/derailed/k9s" target="_blank" rel="noopener noreffer">k9s</a>. These TUIs show all commands within a single command. Instead of memorizing or typing lengthy commands, you can use a graphical user interface in the terminal and navigate with the keyboard.</div>
        </div>
    </div>
<h3 id="simple-orchestration">Simple Orchestration</h3>
<p>The core responsibility of a data engineer is to orchestrate different jobs in the correct order and fully automate them. We use data orchestrators (<a href="https://github.com/apache/airflow" target="_blank" rel="noopener noreffer">Airflow</a>, <a href="https://github.com/dagster-io/dagster" target="_blank" rel="noopener noreffer">Dagster</a>, <a href="https://github.com/PrefectHQ/prefect" target="_blank" rel="noopener noreffer">Prefect</a>, etc.), but Linux also has us covered.</p>
<p><strong><a href="https://makefiletutorial.com/" target="_blank" rel="noopener noreffer">Makefile</a></strong> and <strong><a href="https://en.wikipedia.org/wiki/Cron" target="_blank" rel="noopener noreffer">cron</a></strong> jobs are out of the box and installed on every Linux system. For example, Makefiles let us store a combination of commands like this:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">API_URL :<span class="o">=</span> <span class="s2">&#34;https://api.coincap.io/v2/assets&#34;</span>
</span></span><span class="line"><span class="cl">DATA_DIR :<span class="o">=</span> /tmp/data
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">etl: extract transform load
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">extract:
</span></span><span class="line"><span class="cl">  mkdir -p <span class="k">$(</span>DATA_DIR<span class="k">)</span>
</span></span><span class="line"><span class="cl">  curl -s <span class="k">$(</span>API_URL<span class="k">)</span> <span class="p">|</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">    jq -r <span class="s1">&#39;.data[] | [.symbol, .priceUsd, .marketCapUsd] | @csv&#39;</span> &gt; <span class="se">\
</span></span></span><span class="line"><span class="cl">    <span class="k">$(</span>DATA_DIR<span class="k">)</span>/crypto_raw.csv
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">transform:
</span></span><span class="line"><span class="cl">  ./scripts/transform_data.sh
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">load:
</span></span><span class="line"><span class="cl">  cat <span class="k">$(</span>DATA_DIR<span class="k">)</span>/crypto_raw.csv <span class="p">|</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">    sort -t<span class="s1">&#39;,&#39;</span> -k3,3nr <span class="p">|</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">    head -n <span class="m">10</span> &gt; <span class="k">$(</span>DATA_DIR<span class="k">)</span>/top_10_crypto.csv
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">clean:
</span></span><span class="line"><span class="cl">  rm -rf <span class="k">$(</span>DATA_DIR<span class="k">)</span>/*
</span></span></code></pre></td></tr></table>
</div>
</div><p>Running <code>make extract</code> will download data from the HTTPS API and store it as CSV, which we can check with <code>tail</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">❯ make extract
</span></span><span class="line"><span class="cl">mkdir -p /tmp/data
</span></span><span class="line"><span class="cl">curl -s <span class="s2">&#34;https://api.coincap.io/v2/assets&#34;</span> <span class="p">|</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">                jq -r <span class="s1">&#39;.data[] | [.symbol, .priceUsd, .marketCapUsd] | @csv&#39;</span> &gt; <span class="se">\
</span></span></span><span class="line"><span class="cl">                /tmp/data/crypto_raw.csv
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">❯ tail -n <span class="m">3</span> /tmp/data/crypto_raw.csv
</span></span><span class="line"><span class="cl"><span class="s2">&#34;ZEN&#34;</span>,<span class="s2">&#34;25.2499663234287359&#34;</span>,<span class="s2">&#34;399199442.5767759717054100&#34;</span>
</span></span><span class="line"><span class="cl"><span class="s2">&#34;SUSHI&#34;</span>,<span class="s2">&#34;1.4507020739095067&#34;</span>,<span class="s2">&#34;381986878.5063751499688694&#34;</span>
</span></span><span class="line"><span class="cl"><span class="s2">&#34;JST&#34;</span>,<span class="s2">&#34;0.0384023939139102&#34;</span>,<span class="s2">&#34;380183699.7477109800000000&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Combining these commands can be quick and super powerful. Make is just one example; storing the commands in git and checking them in means everyone can use them.</p>
<p><a href="https://en.wikipedia.org/wiki/Cron" target="_blank" rel="noopener noreffer">Crontabs</a> are another way to schedule them, daily for example, as shown below.</p>
<h5 id="pipeline-command-join-different-commands-together">Pipeline command: Join different commands together <code>|</code></h5>
<p>In line with the <a href="https://en.wikipedia.org/wiki/Unix_philosophy" target="_blank" rel="noopener noreffer">Unix Philosophy</a>, to make one tool do one thing as best as possible, you can combine &ldquo;<a href="https://en.wikipedia.org/wiki/Pipeline_%28Unix%29" target="_blank" rel="noopener noreffer">pipe</a>&rdquo; different tools with <code>|</code> as we&rsquo;ve seen examples already above with <code>grep</code> and others.</p>
<p>Here is another example, checking whether any SQL-related Python packages have been installed:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">pip freeze <span class="p">|</span> grep SQL
</span></span></code></pre></td></tr></table>
</div>
</div><p>This allows building data pipelines within the terminal in a single command line by stacking different operations together. Here&rsquo;s an example of powerful command chaining with pipes:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">❯ bat /tmp/data/crypto_raw.csv <span class="p">|</span> tr -d <span class="s1">&#39;&#34;&#39;</span> <span class="p">|</span> cut -d<span class="s1">&#39;,&#39;</span> -f1,3 <span class="p">|</span> sort -t<span class="s1">&#39;,&#39;</span> -k2 -nr <span class="p">|</span> head -n <span class="m">4</span> BTC,1920648934960.3101078883559601 ETH,386675369242.2018025632681003 XRP,161734797349.4803555794799785 USDT,137222181131.1690655355161784
</span></span></code></pre></td></tr></table>
</div>
</div><p>The pipeline reads the above CSV file and extracts the coin name and market cap only (using <code>cut</code>), removes the quotes (<code>tr</code>), and then sorts by the market cap value numerically in descending order to show the top 4 biggest cryptocurrencies by market capitalization.</p>
<h4 id="data-processing">Data Processing</h4>
<p>Another example could be data processing within the command line—e.g., quickly splitting a large CSV that you are unable to open with a text editor:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl"><span class="c1"># Split large CSV while keeping header</span>
</span></span><span class="line"><span class="cl">head -n1 large_file.csv &gt; header.csv
</span></span><span class="line"><span class="cl">split -l <span class="m">1000000</span> --filter<span class="o">=</span><span class="s1">&#39;tail -n +2&#39;</span> large_file.csv chunk_
</span></span><span class="line"><span class="cl"><span class="c1"># Add header back to each chunk</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> f in chunk_*<span class="p">;</span> <span class="k">do</span> cat header.csv <span class="s2">&#34;</span><span class="nv">$f</span><span class="s2">&#34;</span> &gt; <span class="s2">&#34;with_header_</span><span class="nv">$f</span><span class="s2">&#34;</span><span class="p">;</span> <span class="k">done</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>I hope you can imagine how you could build any small, efficient data pipeline with a Makefile and the Pipe commands.</p>
<h2 id="developer-productivity">Developer Productivity</h2>
<p>Next, we will look at the newer tools that sit on top of the terminal and CLIs: powerful IDEs, notebooks, or workspaces, and git for version-controlling everything.</p>
<h3 id="ide-working-environment">IDE (Working environment)</h3>
<p>An integrated development environment (IDE) is where we program our code and get code completion, linters, and AI assistance to make us (hopefully) more productive.</p>
<p>Popular IDEs, with their usage share according to the <a href="https://survey.stackoverflow.co/2024/technology#most-popular-technologies-new-collab-tools-prof" target="_blank" rel="noopener noreffer">StackOverflow Survey 2024</a>, are:</p>
<ul>
<li><strong><a href="https://code.visualstudio.com/" target="_blank" rel="noopener noreffer">Visual Studio Code</a></strong> (73.6%) - Microsoft&rsquo;s lightweight but powerful source code editor with extensive plugin support and language coverage.</li>
<li><strong><a href="https://visualstudio.microsoft.com/" target="_blank" rel="noopener noreffer">Visual Studio</a></strong> (29.3%) - Microsoft&rsquo;s full-featured IDE, powerful for .NET development and enterprise applications.</li>
<li>Other editors sorted percentage-wise are <a href="https://www.jetbrains.com/idea/" target="_blank" rel="noopener noreffer">IntelliJ IDEA</a> (26.8%), <a href="https://notepad-plus-plus.org/" target="_blank" rel="noopener noreffer">Notepad++</a> (23.9%), <a href="https://www.vim.org/" target="_blank" rel="noopener noreffer">Vim</a> (21.6%), <a href="https://www.jetbrains.com/pycharm/" target="_blank" rel="noopener noreffer">PyCharm</a> (15.1%), <a href="https://jupyter.org/" target="_blank" rel="noopener noreffer">Jupyter</a> (12.8%), <a href="https://neovim.io/" target="_blank" rel="noopener noreffer">Neovim</a> (12.5%), <a href="https://www.sublimetext.com/" target="_blank" rel="noopener noreffer">Sublime Text</a> (10.9%), <a href="https://www.eclipse.org/" target="_blank" rel="noopener noreffer">Eclipse</a> (9.4%), <a href="https://developer.apple.com/xcode/" target="_blank" rel="noopener noreffer">Xcode</a> (9.3%)</li>
</ul>
<p>Not even on the map in 2024 were the IDEs that go all-in on AI:</p>
<ul>
<li><strong><a href="https://cursor.sh/" target="_blank" rel="noopener noreffer">Cursor</a></strong> - A VS Code-based editor explicitly built for AI-assisted development, featuring GitHub Copilot integration and specialized AI tooling for code completion and refactoring.</li>
<li><strong><a href="https://www.windsurf.ai/" target="_blank" rel="noopener noreffer">Windsurf</a></strong> - An AI-first code editor designed to streamline development workflow with features like natural language code generation and intelligent code suggestions.</li>
<li><strong><a href="https://zed.dev/" target="_blank" rel="noopener noreffer">Zed</a></strong> - A high-performance, multiplayer code editor with AI capabilities created by former Atom developers.</li>
</ul>
<h3 id="codespaces-and-workspaces">Codespaces and Workspaces</h3>
<p>In addition to IDEs that are usually installed locally, we also have codespaces (or workspaces, depending on the naming) that live in the browser. These are super handy because everyone has the same environment, and the days of &ldquo;does not work on my machine&rdquo; are gone.</p>
<p>These tools include <strong><a href="https://github.com/features/codespaces" target="_blank" rel="noopener noreffer">GitHub Codespaces</a></strong>, <strong><a href="https://devpod.sh/" target="_blank" rel="noopener noreffer">Devpod</a></strong>, <strong><a href="https://replit.com/" target="_blank" rel="noopener noreffer">Replit</a></strong>, <strong><a href="https://stackblitz.com/" target="_blank" rel="noopener noreffer">Stackblitz</a></strong>, <strong><a href="https://codesandbox.io/" target="_blank" rel="noopener noreffer">CodeSandbox</a></strong>, <strong><a href="https://www.gitpod.io/" target="_blank" rel="noopener noreffer">Gitpod</a></strong>, and many others.</p>
<h3 id="notebooks">Notebooks</h3>
<p>In addition to IDEs and Codespaces, you can use a notebook that runs locally or in the cloud. This option is generally more flexible and allows you to visualize results and document the code. However, putting it in production has a downside: It&rsquo;s harder to restart, backfill, or configure with different variables.</p>
<p>It’s more flexible and easier to get started, but transitioning notebooks to production remains challenging even on platforms like Databricks, which are designed to support a development-to-production workflow.</p>
<p>Popular notebooks include <strong><a href="https://jupyter.org/" target="_blank" rel="noopener noreffer">Jupyter Notebook</a></strong> / <strong><a href="https://jupyter.org/hub" target="_blank" rel="noopener noreffer">JupyterHub</a></strong>, <strong><a href="https://zeppelin.apache.org/" target="_blank" rel="noopener noreffer">Apache Zeppelin</a></strong>, and <strong><a href="https://www.databricks.com/product/collaborative-notebooks" target="_blank" rel="noopener noreffer">Databricks Notebook</a></strong>. Newer takes on Jupyter Notebooks with more integrated features and a robust cloud behind them are <strong><a href="https://deepnote.com/" target="_blank" rel="noopener noreffer">Deepnote</a></strong>, <strong><a href="https://hex.tech/" target="_blank" rel="noopener noreffer">Hex</a></strong>, <strong><a href="https://count.co/" target="_blank" rel="noopener noreffer">Count.co</a></strong>, <strong><a href="https://ensoanalytics.com/" target="_blank" rel="noopener noreffer">Enso</a></strong>, and <strong><a href="https://motherduck.com/docs/getting-started/motherduck-quick-tour/" target="_blank" rel="noopener noreffer">MotherDuck</a></strong>, which combines the flexibility of notebooks with the power of DuckDB&rsquo;s analytics engine.</p>
<div class="details admonition note open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-note"></i>Spreadsheets<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">There is even one more category: <strong>spreadsheet-style apps</strong>. They are similar to notebooks as they can also run <strong>Python</strong> and <strong>JavaScript</strong> inside cells. Think <a href="https://www.quadratichq.com/" target="_blank" rel="noopener noreffer">Quadratic</a>, Excel, and others.</div>
        </div>
    </div>
<h3 id="git-version-control">Git Version Control</h3>
<p><a href="https://git-scm.com/" target="_blank" rel="noopener noreffer">Git</a> is probably the most used version control in data engineering nowadays. There was a time of <a href="https://tortoisesvn.net/" target="_blank" rel="noopener noreffer">TortoiseSVN</a> and others.</p>
<p>As a data engineer, you need to version your code and products to easily roll back in case of errors and to work together as a team. The most common git workflow commands are:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">git pull origin main <span class="c1"># Pull latest changes</span>
</span></span><span class="line"><span class="cl">git status <span class="c1"># Check status of your changes</span>
</span></span><span class="line"><span class="cl">git add pipeline.py <span class="c1">#stage</span>
</span></span><span class="line"><span class="cl">git commit -m <span class="s2">&#34;fix: update extraction logic for new API version&#34;</span> <span class="c1">#commit</span>
</span></span><span class="line"><span class="cl">git push origin main <span class="c1"># Push to remote repository</span>
</span></span><span class="line"><span class="cl">git checkout -b feature/new-data-source <span class="c1"># Create and switch to a new branch</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>For more complex operations, consider using a Git GUI client. Some popular options include <a href="https://www.gitkraken.com/" target="_blank" rel="noopener noreffer">GitKraken</a>, <a href="https://www.sourcetreeapp.com/" target="_blank" rel="noopener noreffer">SourceTree</a>, <a href="https://github.com/jesseduffield/lazygit" target="_blank" rel="noopener noreffer">Lazygit</a> (terminal UI), and <a href="https://github.com/dictcp/awesome-git#client" target="_blank" rel="noopener noreffer">many more</a>.</p>
<h2 id="data-engineer-programming-languages">Data Engineer Programming Languages</h2>
<p>Before we wrap up, let&rsquo;s look at a data engineer&rsquo;s programming languages. These change depending on whether you work more on infrastructure, pipelines, or business logic and extraction.</p>
<p>The most prominent language you will use is still <strong>SQL</strong>: it is the query language of every BI tool, it powers most transformations with dbt and similar tools, and the most popular data engineering libraries expose a SQL API, which makes it the best first language to master. Right after that, especially if you build a lot of data pipelines and go beyond basic transformations, you won&rsquo;t get around <strong>Python</strong>. Python is the tooling language of a data engineer; think of it as the Swiss army knife.</p>
<p>Lastly, if you are in infrastructure and need to deploy the data stack, you primarily work with <strong>YAML</strong> as a definition language for Helm, Kubernetes, and similar deployments (Terraform uses its own configuration language, HCL). You could also write some Rust if you are developing infrastructure or performance-critical components.</p>
<p>We can see the most popular languages in the <a href="https://survey.stackoverflow.co/2024/technology#admired-and-desired" target="_blank" rel="noopener noreffer">StackOverflow 2024</a> data, queried with DuckDB against a shared database on MotherDuck—simply <a href="https://app.motherduck.com/" target="_blank" rel="noopener noreffer">sign up</a> (if you haven&rsquo;t) and <a href="https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/" target="_blank" rel="noopener noreffer">create a token</a> to query the database with this <a href="https://gist.github.com/sspaeti/64405c15ef5b0f969435195cbdd05c04" target="_blank" rel="noopener noreffer">SQL query</a>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">┌─────────────────────────┬───────┬──────────────────────────────────────────┐
</span></span><span class="line"><span class="cl">│        language         │ count │                  chart                   │
</span></span><span class="line"><span class="cl">│         varchar         │ int64 │                 varchar                  │
</span></span><span class="line"><span class="cl">├─────────────────────────┼───────┼──────────────────────────────────────────┤
</span></span><span class="line"><span class="cl">│ JavaScript              │ <span class="m">37492</span> │ ████████████████████████████████████████ │
</span></span><span class="line"><span class="cl">│ HTML/CSS                │ <span class="m">31816</span> │ █████████████████████████████████▉       │
</span></span><span class="line"><span class="cl">│ Python                  │ <span class="m">30719</span> │ ████████████████████████████████▊        │
</span></span><span class="line"><span class="cl">│ SQL                     │ <span class="m">30682</span> │ ████████████████████████████████▋        │
</span></span><span class="line"><span class="cl">│ TypeScript              │ <span class="m">23150</span> │ ████████████████████████▋                │
</span></span><span class="line"><span class="cl">│ Bash/Shell <span class="o">(</span>all shells<span class="o">)</span> │ <span class="m">20412</span> │ █████████████████████▊                   │
</span></span><span class="line"><span class="cl">│ Java                    │ <span class="m">18239</span> │ ███████████████████▍                     │
</span></span><span class="line"><span class="cl">│ C#                      │ <span class="m">16318</span> │ █████████████████▍                       │
</span></span><span class="line"><span class="cl">│ C++                     │ <span class="m">13827</span> │ ██████████████▊                          │
</span></span><span class="line"><span class="cl">│ C                       │ <span class="m">12184</span> │ ████████████▉                            │
</span></span><span class="line"><span class="cl">├─────────────────────────┴───────┴──────────────────────────────────────────┤
</span></span><span class="line"><span class="cl">│ <span class="m">10</span> rows                                                          <span class="m">3</span> columns │
</span></span><span class="line"><span class="cl">└────────────────────────────────────────────────────────────────────────────┘
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="beyond-languages">Beyond Languages</h3>
<p>Beyond programming languages, you must get to know various <strong>databases and their concepts</strong>, such as <a href="https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf" target="_blank" rel="noopener noreffer">relational database theory</a>. It does not matter which SQL dialect you learn, as they are all related, but knowing the fundamentals of a specific database, such as Postgres, DuckDB, or a NoSQL database, will help you on your journey.</p>
<p>Python libraries and frameworks are the last area we look at, and the one where you can spend most of your time. Instead of learning as many as possible, I suggest investing in the few used at your company and where you benefit most.</p>
<p>Typical starter libraries include <a href="https://duckdb.org/docs/api/python/overview.html" target="_blank" rel="noopener noreffer">DuckDB</a> (a powerful in-memory transformation library and database with <a href="https://motherduck.com/blog/the-simple-joys-of-scaling-up/" target="_blank" rel="noopener noreffer">scale-up</a> capabilities via MotherDuck<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>), <a href="https://pandas.pydata.org/" target="_blank" rel="noopener noreffer">Pandas</a> (flexible data manipulation), <a href="https://arrow.apache.org/docs/python/index.html" target="_blank" rel="noopener noreffer">PyArrow</a> (optimized for columnar data), <a href="https://pola.rs/" target="_blank" rel="noopener noreffer">Polars</a> (fast and scalable DataFrame library), and <a href="https://spark.apache.org/docs/latest/api/python/index.html" target="_blank" rel="noopener noreffer">PySpark</a> (for distributed data processing with Apache Spark).</p>
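<p>To make this concrete, here is a minimal sketch of how two of these starter libraries play together; the file name and columns are hypothetical, but DuckDB can run SQL directly against a Polars DataFrame living in the same script:</p>
<pre><code class="language-python">import duckdb
import polars as pl

# Hypothetical input file with customer_id and amount columns
orders = pl.read_csv("orders.csv")

# DuckDB can run SQL directly against the in-memory Polars DataFrame by name
revenue = duckdb.sql(
    "SELECT customer_id, SUM(amount) AS revenue FROM orders GROUP BY customer_id"
).pl()  # back to a Polars DataFrame

print(revenue)
</code></pre>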
<h3 id="python-libraries">Python Libraries</h3>
<p>There are many more libraries available, especially when you need to quickly access an API or perform a task that a CLI can&rsquo;t. Some key libraries can be beneficial depending on the use case you are working on; a small sketch combining a few of them follows the lists below.</p>
<p>Data Ingestion:</p>
<ul>
<li><a href="https://requests.readthedocs.io/en/latest/" target="_blank" rel="noopener noreffer">Requests</a> - HTTP library for API queries and web scraping</li>
<li><a href="https://www.crummy.com/software/BeautifulSoup/" target="_blank" rel="noopener noreffer">BeautifulSoup</a> - HTML parsing library for web scraping</li>
</ul>
<p>Developer Tools:</p>
<ul>
<li><a href="https://github.com/astral-sh/uv" target="_blank" rel="noopener noreffer">uv</a> / <a href="https://pip.pypa.io/" target="_blank" rel="noopener noreffer">pip</a> - Package installers for Python, with uv being a modern, fast alternative to pip</li>
<li><a href="https://docs.astral.sh/ruff/" target="_blank" rel="noopener noreffer">Ruff</a> - Fast linter and code formatter</li>
<li><a href="https://docs.pytest.org/" target="_blank" rel="noopener noreffer">Pytest</a> - A testing framework for Python</li>
</ul>
<p>Data Validation:</p>
<ul>
<li><a href="https://docs.pydantic.dev/" target="_blank" rel="noopener noreffer">Pydantic</a> - Data validation for Python objects</li>
<li><a href="https://pandera.readthedocs.io/" target="_blank" rel="noopener noreffer">Pandera</a> - Schema validation for dataframes</li>
<li><a href="https://github.com/great-expectations/great_expectations" target="_blank" rel="noopener noreffer">Great Expectations</a> / <a href="https://github.com/OpenLineage/OpenLineage" target="_blank" rel="noopener noreffer">OpenLineage</a> - Data quality validation framework and data lineage tracking tools</li>
</ul>
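<p>As a small, hedged illustration of how a few of these fit together, the sketch below pulls records from a hypothetical API with Requests and validates each one with Pydantic before it enters the pipeline; the endpoint and fields are made up:</p>
<pre><code class="language-python">import requests
from pydantic import BaseModel, ValidationError

class Order(BaseModel):
    order_id: int
    amount: float
    currency: str = "USD"

# Hypothetical endpoint; replace with your own API
response = requests.get("https://api.example.com/orders", timeout=30)
response.raise_for_status()

valid, rejected = [], []
for record in response.json():
    try:
        valid.append(Order(**record))   # raises if fields or types don't match
    except ValidationError as err:
        rejected.append((record, err))

print(f"{len(valid)} valid orders, {len(rejected)} rejected")
</code></pre>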
<p>We could go on forever. Libraries exist for virtually everything: data ingestion, orchestration, BI tools, you name it. We could discuss setting up a Python project (it&rsquo;s not a solved problem, and there are many ways of doing it), discuss DevOps and how to use a simple Helm script, set up a local storage system that mimics S3, and more.</p>
<h2 id="wrapping-up">Wrapping Up</h2>
<p>Instead, let&rsquo;s wrap it up here. I hope you enjoyed this article; it gave you an overview and a sense of how much is asked of a data engineer these days. As this might be overwhelming, I suggest focusing on fundamentals first and taking it step by step. It&rsquo;s better to understand the why than to skip over it quickly. Also, as we are in the AI era, use ChatGPT to explain a command or a CLI tool to you; it will do a much better job than any Google search.</p>
<p>We&rsquo;ve covered the <strong>foundational</strong> tools and environments of modern data engineering, skills that are often overlooked but crucial for any data engineer. From selecting the proper OS and virtualization setup to mastering Linux fundamentals and CLIs, these building blocks enable efficient data pipeline development without always requiring complex tools.</p>
<p>This foundation reminds us that sometimes the simplest solution is the most effective—a well-chosen Linux command can often replace a complex toolchain. I hope these technical skills, expected of a modern data engineer, will help you along your journey when working from the command line on your machine.</p>
<hr>
<p>MotherDuck strives for <a href="https://motherduck.com/docs/getting-started/" target="_blank" rel="noopener noreffer">modern data development</a> and developer productivity. For instance, its approach to developer productivity allows seamless scaling from local development to production: developers can work with DuckDB locally using <code>path: &quot;local.duckdb&quot;</code> for their development environment, then simply point their production environment to MotherDuck with <code>path: &quot;md:prod_database&quot;</code>. This lets engineers focus on feature implementation while MotherDuck handles the scaling and performance.</p>
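<p>In the Python DuckDB client, the same switch is just a different connection string. A minimal sketch, reusing the database name from the example above and assuming a <code>MOTHERDUCK_TOKEN</code> environment variable for authentication:</p>
<pre><code class="language-python">import duckdb

# Development: a local DuckDB file next to your code
dev_con = duckdb.connect("local.duckdb")

# Production: point the same code at MotherDuck
# (assumes the MOTHERDUCK_TOKEN environment variable is set)
prod_con = duckdb.connect("md:prod_database")
</code></pre>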
<p>For a practical example, check out this implementation in the <a href="https://youtu.be/z3trqkKPbsI?si=mcLeiUi-5YBMs5oI&amp;t=613" target="_blank" rel="noopener noreffer">Deep Dive - Shifting Left and Moving Forward with MotherDuck</a>: </p>













  
<figure><a target="_blank" href="/blog/data-engineering-toolkit/motherduck-dagster.webp" title="">

</a><figcaption class="image-caption">Code snippet available on <a href="https://github.com/dagster-io/dagster/blob/1750e8fa2a2d56b38063baecc4257d650ffb15ef/examples/project_atproto_dashboard/dbt_project/profiles.yml#L19" target="_blank" rel="noopener noreffer">GitHub</a></figcaption>
</figure>
<hr>
<pre class=""><em>Full article published at <a href="https://motherduck.com/blog/data-engineering-toolkit-essential-tools/" target="_blank" rel="noopener noreferrer">MotherDuck.com</a> - written as part of <a href="/services">my services</a></em></pre>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Although it&rsquo;s not 100% the same, WSL is a good option and alternative for using both Windows and Linux in one. As someone who has used WSL extensively: if you are expected to work mainly in Linux and on the command line, Linux or macOS are still the better options.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>This is also where MotherDuck makes all the difference: you experiment on a simple local machine and use the hybrid power of MotherDuck to scale up when needed, as Yuki <a href="https://www.amazon.com/Polars-Cookbook-practical-transform-manipulate-ebook/dp/B0CLRS4B8T" target="_blank" rel="noopener noreffer">shared</a>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
</item>
<item>
    <title>BI-as-Code and the New Era of GenBI</title>
    <link>https://www.ssp.sh/blog/bi-as-code-and-genbi/</link>
    <pubDate>Tue, 05 Nov 2024 17:19:32 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/bi-as-code-and-genbi/</guid><enclosure url="https://www.ssp.sh/blog/bi-as-code-and-genbi/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>Imagine creating business dashboards by simply describing what you want to see. No more clicking through complex interfaces or writing SQL queries - just have a conversation with AI about your data needs. This is the promise of Generative Business Intelligence (GenBI).</p>
<p>At its core, GenBI delivers an <strong>unreasonably effective human interface</strong>, where we iterate quickly, based on BI-as-Code. A simplified version looks like this:</p>
<div class="mermaid" id="id-1"></div>
<p>But what makes this possible? The key lies in the declarative BI stack as discussed in <a href="/blog/rise-of-declarative-data-stack/" rel="">Part 1</a> - where <strong>dashboards and metrics are defined as code</strong> (like <code>covid_dashboard.yaml</code>) rather than hidden behind graphical user interfaces. The declarative approach gives AI models the context they can understand and work with: structured definitions of business metrics, relationships between facts and dimensions, and visualizations.</p>
<p>In this article, we want to explore the possibilities of GenBI today.</p>
<h2 id="understanding-genbi">Understanding GenBI</h2>
<p>Generative business intelligence changes how people interact with data, enabling AI-driven analytics. By leveraging the power of Generative AI, GenBI makes analytics more accessible through new human interaction methods. Generative BI enables use cases from creating dashboards to <strong>natural language querying</strong> (typing or talking), all the way down to generating the data model if you only have the source tables. We describe what we need from a top-down, <a href="https://en.wikipedia.org/wiki/Conceptual_schema" target="_blank" rel="noopener noreffer">conceptual</a> idea, and GenBI generates dashboards, metrics, data models (snowflake/star schema, relationships, joins, even grains), entities, or even data warehouse architectures.</p>
<p>A key aspect of GenBI is its declarative nature, as we discussed in the <a href="/blog/rise-of-declarative-data-stack/" rel="">Declarative Data Stack</a>. Whether it covers the entire stack or not, one of the most critical layers is the <a href="https://www.ssp.sh/brain/metrics-layer/" target="_blank" rel="noopener noreffer">Metrics Layer</a>. Generative AI needs context to understand the data model and the company&rsquo;s business. The metrics layer and its semantics (hence also called the semantic layer) are essential for this process, as the metrics hold the semantic understanding. Let&rsquo;s see GenBI in action based on a declarative dashboard.</p>
<p>One step further: Data modeling languages such as LookML, MDX, MAQL, Malloy language, or SQL are critical. These languages allow humans to describe and model our metrics and KPIs, from which <strong>AI models</strong> can learn. They are expressive and declarative and define the company&rsquo;s business logic.</p>
<p>To illustrate how GenBI leverages a declarative dashboard, let&rsquo;s see how AI can automatically add a new measure for <code>average fare cost per mile</code> to our existing metrics within the text editor:</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/Th5Krj14DCI?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>In summary, combining AI models with BI tools allows business users to query data, generate reports, and derive insights using conversational language, making data analytics more accessible.</p>
<div class="details admonition question open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-question"></i>Is GenBI the new Self-Service BI?<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">Self-service BI has been around for a while, trying to enable business users to create dashboards and reports. But that transition has always been hard, and without SQL or even Python skills to wrangle and clean your data, it&rsquo;s hard. I think GenBI is the next attempt for Self-Service BI, which is very promising.</div>
        </div>
    </div>
<h3 id="evolution-from-traditional-bi-to-genbi">Evolution from Traditional BI to GenBI</h3>
<p>Before we discuss GenAI, let&rsquo;s understand the difference between today&rsquo;s BI and GenBI.</p>
<p>To grasp that, let&rsquo;s quickly follow the evolution of SQL and how it shaped BI tools. From traditional data marts and materialized views queried directly by the BI tool, we&rsquo;ve moved to data warehouses, lakes, or lakehouses built with tools such as dbt and data warehouse automation tools. Sometimes, we added an OLAP cube if we needed fast response times, typically modeled as OBT (one big table) or wide denormalized tables.</p>
<p>In all of the different ways, SQL has been at the heart, even more with complex data pipelines and semantic layers. Even <a href="https://en.wikipedia.org/wiki/Large_language_model" target="_blank" rel="noopener noreffer">Large Language Models (LLMs)</a> made it <a href="https://motherduck.com/blog/sql-llm-prompt-function-gpt-models/" target="_blank" rel="noopener noreffer">into SQL</a> statements.</p>
<p>With GenBI, this will not change, as BI tools still need to execute SQL at the end of the day. What is important is that the metrics and the SQL statements are stored declaratively. Therefore, it might be easier to work with YAML to maintain complex definitions, at least until we <a href="https://www.datacouncil.ai/talks/cubing-and-metrics-in-sql" target="_blank" rel="noopener noreffer">extend the SQL syntax for analytics</a>.</p>
<h3 id="how-genai-powers-bi-and-compare-to-genbi">How GenAI Powers BI and compare to GenBI</h3>
<p>The main difference between traditional BI and GenBI is understanding the <strong>semantics behind SQL</strong> with AI. What does that mean?</p>
<p>SQL is a declarative language; therefore, an AI model can learn from the queries. If we feed it more data, such as the metrics in a metrics layer that is part of the BI tool and, if available, the data model (DDL), the model gains a semantic understanding of the queries. If you have a declarative dashboard tool such as Rill, you can train the model on the dashboards so it also learns how we create them within the company.</p>
<p>GenAI has an LLM trained on our business intelligence artifacts. In return, it can generate visualizations in the form of dashboards and metrics in the form of SQL aggregation for us.</p>
<p>The other part of GenBI is what we use every day. With the rise of ChatGPT and similar tools, we can interface with our tools in a more human-like manner. Instead of writing SQL, we can speak or write natural language to query data or present it in a dashboard, compared to hand-crafting. These are two use cases, but the <strong>applications are endless</strong>, and new ways of interaction arise every day.</p>
<p>Ultimately, if we talk about GenBI, we must mention GenAI as the driving force and integrate AI technologies with BI tools and data sources. GenAI is a broad term encompassing all generative AI technologies, while <strong>GenBI is a specialized application</strong> of GenAI focused on business intelligence.</p>
<div class="details admonition note open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-note"></i>What is Semantic SQL?<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">In this context, it might be helpful to understand semantic SQL. It <strong>represents</strong> business concepts in SQL queries, making complex data accessible without requiring technical database knowledge. It gives business users direct access to data through simplified terminology. Abstract Syntax Threes (AST) can be used for visualization and deciphering the relationships, dependencies, and connections behind the queries. These enable further features like automated dependency tracking and cross-dialect compatibility.</div>
        </div>
    </div>
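<p>As a hedged sketch of what such AST-based tooling looks like in practice, an open-source library such as SQLGlot can parse a query, list the tables and columns it touches (useful for lineage), and transpile it to another dialect; the query itself is made up:</p>
<pre><code class="language-python">import sqlglot
from sqlglot import exp

sql = "SELECT customer_id, SUM(amount) AS revenue FROM orders GROUP BY customer_id"

# Parse the query into an abstract syntax tree
ast = sqlglot.parse_one(sql)

# Walk the tree to find referenced tables and columns
tables = [t.name for t in ast.find_all(exp.Table)]
columns = [c.name for c in ast.find_all(exp.Column)]
print(tables, columns)

# Transpile the same query from one dialect to another
print(sqlglot.transpile(sql, read="duckdb", write="postgres")[0])
</code></pre>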
<div class="details admonition tip open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-tip"></i>General Knowledge LLM speaks GenBI<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">OpenAI has trained LLMs with millions of StackOverflow posts, so you don&rsquo;t have to. Prompting an AI to generate SQL statements for common queries has become straightforward. They have been trained on the usual Stripe and Shopify data models and columns. For example, related GenBI implementations, such as GitHub&rsquo;s Copilot, offer similar capabilities and generate SQL and code within your GitHub environment. How we do that with BI is discovered in a later chapter, &ldquo;GenBI in action&rdquo;.</div>
        </div>
    </div>
<h2 id="bi-as-code-the-foundation-of-genbi">BI-as-Code: The Foundation of GenBI</h2>
<p>We&rsquo;ve gone through a long evolution of click-first dashboard tools, where everything you define lives within the UI. These tools produce unwieldy exports with hard-coded IDs and visual coordinates, often spanning thousands of lines of XML/JSON that are hard to version or modify systematically.</p>













  

























<figure>
<a target="_blank" href="/blog/bi-as-code-and-genbi/mermaid.jpg" title="/blog/bi-as-code-and-genbi/mermaid.jpg">

</a><figcaption class="image-caption">Traditional BI vs. BI-as-Code vs. GenBI</figcaption>
</figure>
<p>The benefits of each are clearly visible:</p>
<ol>
<li><strong>Graphical and traditional BI approach</strong>: Initially fast, but slow to iterate on and error-prone because of manual changes.</li>
<li><strong>Code-First Approach</strong>: Scales well with complexity. Works well with teams.</li>
<li><strong>GenBI</strong>: Instant interactions through general-knowledge LLMs. The best of both worlds; the natural-language interface makes it more approachable for businesses and users.</li>
</ol>
<h3 id="benefits-of-code-first-analytics">Benefits of Code-First Analytics</h3>
<p>Today, newer code-first approaches let you define dashboards declaratively, bringing all the advantages of the <a href="/blog/rise-of-declarative-data-stack/" rel="">Declarative Data Stack</a>: automation, versioning, and separation of business logic from implementation. This approach offers the best of both worlds – an intuitive UI for design while producing clean YAML definitions that can be versioned and bulk-modified with <code>search + replace</code> across all dashboards for example.</p>
<p>Code-first enables a <strong>smaller data team to do more</strong>. Think of the Ruby on Rails developer who works end-to-end, from changing the database schema to the front end of the app. Instead of multiple people understanding every little column of a Shopify database model, <strong>we can prompt</strong> the BI model that has been trained on the physical data tables and potentially hundreds of reference implementations from GitHub to generate a &ldquo;dashboard for the sum of orders per month&rdquo;. This will not be 100% accurate, but you might get 80% of the work done within a very short time.</p>
<p>Instead of generating just the dashboard, it could create a central metrics layer repository to enhance versioning and maintainability, which would improve governance.</p>
<p>But beyond maintainability - it creates a <strong>semantic foundation</strong> that AI can understand. When dashboards are defined in code, they explicitly declare:</p>
<ul>
<li>The metrics being visualized and how they&rsquo;re calculated</li>
<li>The relationships between different data dimensions</li>
<li>The business logic behind aggregations and transformations</li>
<li>The visual hierarchy and organization of information</li>
</ul>
<p>While traditional BI tools usually hide these relationships in opaque UI configurations, code-based definitions make them machine-readable and learnable. The metrics layer becomes a natural extension, where business definitions are codified consistently and versionable, creating the foundation for an AI model to understand the business context. This enables <strong>automation and the generation</strong> of visualization.</p>
<p>This is why <strong>declarative is a prerequisite to GenBI</strong>. While AI can <a href="https://www.anthropic.com/news/3-5-models-and-computer-use" target="_blank" rel="noopener noreffer">control visual interfaces</a>, this approach is inefficient for complex business logic. The data model joins, metric aggregations and <a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/kimball-data-warehouse-bus-architecture/" target="_blank" rel="noopener noreffer">bus matrix</a> require explicit declarations that AI can parse and understand; print-screened images are insufficient.</p>
<h3 id="how-would-a-genbi-workflow-look-like">How Would a GenBI Workflow Look Like?</h3>
<p>BI-as-code enables GenBI workflows. A potential workflow would involve a GitHub PR based on a GitHub repo as the persistence layer, where humans and AI brainstorm.</p>
<p>Considering the initial flow chart: a human creates the prompt within a PR, and the model generates its artifacts and commits them to it. The human <strong>analyzes, verifies, and iterates</strong> on the prompt until they are happy. When finished, they approve and merge the PR, which is then deployed to production.</p>
<p>This would allow for an excellent <strong>review stage</strong>, during which humans and AI can iterate with a persistent store on GitHub. This could work for visualizations (dashboards), business logic (metrics layer), and data models (DDL for data warehouses).</p>
<div class="details admonition tip open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-tip"></i>This cycle is similar to the Master Data Management process<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">In MDM, a person always approves the process. With AI, we validate the AI&rsquo;s generated code instead of human code or data. <a href="https://pages.cs.wisc.edu/~anhai/papers1/hilda18.pdf" target="_blank" rel="noopener noreffer">Human-in-the-loop data Analysis (HILDA)</a> approaches, which emphasize end-to-end systems and foster data-centric communities, support this human-AI collaborative approach to data management.</div>
        </div>
    </div>
<h3 id="from-conceptual-to-physical-data-model">From Conceptual to Physical Data Model</h3>
<p>If we zoom out, we can use GenBI to model the <a href="https://en.wikipedia.org/wiki/Conceptual_schema" target="_blank" rel="noopener noreffer">conceptual</a> all down to the <a href="https://en.wikipedia.org/wiki/Physical_schema" target="_blank" rel="noopener noreffer">physical layer</a> in a top-down approach.</p>
<div class="mermaid" id="id-2"></div>
<p>Mapping each stage to a BI artifact, with GenBI&rsquo;s strength being more to the left, simply because we will have more context:</p>
<ul>
<li>Conceptual = <strong>Dashboard generation</strong></li>
<li>Logical = <strong>Metrics layer modeling</strong></li>
<li>Physical = <strong>Table and Joins (FKs) modeling</strong></li>
</ul>
<h2 id="core-components-and-architecture">Core Components and Architecture</h2>
<p>If we look at the <strong>landscape of GenBI</strong>, it&rsquo;s not yet well defined. Below is an attempt to highlight the different components that interact with each other in greater detail to achieve GenBI.</p>













  

























<figure>
<a target="_blank" href="/blog/bi-as-code-and-genbi/genbi-architecture.jpg" title="/blog/bi-as-code-and-genbi/genbi-architecture.jpg">

</a><figcaption class="image-caption">GenBI Architecture and Components</figcaption>
</figure>
<p>The most critical core components of GenBI are:</p>
<ul>
<li><strong>Business Intelligence</strong>: BI-as-Code tool at the center and interface for human interactions. Declarative dashboards are stored with BI-as-Code and integrated with the <em>metrics layer</em>. The metrics layer serves as a repository of business definitions, including <em>measures and data model relationships</em> (joins, star/snowflake schema). Together with the declarative dashboards and metrics, they provide essential semantic context back to the AI engine for generating and modifying BI artifacts.</li>
<li><strong>GenBI Core</strong>: Consists of three key components working together:
<ul>
<li><em>Natural Language Interface</em>: Enables human-like interaction with the system</li>
<li><em>AI Engine</em>: Processes queries and understands the business context</li>
<li><em>BI-as-Code Generator</em>: Transforms AI understanding into concrete, declarative code artifacts like YAML dashboard definitions and metrics configurations</li>
</ul>
</li>
<li><strong>External Knowledge</strong>: Combines general knowledge from external LLMs with business-specific context through RAG (Retrieval-Augmented Generation). While LLMs provide a broad understanding, the RAG store enriches this with internal documents, legacy code, and company-specific data models.</li>
<li><strong>Data Sources</strong>: Serve as the foundation where the BI-as-Code Generator queries to validate and execute the generated artifacts against actual data. Represents any kind of database or files.</li>
</ul>
<h3 id="key-components-for-successful-genbi">Key Components for Successful GenBI</h3>
<p>To make BI successful, it needs to provide an instant overview of company performance to allow the leadership to make critical decisions fast. BI should boost data efficiency by automating repetitive tasks across the organization, such as updating daily active users and sending them to relevant stakeholders.</p>
<p>To achieve successful GenBI, we need to highlight at least two main components that need to be well integrated into GenAI to make BI successful:</p>
<ul>
<li><strong>Dashboards</strong>: Quickly visualize the hard work of data engineers, the ingestion, wrangling, cleaning, and transforming of the data pipeline for the business and users.</li>
<li><strong>Metrics Layer</strong>: Auto-generate definitions for standard metrics like monthly active users, year-to-date revenue, etc.</li>
</ul>
<p>If we can further automate this process, we will overcome one of the biggest problems of BI: the <strong>bottleneck</strong> of the data or BI teams to ship or publish data artifacts. Also, the BI stack usually requires skills different from those of a domain expert. We have won a lot if it empowers domain experts who understand the business, how it runs, and how to present the numbers to tell a story.</p>
<p>Presenting the numbers in a way that tells a story requires understanding the company and its operations, and it is usually where technical data professionals care least. A trained model that helps here would significantly boost productivity, and this is precisely where GenBI addresses these challenges.</p>
<h4 id="presentation-layer-dashboards">Presentation Layer: Dashboards</h4>
<p>Dashboards in GenBI serve as the primary interface between business insights and decision-makers. The iterative nature of dashboard development makes them particularly well-suited for AI assistance, with two distinct phases:</p>
<p>The <strong>initial creation</strong> process focuses on rapidly transforming business requirements into visualizations. GenBI excels here by understanding the context from the metrics layer and suggesting appropriate visualizations based on data characteristics and best practices.</p>
<p>GenBI truly shines in the <strong>iterative process</strong> of incorporating new features or data into the dashboard. It can provide feedback quickly, suggest improvements, and adapt visualizations based on user needs and data patterns. As dashboards are never truly finished, GenBI is an invaluable partner in continuous refinement, helping maintain consistency while reducing technical overhead.</p>
<div class="details admonition note open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-note"></i>Beautiful Visualization<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content"><p>Effective data visualization often produces clarity through simplicity rather than complexity. Methodologies such as <a href="https://www.ibcs.com/?taxonomy=product_shipping_class&amp;term=poster-su" target="_blank" rel="noopener noreffer">Hichert SUCCESS Rules</a> , <a href="https://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167" target="_blank" rel="noopener noreffer">Information Dashboard Design</a>, or the works of <a href="https://www.edwardtufte.com/books/" target="_blank" rel="noopener noreffer">Edward Tufte</a> for business reporting. Key principles include:</p>
<ul>
<li>Use color strategically and sparingly</li>
<li>Emphasize important data through selective highlighting</li>
<li>Choose appropriate chart types for your data</li>
<li>Employ direct labeling when possible</li>
<li>Consider small multiples for complex comparisons</li>
</ul></div>
        </div>
    </div>
<h4 id="the-power-of-the-metrics-layer">The Power of the Metrics Layer</h4>
<p>The metrics layer serves as GenBI&rsquo;s semantic foundation, providing a declarative way to define and maintain core business definitions, including measures and data model relationships. GenBI creates a single source of truth that both humans and AI models can understand by centralizing these definitions in code. This is essential for consistently interpreting business logic across the organization, from simple metrics to complex star/snowflake schema relationships.</p>
<p>Beyond basic measure definitions, the metrics layer captures the semantic relationships between business concepts through explicit <strong>data model definitions</strong> like joins and fact/dimension relationships. This structured context enables the AI engine to understand individual metrics and how they relate to each other and the underlying data model. This comprehensive semantic understanding creates a <strong>powerful foundation for the AI engine</strong> when combined with declarative dashboards from the BI-as-Code tool.</p>
<p>The engine can generate contextually appropriate visualizations, suggest relevant metrics for specific business questions, and understand complex business calculations and their relationships. It can even auto-generate new metric definitions based on existing patterns while validating generated artifacts against the actual data model, ensuring consistency and accuracy in the business intelligence layer.</p>
<div class="details admonition note open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-note"></i>High-Performance Analytics Backend<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content"><p>If I had to add a third candidate, it would be an ultra-fast OLAP query backend to produce useful BI and AI responses. Modern OLAP engines like ClickHouse, Druid, Cube, and DuckDB provide the necessary speed and efficiency to handle complex analytical queries in real time, ensuring that GenBI systems can maintain interactive response times even when processing large datasets.</p>
<p>This performance capability is crucial for maintaining a fluid conversation between the users and the GenBI system. It allows for rapid iteration and refinement of analyses without interrupting the natural flow of exploration.</p>
</div>
        </div>
    </div>
<h2 id="genbi-in-action">GenBI in Action</h2>
<p>Let&rsquo;s examine GenBI&rsquo;s practical implications through example prompts. Later, we will look at an actual implementation with Rill.</p>
<h3 id="prompts-for-dashboard-creation">Prompts for Dashboard Creation</h3>
<p>Let&rsquo;s start with the user perspective and the prompts. Below are typical natural language requests (whether written or spoken) on how to &ldquo;prompt engineer&rdquo; a beautiful, fully functional dashboard. These prompts range from simple to advanced to interactive use cases that GenBI can help us with.</p>
<p>From <strong>simple</strong> metrics and visualizations:</p>
<blockquote>Create a metric for revenue based on data in my ORDERS table.</blockquote> <blockquote>Create a chart showing order revenue broken down by product category in the last year.</blockquote> <blockquote>Create a visualization of revenue to sales goal this month, and set the sales goal at $28,283.</blockquote>
<p>To more <strong>formatting</strong> and <strong>style update</strong> refinements:</p>
<blockquote>Optimize for a cleaner visual aesthetic based on Edward Tufte's design guidelines.</blockquote> <blockquote>Update on-brand color palette for this visualization based on the colors from our logo (upload logo)</blockquote>
<p>Create more <strong>advanced</strong> dashboards with a specific focus:</p>
<blockquote>Create a cohort analysis dashboard showing customer retention rates over 12 months, with the ability to filter by acquisition channel and highlight cohorts that exceeded 80% retention</blockquote> <blockquote>Build a drill-down capable sales dashboard that starts with global performance but allows users to click through to regional, store, and individual product level metrics, maintaining consistent visual language throughout the hierarchy</blockquote>
<p>Moving beyond basic visualizations, you can <strong>iteratively refine</strong> dashboards:</p>
<blockquote>1. Add a detailed table below the main "trend chart" showing monthly breakdowns<br>2. Update the color scheme to use black for current year data and gray for previous year comparisons<br>3. Add sparklines to the table columns showing 6-month trends<br>4. Enable drill-down on product categories to show individual SKU performance<br>5. Add conditional formatting to highlight values that are >10% below target in red<br>6. Create a collapsible section for additional metrics like margin and inventory turnover<br>7. Add hover tooltips showing YoY growth percentages</blockquote>
<p>These are examples of prompt dashboards. Let&rsquo;s look at metric-specific generations.</p>
<h3 id="prompts-for-metrics-generation">Prompts for Metrics Generation</h3>
<blockquote>Exclude all refunds and returns from my revenue metric.</blockquote>
<p>From <strong>basic</strong> metric definitions:</p>
<blockquote>Create a metric for total revenue as the sum of order prices, define monthly active users based on our users table and calculate the average order value from our orders table</blockquote>
<p>Generate star-schema <strong>data model metrics</strong> based on source DDL:</p>
<blockquote><p>Analyze my source table DDLs from Stripe (`payments, subscriptions, customers`), SAP (`orders, inventory, suppliers`), and our CRM (`interactions, support_tickets`) to create a consolidated dimensional model.</p>
<p>Generate appropriate fact and dimension tables, define key metrics like revenue, customer lifetime value, and order frequency, and include common aggregations and filters.</p></blockquote>
<p><strong>Iterative refinement</strong> through conversation:</p>
<div class="details admonition quote open">
        <div class="details-summary admonition-title admonition-title-none"></div>
        <div class="details-content">
            <div class="admonition-content">Human: &ldquo;Create a revenue metric from our subscription and one-time payments&rdquo;<br>
GenBI: &ldquo;How about: <code>SUM(CASE WHEN type = 'subscription' THEN mrr * months ELSE amount END)</code>&rdquo;<br>
Human: &ldquo;Good, but we need to handle refunds and currencies&rdquo;<br>
GenBI: &ldquo;Updated to: <code>SUM(CASE WHEN type = 'subscription' THEN mrr * months ELSE amount END * exchange_rate) FILTER (WHERE status != 'refunded')</code>&rdquo;<br>
Human: &ldquo;Perfect, add it to the metrics layer as <code>net_revenue</code>&rdquo;</div>
        </div>
    </div>
<p>These examples showcase the range from simple aggregations to complex business logic while highlighting the conversational nature of GenBI and its ability to understand and implement business requirements iteratively.</p>
<h3 id="practical-implementation-with-rill-developer">Practical Implementation with Rill Developer</h3>
<p>Today, <a href="https://docs.rilldata.com/" target="_blank" rel="noopener noreffer">Rill Developer</a> uses GenAI to create dashboards or metrics based on your data sources or models. It also stores all your sources, models (SQL queries), and dashboards in YAML and has a live editor for visually changing them.</p>
<p>You can right-click on your model to use AI with the comfort of your BI tools (this works in the OSS and Cloud version):</p>













  
<figure><a target="_blank" href="/blog/bi-as-code-and-genbi/rill-genai.webp" title="">

</a><figcaption class="image-caption">Rill GenAI in Action</figcaption>
</figure>
<p>If you click &ldquo;Generate dashboard&rdquo;, the result looks like this, based on the infamous NYC taxi dataset. Notice the beautiful <a href="https://www.rilldata.com/blog/introducing-the-rill-pivot-table" target="_blank" rel="noopener noreffer">pivot table</a> at the end:</p>
<p>




</p>
<p>This implementation demonstrates the <a href="https://www.rilldata.com/blog/one-click-dashboards-with-generative-ai-and-bi-as-code" target="_blank" rel="noopener noreffer">future of one-click dashboards with generative AI</a>, where LLMs like GPT can generate and modify dashboard definitions directly in code.</p>
<p>Rill’s BI-as-code philosophy means dashboards are defined entirely in code, and large language models, like OpenAI’s GPT series, can generate and modify these definitions. Rill today <strong>makes an OpenAI call</strong> to get domain-specific understanding, e.g., to detect whether the data model belongs to a particular industry.</p>
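<p>The general pattern behind such a call is straightforward. The following is only a hedged sketch of the idea, not Rill&rsquo;s actual implementation; the model name, prompt, and DDL are assumptions:</p>
<pre><code class="language-python">from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical context: the DDL of the model we want a dashboard for
ddl = "CREATE TABLE trips (pickup_time TIMESTAMP, fare DOUBLE, distance DOUBLE)"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[
        {"role": "system", "content": "You generate BI-as-code dashboard definitions."},
        {"role": "user", "content": f"Suggest measures and a dashboard spec for: {ddl}"},
    ],
)

# The generated definition can then be reviewed and committed to Git
print(response.choices[0].message.content)
</code></pre>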
<p>Incorporating code generation into development environments makes the tool much more user-friendly and can significantly improve a single BI engineer&rsquo;s productivity. Rill Developer also provides software engineers with a fast feedback loop, enabling them to edit code rapidly, visualize metrics, and instantly preview dashboards before deploying them into <a href="https://www.rilldata.com/product" target="_blank" rel="noopener noreffer">Rill Cloud</a> or a self-managed environment.</p>
<p>What else does Rill Developer bring to the table? Rill is a CLI-first BI tool <strong>installed with a single command</strong>, which opens up all kinds of new use cases, like embedding a complete GenBI tool as part of your data pipeline; traditional BI tools won&rsquo;t be able to do that any time soon. With that, its declarative approach, and its integrations with the most common OLAP engines and <a href="https://docs.rilldata.com/build/connect/" target="_blank" rel="noopener noreffer">common sources</a>, it meets all GenBI requirements.</p>
<h4 id="get-started-with-rill-developer-and-genbi-today">Get Started with Rill Developer and GenBI Today</h4>
<p>One-line installation:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">curl https://rill.sh <span class="p">|</span> sh
</span></span></code></pre></td></tr></table>
</div>
</div><p>To run as a sample project:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">rill start my-rill-project
</span></span></code></pre></td></tr></table>
</div>
</div><p>Check the <a href="https://docs.rilldata.com/home/get-started" target="_blank" rel="noopener noreffer">Quick Start</a> for more information.</p>
<h2 id="future-of-genbi">Future of GenBI</h2>
<p>After seeing what GenBI is, we now understand the evolution from traditional BI to GenBI, using BI-as-Code as the foundation to enable powerful AI use cases that generate full dashboards and metrics based on existing data sources. We&rsquo;ve seen the core components GenBI needs and how GenBI looks in action, both from the user perspective with prompts and from the implementation side with actual products that have implemented GenBI. For businesses, this means faster time-to-insight and democratized access to data analytics without sacrificing the robustness of traditional BI approaches.</p>
<p>So, what&rsquo;s the future of GenBI? While this is difficult to predict, I&rsquo;m confident Rill Developer will be at the forefront, enabling users with simple business requirements to access the entire BI stack—maybe even beyond. In the following article, we&rsquo;ll look at how a declarative data stack could be implemented, integrating the BI part and end-to-end what we discussed in Part 1 and the <a href="/blog/rise-of-declarative-data-stack/" rel="">Rise of the Declarative Data Stack</a>.</p>
<hr>
<pre class=""><em>Full article published at <a href="https://www.rilldata.com/blog/bi-as-code-and-the-new-era-of-genbi" target="_blank" rel="noopener noreferrer">Rilldata.com</a> - written as part of <a href="/services">my services</a></em></pre>
]]></description>
</item>
<item>
    <title>The Rise of the Declarative Data Stack</title>
    <link>https://www.ssp.sh/blog/rise-of-declarative-data-stack/</link>
    <pubDate>Wed, 16 Oct 2024 16:19:32 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/rise-of-declarative-data-stack/</guid><enclosure url="https://www.ssp.sh/blog/rise-of-declarative-data-stack/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>Data stacks have come a long way, evolving from monolithic, one-fits-all systems like Oracle/SAP to today&rsquo;s modular open data stacks. This begs the question, what&rsquo;s next? Or why is the current not meeting our needs?</p>
<p>As we see more analytics engineering and software best practices, embracing codeful, Git-based, and more CLI-based workflows, the future looks more code-first. Beyond SQL transformations, across the entire data stack. From ingestion to transformation, orchestration, and measures in dashboards—all defined declaratively.</p>
<p>But what does this shift towards declarative data stacks mean? How does it change how we build and manage data stacks? And what are the implications for us data professionals? Let&rsquo;s find out in this article.</p>
<h2 id="a-brief-history-of-declarative-systems">A Brief History of Declarative Systems</h2>
<p>Often, we forget how hard it was in the old days.</p>
<p>Let&rsquo;s take <a href="https://en.wikipedia.org/wiki/Fortran" target="_blank" rel="noopener noreffer">Fortran</a>, one of the earliest high-level programming languages. It was revolutionary for its time, but it required programmers to think in terms of the computer&rsquo;s architecture and know the ins and outs of everything.</p>
<p>Or SQL: did you know it&rsquo;s declarative? Or Markdown, HTML, or Kubernetes? Most of them are growing in enthusiasm and support each year<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>, but why? The table below gives you an idea of how declarative code is more straightforward, whereas imperative code is verbose and hard to read.</p>
<table>
  <thead>
    <tr>
      <th>Comparison</th>
      <th>Declarative</th>
      <th>Imperative</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Data Transformation (NumPy vs. Fortran)</strong></td>
      <td><pre><code>import numpy as np
result = np.mean(np.array
  ([1, 2, 3, 4, 5]))</code></pre></td>
      <td><pre><code>PROGRAM AVERAGE
  REAL :: numbers(5), sum, average
  DATA numbers /1.0, 2.0, 3.0, 4.0, 5.0/
  sum = 0.0
  DO i = 1, 5
    sum = sum + numbers(i)
  END DO
  average = sum / 5
  PRINT *, 'Average is:', average
END PROGRAM AVERAGE</code></pre></td>
    </tr>
    <tr>
      <td><strong>Text Formatting (Markdown vs. Manual Text Manipulation)</strong></td>
      <td><pre><code>**Bold text**</code></pre></td>
      <td><pre><code>let text = "Bold text";
console.log("\x1b[1m" 
+ text + "\x1b[0m");</code></pre></td>
    </tr>
    <tr>
      <td><strong>Data Query (SQL vs. Procedural Code)</strong></td>
      <td><pre><code>SELECT * FROM users 
WHERE age &gt; 18;</code></pre></td>
      <td><pre><code>for user in users:
    if user.age &gt; 18:
        result.append(user)</code></pre></td>
    </tr>
    <tr>
      <td><strong>Container Orchestration (Kubernetes vs. Manual Setup)</strong></td>
      <td><pre><code>kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:1.0</code></pre></td>
      <td>(Multiple manual steps to set up and manage containers across multiple servers)</td>
    </tr>
  </tbody>
</table>
<p>Declarative approaches that keep gaining ground include:</p>
<ol>
<li>SQL for database queries</li>
<li>Functional programming languages</li>
<li>Declarative UI frameworks (e.g. Rill, Evidence, React)</li>
<li>Configuration management tools (e.g. Kubernetes)</li>
</ol>
<h3 id="examples-of-declarative-data-stack-airflowdbtkubernetesrill">Examples of Declarative Data Stack (Airflow/dbt/Kubernetes/Rill)</h3>
<p>With Kubernetes, you declare what you want, and the system will bring it to that state. This means the how-to is abstracted away, and you can focus on the <em>what</em> of your app.</p>
<p>On the other hand, using <strong>Airflow</strong>, you define every little step of the how inside your DAG; mixing business logic with technical boilerplate makes it inherently complicated and hard to decouple and get clarity in your runtime pipelines.</p>
<p>If you look at <strong>Dagster</strong>, on the other hand, you have &ldquo;only&rdquo; the business logic (what you want to do as part of your DAG), and everything else is outsourced into <a href="https://docs.dagster.io/concepts/resources" target="_blank" rel="noopener noreffer">resources</a> or <a href="https://docs.dagster.io/concepts/dagster-pipes" target="_blank" rel="noopener noreffer">Dagster Pipes</a>. Defining your compute with a <a href="https://github.com/dagster-io/dagster-modal-demo/blob/60f889cb7ce15fb09da4680bf314cf1d35095d7d/dagster_modal_demo/pipeline_factory.py#L74" target="_blank" rel="noopener noreffer">simple annotation</a> such as <code>compute_kind=&quot;modal&quot;</code> or <code>compute_kind=&quot;spark&quot;</code>, without needing to care whether it runs locally or in production, shows you the power of declarative, with the side effect that you need to know your tools.</p>
<p>Comparing <strong>dbt with SQLMesh</strong> is another example: SQLMesh understands and parses the SQL with <a href="https://github.com/tobymao/sqlglot" target="_blank" rel="noopener noreffer">SQLGlot</a> to gain a semantic understanding of what the SQL does, which enables powerful context-aware features that dbt won&rsquo;t be able to offer.</p>
<p>Semantic Layers and BI tools define metrics within YAML (<strong>Cube, Rill</strong>), dashboards can be created entirely in Markdown (Evidence), and data pipelines are created on top of YAML (Kestra, Dagster).</p>
<p>Call it Infrastructure as Code, Visualization as Code, BI as Code, or anything as code. YAML engineering, or DSL (Domain Specific Language)—no matter what, it&rsquo;s the way the data stack evolved to <strong>empower non-programmers</strong> to create data pipelines and infrastructure without losing the benefits of SW engineering best practices through declarative methods.</p>
<h2 id="what-is-a-declarative-data-stack">What Is a Declarative Data Stack?</h2>
<p>A declarative data stack is a set of tools and, more precisely, their configurations; taken together, they can be thought of as a <strong>single function</strong> such as <code>run_stack(serve(transform(ingest)))</code> that can recreate the entire data stack.</p>
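<p>As a purely conceptual sketch (all names hypothetical), the idea is that every stage is a function of declarative config, so running the composed function reproduces the whole stack:</p>
<pre><code class="language-python">def ingest(config):
    # In reality: read the sources declared in config["sources"]
    return [{"order_id": 1, "amount": 42.0}]

def transform(rows, config):
    # In reality: apply the SQL/YAML models declared in config["models"]
    return {"revenue": sum(row["amount"] for row in rows)}

def serve(metrics, config):
    # In reality: render the dashboards declared in config["dashboards"]
    print("dashboard:", metrics)

def run_stack(config):
    # The whole stack as a single, re-runnable function of its configuration
    serve(transform(ingest(config), config), config)

run_stack({"sources": [], "models": [], "dashboards": []})
</code></pre>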
<p>Instead of having one framework for one piece, we want a combination of multiple tools combined into a single <em>declarative</em> data stack. Like the Modern Data Stack, but integrated the way Kubernetes integrates all infrastructure into a <strong>single deployment</strong>, like YAML.</p>
<p>We focus on the end-to-end <a href="https://ssp.sh/brain/data-engineering-lifecycle" target="_blank" rel="noopener noreffer">Data Engineering Lifecycle</a>, from ingestion to visualization. But what does the combination with declarative mean? Think of <a href="https://ssp.sh/brain/functional-data-engineering" target="_blank" rel="noopener noreffer">Functional Data Engineering</a>, which leaves us in a place of <strong>confident reproducibility</strong> with few side effects (hopefully none) and uses <a href="https://www.ssp.sh/brain/idempotency" target="_blank" rel="noopener noreffer">idempotency</a> to restart functions and reinstate a particular state with conviction, or to roll back to a specific version.</p>
<div class="details admonition note open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-note"></i>Doesn&#39;t Exist Today<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">This doesn&rsquo;t exist like this today; some platforms come close, but they are either closed-source or only cover a subset of the complete data stack. See more on this in &ldquo;What are the Alternatives?&rdquo; in Part 2.</div>
        </div>
    </div>
<h2 id="why-a-declarative-data-stack">Why a Declarative Data Stack</h2>
<p>The opposite would be an imperative data stack. The imperative way works well for simple, homogeneous systems. It is flexible and gives us a high level of control. However, it has the downside of hard-to-manage state, so automatic integrity checks or failure management are needed.</p>
<p>But it&rsquo;s not the right choice if you combine a heterogeneous data stack into a <strong>single data platform</strong>. The declarative approach manages complexity in a simple interface for non-technical users by abstracting away implementation details and focusing on the outcomes instead. It builds the foundation for a consistent, reproducible data platform across diverse components.</p>
<p>A declarative data stack cleanly separates semantics from implementation, <strong>decoupling</strong> the business code from technical implementation, the <em>how</em> is rendered by the data stack engine. This gives us a <strong>stateless stack</strong> that can be recreated from scratch with its configuration, setting aside intermediate data assets (stateful) produced along the way.</p>
<p>Think of Markdown. It is <strong>universally portable</strong>, and almost every application can render it with its various engines, such as HackMD, Obsidian, iA Writer, Neovim, and GitHub, to name a few. There is no need to fiddle with setting things up or configuring, as the Markup language defines everything—the engine decides/handles the rendering. The best part is that even images can be declaratively defined with <a href="https://www.ssp.sh/brain/mermaid" target="_blank" rel="noopener noreffer">Mermaid</a>. Similarly, the declarative stack is programming language agnostic.</p>
<p>Let&rsquo;s compare when to use which stack.</p>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Declarative Data Stack</th>
          <th>Imperative Data Stack Management</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Core Concept</strong></td>
          <td>Define &lsquo;what&rsquo; you want, the system handles &lsquo;how&rsquo;</td>
          <td>Specify each step and process explicitly</td>
      </tr>
      <tr>
          <td><strong>Key Benefits</strong></td>
          <td>- Single deployment file for entire stack<br>- Version-controlled infrastructure<br>- Reproducible environments<br>- Separation of business logic from technical details</td>
          <td>- Granular control over components<br>- Flexibility for unique scenarios<br>- Easier optimization of individual parts</td>
      </tr>
      <tr>
          <td><strong>Challenges</strong></td>
          <td>- Learning curve for declarative syntax<br>- May require custom extensions for specific tools<br>- Debugging can be more complex<br></td>
          <td>- Manual configuration and maintenance<br>- Risk of configuration drift<br>- Time-consuming setup and modifications<br>- Inconsistencies between environments<br>- Hard to track changes and rollback</td>
      </tr>
      <tr>
          <td><strong>Empowerment</strong></td>
          <td>- Enables less technical users to manage complex data systems:<br>- Infrastructure as Code for data<br>- Automated dependency management<br>- Built-in data governance and security<br>- Integrated monitoring and observability</td>
          <td>Requires more technical expertise across the stack:<br>- Direct access to underlying systems</td>
      </tr>
      <tr>
          <td><strong>Best For</strong></td>
          <td>- Large-scale, consistent data operations<br>- Teams adopting software engineering practices (frequent iterations)<br>- Organizations prioritizing governance and compliance</td>
          <td>- Small to medium-scale projects<br>- Highly customized or unique workflows</td>
      </tr>
  </tbody>
</table>
<h3 id="yaml-the-language-of-declarative-configuration">YAML: The Language of Declarative Configuration</h3>
<p>YAML, originally short for &ldquo;Yet Another Markup Language&rdquo; and later redefined as &ldquo;YAML Ain&rsquo;t Markup Language&rdquo;, has become the configuration language for most modern tools. The reasons are simple: compared to its predecessors, XML and JSON, which are still widely used, YAML is <strong>less verbose</strong>.</p>
<p>Image comparison between XML, JSON, and YAML</p>
<p>Besides their different reasons for use, XML is designed to support structured documents, while JSON aims to be simple, universal, and quick to process; JSON has mainly been used for small data sets and REST services. YAML has similar aims but tries to be a superset of JSON, so every valid JSON file is effectively also a valid YAML file.</p>
<p>But why is YAML used for declarative configurations? It supports lists and dictionaries with almost no syntactic overhead and is optimized for reading even extended configurations. Its descriptive structure, portable across different programming languages, and its clear interface make it easy to maintain, read, and modify, which makes it well-suited for this task.</p>
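<p>As a small, hypothetical sketch (the keys are made up for illustration), this is what such a configuration looks like as YAML and how it becomes plain lists and dictionaries once parsed, here with PyYAML:</p>
<pre><code class="language-python">import yaml  # PyYAML

config = yaml.safe_load("""
ingest:
  source: duckdb
  query: SELECT * FROM 's3://my-source-bucket/sales/*.parquet'
transform:
  groupby: country
  aggregate: {cases: sum, deaths: sum}
serve:
  template: covid_dashboard.md
""")

# The declarative config becomes plain Python dicts and lists.
print(config["transform"]["aggregate"])  # {'cases': 'sum', 'deaths': 'sum'}
</code></pre>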
<p>YAML is also used when implementing a <a href="https://en.wikipedia.org/wiki/Domain-specific_language" target="_blank" rel="noopener noreffer">DSL (Domain-Specific Language)</a> in your application. A DSL abstracts the complexity behind a system; unlike a general-purpose language aimed at any software problem, it targets one specific domain. Like Markdown, or HTML as another example, it&rsquo;s programming language agnostic: the engine, here the browser, does not care how you generated the HTML; it knows what to do with it. That&rsquo;s the goal of a DSL, too.</p>
<div class="details admonition example open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-example"></i>Longevity<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">Also, if you are declarative, you can make the code work for a very long time. For example, the browser still renders Craigslist&rsquo;s HTML from 1995, and that&rsquo;s only possible to this day because HTML is declarative.</div>
        </div>
    </div>
<h2 id="which-components-form-an-end-to-end-data-stack">Which Components Form an End-to-End Data Stack?</h2>
<p>For it to be a complete data stack, we need to integrate data from its source systems, transform, aggregate, and clean data, and ultimately serve and visualize it, solving the <a href="https://www.dedp.online/part-1/1-introduction/challenges-in-data-engineering.html" target="_blank" rel="noopener noreffer">core challenges</a> in data engineering.</p>
<p>We cover all of these with the Data Engineering Lifecycle. Let&rsquo;s go through them one by one.</p>
<p>Illustration of the data engineering lifecycle from the book <a href="https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/" target="_blank" rel="noopener noreffer">Fundamentals of Data Engineering</a></p>
<h3 id="ingestion">Ingestion</h3>
<p>Ingestion is the part that integrates the data sources into your data stack. Usually, your source types are OLTP databases, some files in an S3 bucket, or APIs.</p>
<p>Naturally, this is a very imperative process. You write a bunch of code, historically with procedural code or, in the modern world, with Python. But all of it is writing the integration code, meaning the <em>how</em>. So, how do we get to a more declarative approach? Wasn&rsquo;t there a declarative language that is quite popular?</p>
<p>Yes, SQL. SQL is declarative. You say which columns (<code>SELECT columns</code>) from which table (<code>FROM table</code>) and with some criteria (<code>WHERE</code>) and some aggregations (<code>GROUP BY</code>). The rest is done by the SQL engine, typically the database, the Spark cluster, or anything else.</p>
<p>The most straightforward way is using the SQL approach and wrapping <code>select * from s3://my-source-bucket/sales/*.parquet</code> into an orchestrator, or using something like <a href="https://github.com/dlt-hub/dlt" target="_blank" rel="noopener noreffer">dlt</a>. dlt is, per se, not a declarative tool (except for its REST API source), but you can use it in a declarative fashion, e.g., configure a YAML file to define all the columns and tables you want to ingest. Or you can use Airbyte and Dagster to create sources and destinations declaratively, essentially <a href="https://www.ssp.sh/blog/data-integration-as-code-airbyte-dbt-python-dagster/" target="_blank" rel="noopener noreffer">Data Integration as Code</a>. <a href="https://github.com/ibis-project/ibis" target="_blank" rel="noopener noreffer">Ibis</a> is another lightweight, universal interface for data transformation; it uses Python and is therefore not declarative out of the box. However, it lets you interchange your SQL engine while writing the code only once.</p>
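<p>A hedged sketch of that SQL-first ingestion with DuckDB as the engine (the bucket path is the example from above; actually reading from S3 would additionally require DuckDB&rsquo;s httpfs extension and credentials):</p>
<pre><code class="language-python">import duckdb

con = duckdb.connect("warehouse.duckdb")

# Declarative ingestion: we state *what* to read; DuckDB figures out *how*
# to scan, parallelize, and type the Parquet files.
con.sql("""
    CREATE OR REPLACE TABLE raw_sales AS
    SELECT * FROM 's3://my-source-bucket/sales/*.parquet'
""")
</code></pre>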
<h3 id="transformation">Transformation</h3>
<p>The most significant layer of any data stack is the transformation or ETL part. As this is primarily done in SQL, we are fine here, right?</p>
<p>Not really. As SQL has limitations (no variables, lots of duplication, hard to enforce best practices), many have come to love dbt. dbt is a nice wrapper around SQL that gives you superpowers through templating with <a href="https://www.ssp.sh/brain/jinja-template/" target="_blank" rel="noopener noreffer">Jinja</a>, plus goodies such as documentation including lineage, versioning, and automation of your otherwise scattered SQL.</p>
<p>But with that, we are back to an imperative way. As dbt doesn&rsquo;t know anything about its SQL, it <strong>just runs</strong> it; it&rsquo;s limited in doing anything data-aware.</p>
<p>Newer tools, especially <a href="https://github.com/TobikoData/sqlmesh" target="_blank" rel="noopener noreffer">SQLMesh</a> and its open-source SQL parser SQLGlot, can analyze queries, traverse expression trees, and programmatically build SQL. We can also use them for declarative transformation. Especially with SQLMesh, we get a better understanding of semantics. This will give us the power to describe our transformations declaratively, even with live compiler checks, before running anything.</p>
<p>With that, it can automatically pre-detect breaking changes and run any missing backfill. The best thing is that SQLMesh is backward compatible with dbt.</p>
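<p>A small sketch of what &ldquo;understanding the SQL&rdquo; enables, using SQLGlot directly (the query is made up; the exact output can differ between versions):</p>
<pre><code class="language-python">import sqlglot
from sqlglot import exp

sql = "SELECT country, sum(cases) AS cases FROM covid GROUP BY country"
tree = sqlglot.parse_one(sql)

# Because the query is a parsed expression tree instead of an opaque string,
# a tool can ask semantic questions about it:
print([t.name for t in tree.find_all(exp.Table)])   # tables referenced
print([c.name for c in tree.find_all(exp.Column)])  # columns referenced

# ...or transpile it to another dialect without rewriting it by hand.
print(sqlglot.transpile(sql, read="duckdb", write="spark")[0])
</code></pre>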
<h3 id="serving">Serving</h3>
<p>Visualizing and serving data in the correct format, whether as a data app, AI, or dashboard, is essential in your data stack. No matter how good your data quality and insights are, the effort spent on cleaning and transforming is only well spent if you can present the results in an easy-to-understand way.</p>
<p>Speaking of waste, the second most wasteful thing you can do is recreate the same dashboard repeatedly, changing only the measure ever so slightly. This is where the power of declarative kicks in for visualization. What if you could copy and paste the definition of a dashboard and just replace the measure? That is precisely what we do with Rill or Evidence (interestingly, with Markdown).</p>
<p>Here&rsquo;s a declarative dashboard example in Rill, defined entirely in YAML.</p>
<p>This example, <strong>visualization as code</strong>, takes this declarative approach further, allowing us to define visualizations using structured data modeling syntax. The <a href="https://towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149" target="_blank" rel="noopener noreffer">Grammar of Graphics</a> is a related concept, a theoretical framework that breaks down statistical graphics into semantic components. Tools like ggplot2 in R and <a href="https://vega.github.io/vega-lite/" target="_blank" rel="noopener noreffer">Vega-Lite</a> implement this grammar, enabling users to describe the relationships between data and visual elements rather than specifying how to draw the chart.</p>
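<p>As a hedged illustration of that grammar in Python, here is a tiny Altair sketch (Altair emits Vega-Lite specs; the data is made up): we declare the mapping from data to visual channels, and the engine decides how to draw it.</p>
<pre><code class="language-python">import altair as alt
import pandas as pd

df = pd.DataFrame({"country": ["CH", "DE", "FR"], "cases": [120, 450, 310]})

# Declarative: encode the x/y channels; rendering is left to the Vega-Lite engine.
chart = alt.Chart(df).mark_bar().encode(x="country", y="cases")
chart.save("cases_by_country.html")
</code></pre>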
<p>By embracing these declarative approaches to visualization, we can create more maintainable, flexible, and robust data presentations as part of our declarative data stack, focusing on what we want to communicate rather than the intricacies of how to draw it.</p>
<p>An excellent feature that can be integrated into a declarative approach is <a href="https://docs.rilldata.com/reference/project-files/rill-yaml#testing-access-policies" target="_blank" rel="noopener noreffer">access permissions</a> with <strong>access policies</strong>.</p>
<div class="details admonition tip open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-tip"></i>A Fascinating Idea Is to Extend SQL with Analytical Capabilities<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">As Julian Hyde explains in his talks about Cubing and essentially <a href="https://www.youtube.com/watch?v=oo1uwJ3qHwE" target="_blank" rel="noopener noreffer">Extending SQL for Analytics</a>, SQL could be extended to incorporate a metrics or semantic layer directly into the language. This extension includes adding measures to SQL tables, allowing queries to return measures (expressions) instead of values, and introducing new syntax for cross-dimensional calculations. These extensions make complex analytical queries as concise and intuitive as natural language questions, potentially transforming how we approach data analysis and business intelligence. This approach aligns with the declarative nature of SQL and could further bridge the gap between data storage, analysis, and visualization in a unified, declarative framework.</div>
        </div>
    </div>
<div class="details admonition note open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-note"></i>How Does a Semantic Layer Play Into This?<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">A <a href="https://ssp.sh/brain/semantic-layer" target="_blank" rel="noopener noreffer">Semantic Layer</a> serves SQL and visualizes measures, and it can be a way of declaratively defining metrics and dimensions.</div>
        </div>
    </div>
<h3 id="undercurrents">Undercurrents</h3>
<p>In addition to these three main components of a data stack, there are some undercurrents across the data stack. We&rsquo;ll focus only on three and keep it brief, as this could be another blog post.</p>
<h4 id="orchestration">Orchestration</h4>
<p>Orchestration is the central piece that manages all moving parts. If you will, the orchestrator is the <strong>engine</strong> in which the <em>how</em> of your declarative data stack gets implemented. It is where you write the technical logic of fetching data from an API service, partition your data to optimize speed, and integrate the different tools of your stack into a <strong>single data platform</strong>.</p>
<p>Orchestrators that already conform to a declarative way are Dagster, Kestra, Kubeflow, and many more. But as with browsers, you can have different engines: different orchestration tools, or even self-written code and small lambda functions. All of this is up to you as the data stack engine architect.</p>
<h4 id="security">Security</h4>
<p>Security would greatly benefit from a declarative data stack, as you can configure your access rights at a central place and read that configuration for each of the components of your data stack instead of implementing it again at each layer.</p>
<h4 id="dataops">DataOps</h4>
<p>DataOps, as part of the data engineering lifecycle, could be seen as the place to manage your data stack engine. If we look at what DataOps is, a combination of efficiency, velocity, usability, and automation, this is precisely what the declarative data stack is about.</p>
<h2 id="what-about-the-data-assets">What About the Data Assets?</h2>
<p>Before we wrap it up, let&rsquo;s connect critical pieces at the intersection where <strong>stateful</strong> and <strong>stateless</strong> meet: <em>Data Assets</em>, also called <em>Data Products</em> (e.g., in Data Mesh).</p>
<p>What are data assets? These are stateful assets that hold data. An asset can be a Parquet file on S3, a Delta/Postgres table, or a data mart in your OLAP cube.</p>
<p>This is the only part of the declarative data stack that cannot be configured or defined by a function, as upstream data is constantly changing and outside our control.</p>
<p>However, in an ideal data stack, we can always recreate data assets from source data if we perform a full/initial load every time. The moment we perform an incremental load, it gets more complex.</p>
<p>An interesting approach is the <a href="https://www.ssp.sh/brain/software-defined-asset" target="_blank" rel="noopener noreffer">Software-Defined Asset</a>, which adds declarative capabilities to your stateful assets: for example, declaring that an asset should be updated daily, checking whether downstream events have changed, or automatically versioning data assets. These are all things you can implement and attach to your data assets, making even the data asset behave more declaratively.</p>
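<p>A minimal sketch of such a Software-Defined Asset in Dagster (the asset names are made up; refresh policies and versioning are left out for brevity):</p>
<pre><code class="language-python">from dagster import asset

@asset
def raw_covid_data():
    # The imperative detail (the actual loading) lives inside the function.
    ...

@asset
def covid_by_country(raw_covid_data):
    # The dependency is declared via the argument name, so the orchestrator
    # knows this asset must be rebuilt whenever raw_covid_data changes.
    ...
</code></pre>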
<p>You could imagine writing a <em>data stack engine</em> for all data assets that automatically backfills required data assets if the declaration of your data stack has changed. All while keeping the idempotency and functional data engineering paradigm in mind.</p>
<p>This significantly simplifies data governance, as rules could be embedded in the assets, even automatically censoring sensitive data. There would be no need to add security rules at every layer, as data gets censored automatically. I could give countless more examples, but you get the point.</p>
<h3 id="durable-state-vs-ephemeral-state">Durable State vs. Ephemeral State</h3>
<p>We can further categorize the state into <strong>durable</strong> and <strong>ephemeral</strong>. The durable state, represented by our data assets, persists across system restarts and is crucial for maintaining long-term data integrity. On the other hand, the ephemeral state is temporary and often exists only during the execution of our declarative processes.</p>
<p>The challenge in a declarative system is managing the interaction between these state types. While our declarative configurations handle ephemeral states elegantly, the durable state (our data assets) requires special consideration. This is where concepts like Software-Defined Assets come into play, too, allowing us to apply declarative principles to durable state management.</p>
<h3 id="low-code-vs-code-first">Low Code vs. Code First</h3>
<p>An interesting parallel: this is also where low/no-code vs. code-first comes from. Once you have a declarative data stack, you essentially have your low-code or no-code platform, as you can build an API/UI on top of your configs, and every user could change the data stack, assuming the features are implemented and exposed as configuration.</p>
<h3 id="maslows-hierarchy-of-state-in-declarative-data-stacks">Maslow&rsquo;s Hierarchy of State in Declarative Data Stacks</h3>
<p>Drawing inspiration from Maslow&rsquo;s hierarchy of needs, we can conceptualize the hierarchies of state in declarative data systems in three levels, progressing from static to dynamic:</p>
<ol>
<li><strong>Basic Declarative State</strong>: The foundation consists of all declarations needed to recreate the data stack. This includes configurations, schemas, transformations, etc. It&rsquo;s all <em>static</em>.</li>
<li><strong>Asset-added State</strong>: Building on the basic state, we incorporate data assets. These can include source data, processed datasets, dashboards, reports, and other artifacts that represent the actual data and the <em>state</em> of the data stack.</li>
<li><strong>Self-Actualizing State</strong>: The data stack becomes smarter and partially autonomous at the pinnacle. It can dynamically adapt to changes, such as automatically detecting new source data and generating appropriate declarative configs to integrate it into the entire stack. While the process remains deterministic based on the environment, it behaves <em>dynamically</em>.</li>
</ol>
<p>As we move up this hierarchy, we increase the features and autonomy of our data stack, reducing manual intervention and increasing system &ldquo;intelligence.&rdquo;</p>
<p>Closed-source cloud data stacks usually achieve the self-actualizing state, as they have all the <strong>metadata</strong> and control the whole platform. But I believe this is also possible with an open declarative data stack. By centralizing the settings of the entire data stack in a single Git repository and ensuring end-to-end integration with comprehensive metadata, we can achieve similar levels of declarative deployment and setup. This approach allows for idempotent recreation of state across the entire data stack, mirroring the capabilities of closed-source platforms while maintaining openness and flexibility.</p>
<h2 id="the-declarative-data-stack-engine">The Declarative Data Stack Engine</h2>
<p>Let&rsquo;s see what that could look like and what the data stack engine <strong>needs to implement</strong>. If we think in functional terms, we need to define the whole data stack with a single function:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">run_stack</span><span class="p">(</span><span class="n">serve</span><span class="p">(</span><span class="n">transform</span><span class="p">(</span><span class="n">ingest</span><span class="p">)))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In an ideal scenario, we could run the whole stack declaratively with this:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">run_stack</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">serve</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">template</span><span class="o">=</span><span class="s1">&#39;github://covid/covid_dashboard.md&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">data</span><span class="o">=</span><span class="n">transform</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="s1">&#39;groupby&#39;</span><span class="p">:</span> <span class="s1">&#39;country&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s1">&#39;aggregate&#39;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                    <span class="s1">&#39;cases&#39;</span><span class="p">:</span> <span class="s1">&#39;sum&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="s1">&#39;deaths&#39;</span><span class="p">:</span> <span class="s1">&#39;sum&#39;</span>
</span></span><span class="line"><span class="cl">                <span class="p">}</span>
</span></span><span class="line"><span class="cl">            <span class="p">},</span>
</span></span><span class="line"><span class="cl">            <span class="n">ingest</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="n">source</span><span class="o">=</span><span class="s1">&#39;duckdb&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="n">query</span><span class="o">=</span><span class="s1">&#39;SELECT * FROM &#34;s3://coviddata/covid_*.parquet&#34;&#39;</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>This example:</p>
<ol>
<li>Ingests data from a Parquet file in S3 using DuckDB</li>
<li>Transforms the data by grouping it by country and summing cases and deaths</li>
<li>Serves the transformed data using a dashboard template from GitHub</li>
</ol>
<p>The whole definition, and the functions a declarative data stack would need to implement, could look something like this:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">run_stack</span><span class="p">(</span><span class="n">serve_function</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">serve_function</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">serve</span><span class="p">(</span><span class="n">template</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Logic to render the template with the provided data</span>
</span></span><span class="line"><span class="cl">    <span class="k">pass</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">transform</span><span class="p">(</span><span class="n">transform_function</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">transform_function</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">ingest</span><span class="p">(</span><span class="n">source</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Logic to execute the query on the specified source</span>
</span></span><span class="line"><span class="cl">    <span class="k">pass</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Note: I intentionally omitted detailed orchestration; an orchestrator could well be the easiest way to implement such a data stack engine.</p>
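<p>To make the idea a bit more tangible, here is a hedged, minimal end-to-end sketch of such an engine, using DuckDB for ingestion and pandas for transformation and serving. Everything beyond the <code>run_stack(serve(transform(ingest)))</code> shape above (the plain-text rendering, the pandas calls) is an assumption for illustration, and the S3 path would need DuckDB&rsquo;s httpfs extension and credentials to actually resolve.</p>
<pre><code class="language-python">import duckdb

def ingest(source, query):
    # Declarative ingestion: run the query on the chosen engine (only DuckDB here)
    # and hand the result over as a pandas DataFrame.
    assert source == "duckdb"
    return duckdb.sql(query).df()

def transform(spec, data):
    # The 'what' is a dict of group-by and aggregations; pandas does the 'how'.
    return data.groupby(spec["groupby"]).agg(spec["aggregate"]).reset_index()

def serve(template, data):
    # Serving reduced to rendering: a plain-text table instead of a real dashboard.
    return f"# {template}\n\n{data.to_string(index=False)}"

def run_stack(rendered):
    # The engine simply evaluates the composed declaration.
    print(rendered)

run_stack(
    serve(
        template="covid_dashboard",
        data=transform(
            {"groupby": "country", "aggregate": {"cases": "sum", "deaths": "sum"}},
            ingest(source="duckdb",
                   query='SELECT * FROM "s3://coviddata/covid_*.parquet"'),
        ),
    )
)
</code></pre>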
<div class="details admonition note open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-note"></i>Side-Note about Validation<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">Each layer (ingestion, serving, transformation, etc.) should be able to validate itself, and the data stack engine would validate and stitch the missing pieces together. I could imagine having data contracts between the different interfaces, but that would be an implementation detail, and it is up to the engine how they want to implement it.</div>
        </div>
    </div>
<h2 id="whats-next">What&rsquo;s Next?</h2>
<p>In summary, we need three things to recreate a declarative data stack from scratch:</p>
<ol>
<li>Exogenous/external data sources</li>
<li>Declarative artifacts</li>
<li>Rendering engine</li>
</ol>
<p>Integrating everything into one git repo for all common configs (and metadata) will allow us to achieve things similar to those of closed-source data platforms.</p>
<p>This is part one of a series. Next, in <a href="/blog/bi-as-code-and-genbi/" rel="">Part 2</a>, we&rsquo;ll explore the concept of &ldquo;BI-as-code&rdquo; and its effectiveness as an interface for GenBI (Generative Business Intelligence). We&rsquo;ll begin by delving into why a declarative approach is a prerequisite for GenAI. We&rsquo;ll also examine alternatives to the declarative data stack, providing a comprehensive view of the landscape. This part will highlight how the declarative paradigm is not just a trend but a fundamental shift in how we approach business intelligence in the age of AI.</p>
<p>Part 3 will take a practical turn as we attempt to build an example declarative data stack and show how to implement it using AI, since declarative stacks lend themselves to a generative approach. The goal is to iterate and get one step closer to the open declarative data stack.</p>
<h2 id="further-reading">Further Reading</h2>
<h3 id="the-declarative-mindset-in-data-stacks">The Declarative Mindset in Data Stacks</h3>
<ol>
<li>Pedram Navid. &ldquo;The Rise of the Data Platform Engineer.&rdquo; Dagster Blog. <em>An in-depth look at the evolving role of data platform engineers in modern data architectures.</em> <a href="https://dagster.io/blog/rise-of-the-data-platform-engineer" target="_blank" rel="noopener noreffer">Read the article</a></li>
<li>Benoit Pimpaud. &ldquo;ELT with Kestra, DuckDB, dbt, Neon and Resend.&rdquo; Medium. <em>A practical guide to implementing an ELT pipeline using various modern data tools.</em> <a href="https://medium.pimpaudben.fr/elt-with-kestra-duckdb-dbt-neon-and-resend-5bfd62160190" target="_blank" rel="noopener noreffer">Read the article</a></li>
<li>Archie Wood. &ldquo;An end-to-end data stack with just DuckDB: ETL is dead, long live ETV.&rdquo; LinkedIn. <em>Explores the concept of Extract-Transform-Visualize (ETV) using DuckDB as the primary tool.</em> <a href="https://www.linkedin.com/posts/archiesarrewood_an-end-to-end-data-stack-with-just-duckdb-activity-7245362448545808385-ioK3" target="_blank" rel="noopener noreffer">Read the post</a></li>
</ol>
<h3 id="code-reusability-in-data-transformation">Code Reusability in Data Transformation</h3>
<ol start="4">
<li>Maxime Beauchemin. &ldquo;Why Data Teams Keep Reinventing the Wheel: The Struggle for Code Reuse in the Data Transformation Layer.&rdquo; Preset. <em>Discusses the challenges and importance of code reusability in data transformation, introducing the concept of Parametric Data Pipelines.</em> <a href="https://preset.io/blog/why-data-teams-keep-reinventing-the-wheel/" target="_blank" rel="noopener noreffer">Read the article</a></li>
</ol>
<hr>
<pre class=""><em>Full article published at <a href="https://www.rilldata.com/blog/the-rise-of-the-declarative-data-stack" target="_blank" rel="noopener noreferrer">Rilldata.com</a> - written as part of <a href="/services">my services</a></em></pre>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Markdown plaintext files, Obsidian, and GitHub Discussions were the most loved async tools for developers in 2024 according to <a href="https://survey.stackoverflow.co/2024/technology#2-asynchronous-tools" target="_blank" rel="noopener noreffer">StackOverflow Survey</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
</item>
<item>
    <title>Data Modeling - The Unsung Hero of Data Engineering: Architecture Pattern, Tools and the Future (Part 3)</title>
    <link>https://www.ssp.sh/blog/data-modeling-for-data-engineering-architecture-pattern-tools-future/</link>
    <pubDate>Fri, 26 May 2023 22:17:22 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/data-modeling-for-data-engineering-architecture-pattern-tools-future/</guid><enclosure url="https://www.ssp.sh/blog/data-modeling-for-data-engineering-architecture-pattern-tools-future/images/data-modeling-architecture-pattern.jpg" type="image/jpeg" length="0" /><description><![CDATA[<p>Welcome to the third and final installment of our series &ldquo;Data Modeling: The Unsung Hero of Data Engineering.&rdquo; If you’ve journeyed with us from <a href="/blog/data-modeling-for-data-engineering-introduction/" rel="">Part 1</a>, where we dove into the importance and history of data modeling, or joined us in <a href="/blog/data-modeling-for-data-engineering-approaches-techniques/" rel="">Part 2</a> to explore various approaches and techniques, I’m delighted you’ve stuck around.</p>
<p>In this third part, we’ll delve into data architecture patterns and their influence on data modeling. We’ll explore general and specialized patterns, debating the merits of various approaches like batch vs. streaming and lakehouse vs. warehouse, and the role of a semantic layer in complex data architecture.</p>
<p>We’ll also survey the landscape of data modeling tools, comparing commercial and open-source options, and ponder the potential of AI in data modeling. To wrap up, we’ll introduce data modeling frameworks like ADAPT™ and BEAM, designed to guide effective data model creation. Please join me as we take this exciting journey toward understanding data architecture better.</p>
<h2 id="data-architecture-pattern-with-data-modeling">Data Architecture Pattern with Data Modeling</h2>
<p>Bringing together the best of the individual approaches and techniques and knowing the common problems, we must always <strong>keep the bigger data architecture picture in mind</strong>. Sometimes you want Data Vault modeling for your first layer when your source system is constantly changing, or you need a dimensional model in your last layer to build data apps on top of it or allow self-serve analytics. But how do you do this?</p>
<p>For these, we need to look at data architecture per se. In this chapter, I will list some of the most common architectural patterns I’ve seen, but without going into all details of each.</p>
<h2 id="general-purpose-data-architecture-pattern-medallion-core-etc">General Purpose Data Architecture Pattern (Medallion, Core, etc.)</h2>
<p>Let’s start with the one that suited me best, the one you have seen in one form or another. When I started at a consulting firm called <a href="https://trivadis.com/" target="_blank" rel="noopener noreffer">Trivadis</a>, it was called the &ldquo;Foundational Architecture of Data Warehouse&rdquo;. We followed best practices such as <code>staging&gt;cleansing&gt;core&gt;mart&gt;BI</code>.</p>
<figure>
<a target="_blank" href="/blog/data-modeling-for-data-engineering-architecture-pattern-tools-future/images/data-warehouse-blueprint.png" title="/blog/data-modeling-for-data-engineering-architecture-pattern-tools-future/images/data-warehouse-blueprint.png">

</a><figcaption class="image-caption">Fundamental Data Architecture of a Data Warehouse | Image from <a href="https://www.amazon.de/Data-Warehouse-Blueprints-Business-Intelligence/dp/3446450750" target="_blank" rel="noopener noreffer">Data Warehouse Blueprints</a>, September 2016</figcaption>
</figure>
<p>We would design these layers from the top-down approach discussed above and decide the data modeling technique for each layer depending on requirements.</p>
<p>Let’s have a detailed look at each layer, as these are fundamental for every data architecture project, and they will help us understand why we’d want to model different layers differently. The following layers or areas belong to a complete Data Warehouse (DWH) architecture but can be implemented into any data lake or analytics product you use.</p>
<h3 id="staging-area">Staging Area</h3>
<p>Data from various source systems is first loaded into the Staging Area.</p>
<ul>
<li>In this first area, the data is stored as it is delivered; therefore, the stage tables’ structure corresponds to the interface to the source system.</li>
<li>No relationships exist between the individual tables.</li>
<li>Each table contains the data from the latest delivery, which will be deleted before the next delivery.</li>
<li>For example: In a grocery store, the Staging Area corresponds to the loading dock where suppliers (source systems) deliver their goods (data). Only the latest deliveries are stored there before being transferred to the next area.</li>
</ul>
<h3 id="cleansing-area">Cleansing Area</h3>
<p>The delivered data must be cleaned before it is loaded into the Core. Most of these cleaning steps are performed in the Cleansing Area.</p>
<ul>
<li>Faulty data must be filtered, corrected, or complemented with singleton (default) values.</li>
<li>Data from different source systems must be transformed and integrated into a unified form.</li>
<li>This layer also contains only the data from the latest delivery.</li>
<li>For example: In a grocery store, the Cleansing Area can be compared to the area where the goods are commissioned for sale. The goods are unpacked, vegetables and salad are washed, the meat is portioned, possibly combined with multiple products, and everything is labeled with price tags. The quality control of the delivered goods also belongs in this area.</li>
</ul>
<h3 id="core">Core</h3>
<p>The data from the different source systems are brought together in a central area, the Core, through the Staging and Cleansing Area and stored there for extended periods, often several years.</p>
<ul>
<li>A primary task of the Core is to integrate the data from different sources and store it in a thematically structured way rather than separated by origin.</li>
<li>Often, thematic sub-areas in the Core are called &ldquo;Subject Areas.&rdquo;</li>
<li>The data is stored in the Core so that historical data can be determined at any later point in time.</li>
<li>The Core should be the only data source for the Data Marts.</li>
<li>Direct access to the Core by users should be avoided as much as possible.</li>
</ul>
<h3 id="data-marts">Data Marts</h3>
<p>In the Data Marts, subsets of the data from the Core are stored in a form suitable for user queries.</p>
<ul>
<li>Each Data Mart should only contain the data relevant to each application or a unique view of the data. This means several Data Marts are typically defined for different user groups and BI applications.</li>
<li>This reduces the complexity of the queries, increasing the acceptance of the DWH system among users.</li>
<li>For example: the Data Marts are the grocery store’s market stalls or sales points. Each market stall offers a specific selection of goods, such as vegetables, meat, or cheese. The goods are presented so that they are accepted, i.e., purchased, by the respective customer group.</li>
</ul>
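<p>As a small, hypothetical sketch of how these areas relate in practice, here is the <code>staging&gt;cleansing&gt;core&gt;mart</code> flow expressed as DuckDB schemas; table and column names are made up, and a real Core would of course add integration and historization:</p>
<pre><code class="language-python">import duckdb

con = duckdb.connect("dwh.duckdb")
for schema in ("staging", "cleansing", "core", "mart"):
    con.sql(f"CREATE SCHEMA IF NOT EXISTS {schema}")

# Staging: data exactly as delivered by the source system (latest delivery only).
con.sql("CREATE OR REPLACE TABLE staging.sales AS SELECT * FROM 'deliveries/sales_*.csv'")

# Cleansing: filter faulty rows, fill defaults, unify types.
con.sql("""
    CREATE OR REPLACE TABLE cleansing.sales AS
    SELECT coalesce(customer_id, -1) AS customer_id,
           cast(amount AS DECIMAL(10, 2)) AS amount,
           sale_date
    FROM staging.sales
    WHERE sale_date IS NOT NULL
""")

# Core: integrated, subject-oriented storage kept for the long term (simplified).
con.sql("CREATE OR REPLACE TABLE core.sales AS SELECT * FROM cleansing.sales")

# Data Mart: a query-friendly subset for one user group or BI application.
con.sql("""
    CREATE OR REPLACE TABLE mart.sales_by_customer AS
    SELECT customer_id, sum(amount) AS total_amount
    FROM core.sales
    GROUP BY customer_id
""")
</code></pre>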
<div class="details admonition note open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-note"></i>And as a Foundation, we have Metadata<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">Different types of metadata are needed for the smooth operation of the Data Warehouse. Business metadata contains business descriptions of all attributes, drill paths, and aggregation rules for the front-end applications and code designations. Technical metadata describes, for example, data structures, mapping rules, and parameters for ETL control. Operational metadata contains all log tables, error messages, logging of ETL processes, and much more. The metadata forms the infrastructure of a DWH system and is described as &ldquo;data about data&rdquo;.</div>
        </div>
    </div>
<h3 id="not-every-architecture-is-the-same">Not every architecture is the same</h3>
<p>Only some data warehouses or data engineering projects have precisely this structure. Some areas are combined, such as the Staging and Cleansing areas, or differently named. The Core is sometimes referred to as the &ldquo;Integration Layer&rdquo; or &ldquo;(Core) Data Warehouse.&rdquo;</p>
<p>However, the overall system must be divided into different areas to decouple the other tasks, such as data cleaning, integration, historization, and user queries. In this way, the complexity of the transformation steps between the individual layers can be reduced.</p>
<h3 id="why-use-it-today">Why use it today?</h3>
<p>Isn’t it amazing that something from 2016 is still so current? That’s why data modeling is coming back into vogue: it has never been entirely outdated.</p>
<p>Databricks renamed these layers <strong>bronze, silver, and gold</strong>, which is arguably a little easier to understand, and called it the <a href="https://www.databricks.com/glossary/medallion-architecture" target="_blank" rel="noopener noreffer">Medallion Architecture</a>, but it’s something every BI engineer works with every day. In essence, it’s the same concept.</p>
<p>Or if we look at the <a href="/brain/data-engineering-lifecycle" rel="">Data Engineering Lifecycle</a> introduced by the <a href="https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302" target="_blank" rel="noopener noreffer">Fundamentals of Data Engineering</a>, we see a similar picture but on an even higher level. You could apply the different layers to the <em>Storage</em> layer in the image below.</p>
<figure>
<a target="_blank" href="/blog/data-modeling-for-data-engineering-architecture-pattern-tools-future/images/data-engineering-lifecycle.png" title="/blog/data-modeling-for-data-engineering-architecture-pattern-tools-future/images/data-engineering-lifecycle.png">

</a><figcaption class="image-caption">The data engineering lifecycle, by <a href="https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/" target="_blank" rel="noopener noreffer">Fundamentals of Data Engineering</a></figcaption>
</figure>
<h2 id="specialized-data-architecture-patterns">Specialized Data Architecture Patterns</h2>
<p>In this chapter, we look at patterns that might not be considered part of formal data modeling, but each of these decisions will highly influence the data modeling part. Beyond the general data architecture, I call these specialized data architecture patterns, as they are higher-level data architecture decisions.</p>
<h3 id="batch-vs-streaming">Batch vs. Streaming</h3>
<p>An obvious decision you need to make early on is whether you need real-time data for a critical data application, or whether batch, with near-real-time micro-batching every minute or hour, is enough.</p>
<p>Still, to this day, streaming is mostly optional. Suppose you tell the business team you can achieve near real-time with hourly batching; they will be happy. The very latest data has little influence on your data analytics; most use cases look at historical data anyway.</p>
<p>Nevertheless, some business-critical data solutions need real-time. However, be aware that the <strong>effort and challenges will be much more significant</strong>, as you can only partially recover if a stream fails. A good idea, though, is to set up your data pipelines event-based, following the streaming approach; that way, you can lower the latency of your batches step by step to achieve near real-time.</p>
<h3 id="data-lakelakehouse-vs-data-warehouse-pattern">Data Lake/Lakehouse vs. Data Warehouse Pattern</h3>
<p>As Srinivasan Venkatadri <a href="https://www.linkedin.com/feed/update/urn:li:activity:6991977435017728000?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A6991977435017728000%2C6993007322495148032%29" target="_blank" rel="noopener noreffer">says</a> correctly: &ldquo;Data modeling, in general, should also talk about files (open formats), <a href="/brain/data-lakehouse" rel="">Lakehouses</a>, and techniques to convert or extract structural data from semi or unstructured would help.&rdquo; This is precisely where data lake/lakehouse or data warehouse patterns apply. With these, you need to think about these questions.</p>
<p>With the long fight between <a href="/brain//elt/" rel="">ELT</a> with Data Lakes and traditional <a href="/brain/etl/" rel="">ETL</a> with Data Warehouses, the different architectures greatly influence the data modeling architecture.</p>
<p>I will only go into some detail here, as I have <a href="/blog/data-lake-lakehouse-guide/" rel="">written extensively</a> about these topics. But the decision to go with a data lake or lakehouse will likely push the data modeling toward a more open-source strategy and tooling. Also, as you are dumping files into your lake, you need to work with large distributed files, where a <a href="/brain/data-lake-table-format" rel="">Table Format</a> makes a lot of sense to get database-like features on top of files.</p>
<h3 id="semantic-layer-in-memory-vs-persistence-or-semantic-vs-transformation-layer">Semantic Layer (In-memory vs. Persistence or Semantic vs. Transformation Layer)</h3>
<p>The data lake decision is also highly influential with the new <a href="/blog/rise-of-semantic-layer-metrics/" rel="">Rise of the Semantic Layer</a>, where you run a logical layer on top of your files in the case of a data lake and the possibility of a data warehouse on top of your data marts.</p>
<p>This is an interesting one, as we talked in <a href="/blog/data-modeling-for-data-engineering-introduction/" rel="">part 1</a> and <a href="/blog/data-modeling-for-data-engineering-approaches-techniques/" rel="">part 2</a> about the <a href="/blog/data-modeling-for-data-engineering-approaches-techniques/#data-modeling-approaches" rel="">Conceptual, Logical, and Physical Data Models</a>, where a <a href="/brain/semantic-layer/" rel="">Semantic Layer</a> could replace the logical model. The semantic layer stores the queries in declarative YAML; the data is not persisted in physical storage but computed at run time, when the data is fetched. <strong>Integrating the semantic layer into complex data architecture and overall data modeling makes sense</strong>. It can serve a wide range of data consumers with different requirements, such as direct SQL, API (REST), or even GraphQL, with the great benefit of writing business logic only once. Here again, as business logic is the biggest treasure of data modeling, we have many new options to model our data with semantic layers.</p>
<p>The counter architecture pattern is to go with a <strong>transformation layer</strong>, which can typically be achieved with dbt or a <a href="/brain/data-orchestrators" rel="">Data Orchestrator</a>, where you persist each step into physical storage, mainly to gain faster query speed on these data sets and to reuse them in other marts or data apps. The transformation layer and its <strong>transformational modeling</strong> involve decisions between table formats vs. <a href="https://glossary.airbyte.com/term/in-memory-format/" target="_blank" rel="noopener noreffer">in-memory formats</a>, <a href="https://glossary.airbyte.com/term/push-down/" target="_blank" rel="noopener noreffer">query push-downs</a>, and <a href="/brain/data-virtualization" rel="">data virtualization</a>.</p>
<div class="details admonition abstract open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-abstract"></i>Materialized Views<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">A transformation layer where you persist data is very similar to what materialized views did before the <a href="https://airbyte.com/blog/data-engineering-past-present-and-future" target="_blank" rel="noopener noreffer">modern era of data engineering</a>.</div>
        </div>
    </div>
<h3 id="modernopen-data-stack-pattern">Modern/Open Data Stack Pattern</h3>
<p>Next up is the <a href="/brain/modern-data-stack" rel="">Modern Data Stack</a> architecture, or <a href="https://www.ssp.sh/brain/open-data-stack/" target="_blank" rel="noopener noreffer">Open Data Stack</a>. This pattern is basically about the <a href="https://mad.firstmark.com/" target="_blank" rel="noopener noreffer">sheer choice</a> we have nowadays within the open data stack. It is an architecture in its own right: choosing the tool best suited for the requirements at hand in the company.</p>
<p>It matters whether you choose an open-source tool at v0.5 that might be abandoned in a couple of years, or the v0.1 that makes the cut down the line. But these are complicated bets, and only experienced data architects and people working in the field can make them, based on intuition for how a project is progressing.</p>
<p>It is also about saying no to yesterday’s shiny new tool, resisting the urge, and waiting until the product is more mature; but at the same time about taking risks and betting on a tool that is set up for success and embraces the open-source philosophy, instead of building a worse copy in-house. Every enterprise either uses a closed-source solution to handle the <a href="/brain/data-engineering-lifecycle/" rel="">data engineering lifecycle</a>, builds its own stack, or takes the newer option of betting on a framework developed in the open.</p>
<h3 id="imperative-vs-declarative-pattern">Imperative vs. Declarative Pattern</h3>
<p>The <a href="/brain/declarative/" rel="">declarative</a> approach is manifesting itself more and more. It started with the frontend revolution, where React declared components; <a href="/brain/kubernetes/" rel="">Kubernetes</a> did it for <a href="/brain/dev-ops/" rel="">DevOps</a>; Dagster <a href="/blog/data-orchestration-trends/" rel="">revolutionized orchestration</a> with <a href="/brain/software-defined-asset" rel="">Software-Defined Assets</a>; and we at Airbyte created a <a href="https://docs.airbyte.com/connector-development/config-based/low-code-cdk-overview/" target="_blank" rel="noopener noreffer">Low-Code Connector Development Kit</a> to create data integrations in minutes by filling out a YAML file.</p>
<p>This is only possible because the complexity and the amount of boilerplate are drastically reduced. The declarative way describes the <em>what</em>; the <em>how</em> is taken care of by the framework.</p>
<p>I have much more to say, but you can read more on our data glossary on declarative and how it relates to <a href="/brain/functional-data-engineering" rel="">functional data engineering</a> and what the opposite, <a href="/brain/imperative/" rel="">imperative</a> pattern is.</p>
<h3 id="notebooks-vs-data-pipelines-pattern">Notebooks vs. Data Pipelines Pattern</h3>
<p>A pattern directly related to <a href="/brain/data-orchestrators" rel="">orchestration</a> is the notebooks versus data pipelines pattern: you can write and run a pipeline solely in a <a href="/brain/notebooks/" rel="">notebook</a>, or build mature, unit-tested data pipelines with a stable orchestrator, with all the bells and whistles included.</p>
<h3 id="centralized-and-decentralized-pattern">Centralized and Decentralized Pattern</h3>
<p>Suppose you have been in the field for a while. How many cycles have you been through, from doing everything server-side, to switching everything to client-side rendering, and back and forth with the battles of client vs. server, microservices vs. monoliths, and lately, central cloud data warehouses vs. a yet-to-prove-itself decentralized <a href="/brain/data-mesh/" rel="">Data Mesh</a>? Many more back-and-forths will follow.</p>
<p>Whatever you choose, start with <strong>simplicity</strong>. You can always add complexity later when the product and solution mature.</p>
<h3 id="everything-else">Everything else</h3>
<p>There are many more patterns I will write about at some later time. For now, I leave you with the pointers above.</p>
<h2 id="data-modeling-tools">Data Modeling Tools</h2>
<p><strong>Popular data modeling tools</strong> include <a href="https://sqldbm.com/" target="_blank" rel="noopener noreffer">Sqldbm</a>, <a href="https://dbdiagram.io/" target="_blank" rel="noopener noreffer">DBDiagrams</a>, <a href="https://sparxsystems.com/" target="_blank" rel="noopener noreffer">Enterprise Architect</a>, and <a href="https://www.sap.com/products/technology-platform/powerdesigner-data-modeling-tools.html" target="_blank" rel="noopener noreffer">SAP PowerDesigner</a>. These tools are widely used in the industry and offer powerful features such as data modeling, profiling, and visualization.</p>
<p><strong>Open-source data modeling tools</strong> such as <a href="https://www.mysql.com/products/workbench/" target="_blank" rel="noopener noreffer">MySQL Workbench</a> and <a href="http://www.modelsphere.com/" target="_blank" rel="noopener noreffer">OpenModelSphere</a> are free and offer essential features for creating data models. They are helpful for small projects and provide an opportunity for data engineers to learn data modeling skills.</p>
<p><strong>Choosing the right data modeling tool</strong> depends on the organization’s needs, budget, and project size. Large organizations may require expensive enterprise-level tools, while small businesses may opt for open-source tools. Selecting a tool that is easy to use, has the needed features, and is compatible with the organization’s database management system is essential.</p>
<p>Other tools are <a href="https://www.ellie.ai/" target="_blank" rel="noopener noreffer">Ellie.ai</a>, whose key features are Data Product Design, Data Modeling, Business Glossary, Collaboration, Reusability, and Open API.</p>
<p>dbt can be seen as a transformation modeling tool. <a href="/brain/dagster/" rel="">Dagster</a> can be used as a <a href="/brain/dag" rel="">DAG</a> modeling tool. And so forth. But you can also use <a href="https://excalidraw.com/" target="_blank" rel="noopener noreffer">Excalidraw</a> for Markdown-based drawing or <a href="https://draw.io/" target="_blank" rel="noopener noreffer">draw.io</a> (lots of <a href="https://www.drawio.com/example-diagrams" target="_blank" rel="noopener noreffer">templates</a> for AWS, Azure, etc.) to draw architectures.</p>
<p>If you struggle to think in dbt tables, maybe SQL is not the right language for you. One problem: SQL is a <a href="/brain/declarative/" rel="">declarative</a> language, which is a blessing and a curse. Especially with recurring queries, SQL quickly turns into nasty spaghetti code; dbt helps here with <a href="/brain/jinja-template" rel="">Jinja Templates</a>, but as Jinja is not a full programming language, there is not much built-in support. <a href="https://reconfigured.io/" target="_blank" rel="noopener noreffer">Reconfigured</a> (not free) was built for people without years of experience, focusing heavily on business logic.</p>
<h3 id="what-about-chatgpt">What about ChatGPT?</h3>
<p>With all the hype around generative AI, specifically ChatGPT, the question arises whether AI can model our data.</p>
<p>If we recap the information from this series, most of it condenses into translating business requirements into a data model or semantics, also called business logic. As Chad Sanderson mentioned in his <a href="https://www.linkedin.com/posts/chad-sanderson_dataengineering-activity-7045442450139607040-IrA6?utm_source=share&amp;utm_medium=member_desktop," target="_blank" rel="noopener noreffer">post</a>:</p>
<blockquote>The hard part of data development is understanding how code <strong>translates to the real world</strong>. Every business has a unique way of storing data. One customer ID could be stored in a MySQL DB. Another could be imported from Mixpanel as nested JSON, and a third might be collected from a CDP. All three IDs and their properties are slightly (or significantly) different and must be integrated.</blockquote>
<p>As Chad continues, and I strongly agree, he says, &ldquo;As smart as ChatGPT might be, it would need to <strong>understand the semantics</strong> of how these IDs coalesce into something meaningful to automate any step of modeling or ETL. The algorithm must understand the real world, grok how the business works, and dynamically tie the data model to that understanding.&rdquo;</p>
<p>This is no longer a question of AI or ML; it is about intuition, long-term experience in the field, and having lived through some of its challenges to get a feeling for it.</p>
<p>If there is a place where you do not want to use automation, then it would be in data modeling and the overall data architecture. Here we need discussions and a deep understanding (domain knowledge) of the field and data modeling.</p>
<p>On the other hand, based on your data model, you can use AI to generate schemas or the physical model of a database. You can also brainstorm with ChatGPT based on the data model you created: ask whether something is missing or whether it sees something that could be changed. Sometimes you get great insights from it simply by providing the solution you came up with.</p>
<p>Maybe most importantly, a data engineer will always be needed to assess the outcome, make sense of the data in front of them, and understand the overall data flow of the organization. That can only be delegated to a limited degree.</p>
<h2 id="data-modeling-frameworks">Data Modeling Frameworks</h2>
<p>Besides tools, there are also helpful frameworks that support you in modeling your data by asking the right questions.</p>
<h3 id="adapt">ADAPT™</h3>
<p><a href="http://www.symcorp.com/downloads/ADAPT_white_paper.pdf" target="_blank" rel="noopener noreffer">ADAPT</a> argues that existing data modeling techniques like ER and dimensional modeling are not sufficient for OLAP database design. That&rsquo;s why ADAPT is a modeling technique <strong>designed specifically for OLAP databases</strong>. It addresses the unique needs of OLAP data modeling. The basic building blocks of ADAPT are <strong>cubes and dimensions</strong>, which are the core objects of the OLAP multidimensional data model.</p>
<p>Although ADAPT was created for OLAP cubes in the old days, most of its techniques and ideas also apply to regular data modeling nowadays. There are nine ADAPT database objects; the legend below shows their symbols along with simple usage examples.</p>
<figure>
<a target="_blank" href="/blog/data-modeling-for-data-engineering-architecture-pattern-tools-future/images/adapt-legend.jpg" title="/blog/data-modeling-for-data-engineering-architecture-pattern-tools-future/images/adapt-legend.jpg">

</a><figcaption class="image-caption">Legend of ADAPT Framework | Source unknown</figcaption>
</figure>
<h4 id="why-adapt-over-er-and-dimensional-modeling">Why ADAPT over ER and dimensional modeling</h4>
<p>ADAPT is considered superior to ER and dimensional modeling for several reasons:</p>
<ol>
<li><strong>Incorporation of Both Data and Process</strong>: ADAPT incorporates both data and process in its approach, which is particularly useful for designing (OLAP) data marts.</li>
<li><strong>Logical Modeling</strong>: ADAPT emphasizes logical modeling. This prevents the designer from jumping to solutions before fully understanding the problem.</li>
<li><strong>Enhanced Communication</strong>: ADAPT enhances communication among project team members, providing a common basis for discussion. This improved communication leads to higher-quality software applications and data models.</li>
<li><strong>Comprehensive Representation</strong>: ADAPT allows for the representation of an (OLAP) application in its entirety without compromising the design due to the limitations of a modeling technique designed for another purpose.</li>
</ol>
<p>In summary, ADAPT is said to be a more flexible, comprehensive, and communication-enhancing modeling technique for OLAP databases compared to ER and dimensional modeling.</p>
<h3 id="beam-for-agile-data-warehousing">BEAM for Agile Data Warehousing</h3>
<p><a href="http://www.decisionone.co.uk/" target="_blank" rel="noopener noreffer">BEAM</a>, or Business Event Analysis &amp; Modeling, is a method for agile requirement gathering designed explicitly for Data Warehouses, created by Lawrence Corr and Jim Stagnitto in the <a href="https://www.amazon.com/Agile-Data-Warehouse-Design-Collaborative/dp/0956817203" target="_blank" rel="noopener noreffer">Agile Data Warehouse Design</a> book. BEAM centers requirement analysis around business processes instead of solely focusing on reports.</p>
<p>It uses an inclusive, collaborative modeling notation to document business events and dimensions in a tabular format. This format is <strong>easily understood</strong> by business stakeholders and easily implemented by developers. The idea is to facilitate interaction among team members, enabling them to think dimensionally from the get-go and foster a sense of ownership among business stakeholders.</p>
<p>The principles of BEAM include:</p>
<ul>
<li><a href="https://modelstorming.com/" target="_blank" rel="noopener noreffer"><strong>Modelstorming</strong></a><strong>:</strong> Business intelligence is driven by what users ask about their business; the technical setting is secondary. Storyboard the data warehouse to discover and plan iterative development.</li>
<li><strong>Asking stories with 7W</strong>: Telling dimensional data stories using the 7Ws (who, what, when, where, how many, why, and how). Model by example, not abstraction, using data story themes.</li>
<li><strong>Visual modeling</strong>: Sketching timelines, charts, and grids to model complex process measurement simply.</li>
<li><strong>Business Driven</strong>: Well-documented data warehouses that take years to deploy will be outdated before they launch, and business users will look elsewhere. The typical reaction from business units: &ldquo;I need it now, or I’d rather stick with my Excel solution ….&rdquo;</li>
<li><strong>Customer Collaboration</strong>: End users’ business knowledge is your greatest resource.</li>
<li><strong>Responding to Change</strong>: Promoting change through the above actions, leading to weekly delivery cycles.</li>
<li><strong>Agile design documentation</strong>: Enhancing star schemas with BEAM dimensional shorthand notation</li>
</ul>
<p>Lawrence Corr emphasizes the importance of asking the right questions or &ldquo;data stories.&rdquo; For instance, a customer’s product purchase could trigger questions about the order date, purchase and delivery locations, the quantity bought, purchase reason, and buying channel. A comprehensive picture of the business process is formed by carefully addressing these questions and providing the basis for technical specifications.</p>
<h3 id="common-data-model">Common Data Model</h3>
<p>There are examples of standard data models, so you do not need to start from scratch. The concept behind these approaches is to transform data contained within those databases into a standard format (data model) and a common representation (terminologies, vocabularies, coding schemes), then perform systematic analyses using a library.</p>
<p>For example, every model needs dimensions such as customer, region, etc. Some references I found are listed below:</p>
<ul>
<li>Microsoft started <a href="https://github.com/microsoft/CDM" target="_blank" rel="noopener noreffer">The Common Data Model (CDM)</a>.</li>
<li><a href="https://www.ohdsi.org/data-standardization/" target="_blank" rel="noopener noreffer">OMOP Common Data Model</a> for health care. It allows for systematically analyzing disparate observational databases.</li>
<li>Elastic has its <a href="https://www.elastic.co/guide/en/ecs/current/index.html" target="_blank" rel="noopener noreffer">Common Schema (ECS) Reference</a>.</li>
<li>A bit more advanced: <a href="https://github.com/holistics/dbml" target="_blank" rel="noopener noreffer">DBML (Database Markup Language)</a>, an open-source DSL designed to define and document database schemas and structures, tries to establish a standard for describing these models. DBML is intended to be simple, consistent, and highly readable. It works well with <a href="https://dbdiagram.io/" target="_blank" rel="noopener noreffer">dbdiagram.io</a>.</li>
</ul>
<h2 id="applying-data-modeling--best-practices">Applying Data Modeling / Best Practices?</h2>
<p>At the end of all we learned, how do you apply data modeling in practice? This series gave you a good introduction and some pointers to look for when you start with data modeling.</p>
<p>Below I found the <a href="https://www.spiceworks.com/tech/big-data/articles/what-is-data-modeling/#_003" target="_blank" rel="noopener noreffer">Best Practices for Data Modeling</a>, which guide you through the critical steps, from (1) designing the data model to (11) verifying and testing the application of your data analytics:</p>
<ol>
<li>Design the data model for visualization</li>
<li>Recognize the demands of the business and aim for relevant results</li>
<li>Establish a single source of truth</li>
<li><strong>Start with simple data modeling</strong> and expand later</li>
<li>Double-check each step of your data modeling</li>
<li>Organize business queries according to dimensions, data, filters, and order</li>
<li>Perform computations beforehand to prevent disputes with end customers</li>
<li>Search for a relationship rather than just a correlation</li>
<li>Use contemporary tools and methods</li>
<li>Enhance data modeling for improved business results</li>
<li>Verify and test the application of your data analytics</li>
</ol>
<p>Another excellent best-practice guide for dbt is <a href="https://airbyte.com/blog/dbt-data-model" target="_blank" rel="noopener noreffer">How to Write a High-Quality Data Model From Start to Finish Using dbt</a> by <a href="https://airbyte.com/blog-authors/madison-schott" target="_blank" rel="noopener noreffer">Madison</a>; for applying dimensional modeling with Kimball and dbt, see <a href="https://docs.getdbt.com/blog/kimball-dimensional-model" target="_blank" rel="noopener noreffer">Building a Kimball dimensional model with dbt</a> (<a href="https://github.com/Data-Engineer-Camp/dbt-dimensional-modelling" target="_blank" rel="noopener noreffer">GitHub</a>) by <a href="https://www.linkedin.com/in/jonneo/" target="_blank" rel="noopener noreffer">Jonathan</a>.</p>
<h2 id="the-future-of-data-modeling">The Future of Data Modeling</h2>
<p>As I’ve delved into the intricacies of data modeling in the previous parts of this series, it’s clear that we’re witnessing a revolution in the way we perceive, manage, and interact with data. The digital age is characterized by information overload, and data modeling provides the framework to harness this data and transform it into valuable insights.</p>
<p>Reflecting on the future of data modeling, I can’t help but feel a sense of optimism mixed with anticipation. It’s thrilling to envision a world where data-driven decisions are the norm rather than the exception, and I genuinely believe we’re on the right track.</p>
<p>Emerging technologies like AI and machine learning promise to further streamline and automate the process of data modeling. There’s potential for AI to take on a more active role in data modeling, translating complex business logic into coherent data structures.</p>
<p>This vision, however, doesn’t mean we can become complacent. It’s more crucial than ever for data professionals to stay on top of evolving industry trends and techniques. And then, there’s the matter of the various data modeling tools available. The future will likely expand the number of open-source and proprietary options. But at the end of the day, selecting the right tool will always come down to your specific requirements, constraints, and the nature of your data.</p>
<p>We also must remember the importance of data architecture patterns. As data grows in volume and complexity, finding the most suitable architecture becomes increasingly critical. The choice between batch vs. streaming or data lake vs. data warehouse could significantly impact your data modeling efforts. The same goes for decisions around implementing a semantic layer, opting for a modern/open data stack, or navigating between centralized and decentralized patterns.</p>
<p>As I wrap up this series on data modeling, I encourage you all to keep learning, experimenting, and pushing the boundaries of what’s possible with your data. Remember, the essence of data modeling is simplicity, no matter how complex the underlying data might be. The future of data modeling is constant change. But one thing is certain: it will be critical for every company. I’m excited to see how the field evolves and how we, as data practitioners, continue to drive this evolution :).</p>
<h2 id="learning-more-about-data-modeling">Learning more about Data Modeling</h2>
<p>Below are some resources and helpful comments I gathered from you all; thank you for the valuable feedback throughout writing this article.</p>
<h3 id="resources">Resources</h3>
<ul>
<li>MongoDB Courses and Trainings, offering comprehensive guides to understanding and mastering MongoDB. <a href="https://learn.mongodb.com/courses/m320-mongodb-data-modeling" target="_blank" rel="noopener noreffer">Link</a></li>
<li>Book: &ldquo;Agile Data Warehouse Design: Collaborative Dimensional Modeling, from Whiteboard to Star Schema&rdquo; by Lawrence Corr. This is a foundational text for understanding dimensional modeling in an agile context. <a href="https://www.amazon.com/Agile-Data-Warehouse-Design-Collaborative/dp/0956817203" target="_blank" rel="noopener noreffer">Link</a></li>
<li>Book: &ldquo;Data Warehouse Blueprints: Business Intelligence in der Praxis&rdquo; by Dani Schnider, Claus Jordan, Peter Welker, and Joachim Wehner. This is an excellent resource for German speakers. <a href="https://www.amazon.com/Data-Warehouse-Blueprints-Business-Intelligence-ebook/dp/B01M0YX6AS/" target="_blank" rel="noopener noreffer">Link</a></li>
<li>Book: &ldquo;<a href="https://www.amazon.com/Data-Reality-Perspective-Perceiving-Information/dp/1935504215" target="_blank" rel="noopener noreffer">Data and Reality</a>&rdquo; - a timeless guide to data modeling, recommended by <a href="https://twitter.com/JennaJrdn" target="_blank" rel="noopener noreffer">Jenna Jordan</a>.</li>
<li>Video: &ldquo;Data Modeling in the Modern Data Stack&rdquo; - a valuable resource for understanding the current state of data modeling. <a href="https://youtu.be/IdCmMkQLvGA" target="_blank" rel="noopener noreffer">Link</a></li>
<li>Article: &ldquo;Introducing Entity-Centric Data Modeling for Analytics&rdquo; on Preset - a good read for understanding an entity-centric approach to data modeling. <a href="https://preset.io/blog/introducing-entity-centric-data-modeling-for-analytics/" target="_blank" rel="noopener noreffer">Link</a></li>
<li>Website: <a href="http://agiledata.io/" target="_blank" rel="noopener noreffer">AgileData.io</a> by Shane Gibson - a resource for reducing the complexity of managing data for Leaders, Analysts, and Consultants. <a href="https://agiledata.io/" target="_blank" rel="noopener noreffer">Link</a></li>
<li>Podcasts: &ldquo;Shane Gibson - Making Data Modeling Accessible - The Joe Reis Show&rdquo; on Spotify. <a href="https://open.spotify.com/episode/4DNyy4cIttEFMUEWjKEHqV?si=df46c60e7d334e0e" target="_blank" rel="noopener noreffer">Link</a></li>
</ul>
<h3 id="helpful-comments">Helpful comments</h3>
<ul>
<li>Use well-defined ontologies that describe your business and relationships between components using common industry concepts, as suggested by Rayner Däppen. <a href="https://www.linkedin.com/feed/update/urn:li:activity:7044294859238567936?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7044294859238567936%2C7044383659683926016%29" target="_blank" rel="noopener noreffer">Link to comment</a></li>
<li>Keep your data models updated, aligned, documented, validated, and verified with the business. This will ensure the models accurately reflect the current state of the company.</li>
<li>Consider where to build the semantic/metrics layer to allow for fast, interactive analytics/dashboards and how to make it available to multiple tools to avoid various definitions. <a href="https://twitter.com/Triamus1/status/1638612934455599119" target="_blank" rel="noopener noreffer">Link to comment</a></li>
<li>From &ldquo;<a href="https://youtu.be/IdCmMkQLvGA" target="_blank" rel="noopener noreffer">Data Modeling in the Modern Data Stack</a>&rdquo; - computation is now the expensive part of data modeling, not storage. A hybrid approach is often used in modern data stacks to balance complexity, computational cost, data redundancy, and adaptability.</li>
</ul>
<hr>
<pre class=""><em>Originally published at <a href="https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-architecture-pattern-tools/" target="_blank" rel="noopener noreferrer">Airbyte.com</a></em></pre>
]]></description>
</item>
<item>
    <title>Data Modeling – The Unsung Hero of Data Engineering: Modeling Approaches and Techniques (Part 2)</title>
    <link>https://www.ssp.sh/blog/data-modeling-for-data-engineering-approaches-techniques/</link>
    <pubDate>Wed, 03 May 2023 22:17:22 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/data-modeling-for-data-engineering-approaches-techniques/</guid><enclosure url="https://www.ssp.sh/blog/data-modeling-for-data-engineering-approaches-techniques/images/data-modeling-approaches-and-techniques.jpg" type="image/jpeg" length="0" /><description><![CDATA[<p>In case you missed Part 1, <a href="/blog/data-modeling-for-data-engineering-introduction/" rel="">An Introduction to Data Modeling</a>, make sure to check first, where we discussed the importance of data modeling in data engineering, the history, and the increasing complexity of data. We have also touched upon the significance of understanding the data landscape, its challenges, and much more.</p>
<p>As we delve deeper into this topic, Part 2 will focus on data modeling approaches and techniques. These methods play a vital role in effectively designing and structuring data models, allowing organizations to gain valuable insights from their data.</p>
<p>We will discuss various data modeling approaches, such as top-down and bottom-up, and specific data modeling techniques, like dimensional modeling, data vault modeling, and more. We will address the common challenges faced in data modeling and how to mitigate them. By understanding these approaches and techniques, data engineers can better navigate the complexities of data modeling and design systems that cater to their organization&rsquo;s specific needs and goals.</p>
<h2 id="data-modeling-approaches">Data Modeling Approaches</h2>
<p>When designing your data model, you typically begin with a top-down approach. You sit with your business owners and domain experts and ask them questions to understand what entities your organization will implement, e.g., customer, product, and sales. Bottom-up is the alternative, where you go from a physical data model. We&rsquo;ll discuss it in a minute.</p>
<p>Agreeing up front on how you visualize your data flow and your data is essential, and it should be brainstormed between the domain experts and the data engineers, business intelligence, or analytics engineers involved. It will help you design the model in steps, avoid siloed modeling done by engineers only, and define common entities used company-wide.</p>
<p>It also helps to define <a href="https://glossary.airbyte.com/term/key-performance-indicator-kpi/" target="_blank" rel="noopener noreffer">KPIs</a> and <a href="https://glossary.airbyte.com/term/key-performance-indicator-kpi/" target="_blank" rel="noopener noreffer">Metrics</a> upfront, and this should be done early. These are agreed-upon goals you want to achieve together. As a data modeler, this is where you get the dimensions you need and the fact <a href="https://glossary.airbyte.com/term/granularity" target="_blank" rel="noopener noreffer">granularity</a>. E.g., are we talking monthly, weekly, or daily revenue numbers? Are these plotted on a map per city or country?</p>
<p>The challenge of data modeling is <a href="https://glossary.airbyte.com/term/data-literacy" target="_blank" rel="noopener noreffer">Data Literacy</a>. Data literacy is the ability to derive meaningful information from data, just as literacy, in general, is the ability to derive information from the written word—or, said differently, extrapolating the business value from the data given. Let&rsquo;s figure out how to mitigate those problems with processes and techniques.</p>
<div class="details admonition warning open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-warning"></i>Terminology is important<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">It&rsquo;s essential to get the terminology right so that everyone understands what a customer is; if not, you need to be more specific and say oss_customer and enterprise_customer, for example. We don&rsquo;t need to reinvent these entities; there are many reference models you can borrow from.</div>
        </div>
    </div>
<h3 id="conceptual-logical-and-physical-data-models">Conceptual, Logical, and Physical Data Models</h3>
<p>Let&rsquo;s start with an overview: the conceptual data model represents a high-level view (top-down), the logical data model provides a more detailed representation of data relationships, and the physical data model defines the actual implementation in the database or data storage system (bottom-up). More on the top-down and bottom-up approaches in the following chapters.</p>
<figure>
<a target="_blank" href="/blog/data-modeling-for-data-engineering-approaches-techniques/images/how-data-modeling-works.png" title="/blog/data-modeling-for-data-engineering-approaches-techniques/images/how-data-modeling-works.png">

</a><figcaption class="image-caption">The Conceptual, Logical, and Physical Data Modeling Flow</figcaption>
</figure>
<p>A common approach is to start with the <em>conceptual</em> model, where you define the entities in your organization from a top-down and high-level perspective and model them together. Usually, the <a href="https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model" target="_blank" rel="noopener noreffer">Entity Relationship Diagram (ERD)</a> is used for this.</p>
<p>Later, you move to a logical model where you add more details, such as the IDs: do you create a surrogate primary key or use the business key from the source system? Are you modeling the customer as normalized tables (customer, address, geography), or do you keep it simple with duplications? How do you implement the change history of your data; are you using <a href="https://glossary.airbyte.com/term/slowly-changing-dimension-scd" target="_blank" rel="noopener noreffer">Slowly Changing Dimension (Type 2)</a> or snapshotting all dimensions?</p>
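<p>To make the Slowly Changing Dimension (Type 2) option more concrete, here is a minimal sketch of what such a dimension table could look like in SQL; the table and column names are illustrative assumptions, not prescriptions.</p>
<pre class=""><code>-- Hypothetical Type 2 slowly changing customer dimension:
-- every change creates a new row instead of overwriting the old one.
create table dim_customer (
    customer_sk    bigint primary key,   -- surrogate key, one per version of the customer
    customer_bk    varchar(50) not null, -- business key from the source system
    customer_name  varchar(200),
    customer_city  varchar(100),
    valid_from     date not null,        -- when this version became effective
    valid_to       date,                 -- null (or a far-future date) for the current version
    is_current     boolean not null
);
</code></pre>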
<p>The <em>physical</em> model is where you implement or generate the model in the destination database system, respecting each database&rsquo;s slightly different syntax.</p>
<figure>
<a target="_blank" href="/blog/data-modeling-for-data-engineering-approaches-techniques/images/simple.jpg" title="/blog/data-modeling-for-data-engineering-approaches-techniques/images/simple.jpg">

</a><figcaption class="image-caption">Simple flow from conceptual to the physical data model.</figcaption>
</figure>
<p>The benefit of a conceptual model in a larger enterprise organization of sufficient complexity: it gives you a great framework to organize the logical and <strong>physical models around, decoupled from the source systems</strong>.</p>
<div class="details admonition question open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-question"></i>Logical Layer == Semantic Layer?<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">As a <a href="https://glossary.airbyte.com/term/semantic-layer" target="_blank" rel="noopener noreffer">Semantic Layer</a> is essentially a logical business layer, it is worth considering using one during the modeling phase. More on this in <a href="/blog/data-modeling-for-data-engineering-architecture-pattern-tools-future/" rel="">Part 3</a>.</div>
        </div>
    </div>
<h4 id="conceptual-data-model-top-down-approach">Conceptual Data Model: Top-Down Approach</h4>
<p>In a top-down approach, you start with a high-level view of the organization&rsquo;s data requirements. This involves working with business owners, domain experts, and other stakeholders to understand the business needs and create a conceptual data model.</p>
<p>Then, you iteratively refine the model, moving from conceptual to logical and finally to the physical data model. This approach is <strong>particularly suitable</strong> when there is a <strong>clear understanding of the business requirements and goals</strong>.</p>
<h4 id="physical-data-model-bottom-up-approach">Physical Data Model: Bottom-Up Approach</h4>
<p>On the other hand, the bottom-up approach begins with analyzing the existing data sources, such as databases, spreadsheets, or different structured and unstructured data. Based on this analysis, you create a physical data model that reflects the current data storage and relationships.</p>
<p>Next, you work your way up, creating a logical data model to represent the business requirements and a conceptual model to provide a high-level view of the data. The bottom-up approach is <strong>beneficial when dealing with legacy systems</strong> or when there is limited knowledge of the organization&rsquo;s data requirements.</p>
<h4 id="combining-top-down-and-bottom-up-approaches-in-data-modeling">Combining top-down and bottom-up approaches in data modeling</h4>
<p>Combining top-down and bottom-up approaches may be the best solution in many cases. By blending these methods, you can capitalize on the strengths of each approach and create a comprehensive data model that meets your organization&rsquo;s needs.</p>
<p>Regardless of your chosen approach, it&rsquo;s essential to maintain <strong>clear communication among all stakeholders</strong> and ensure that the data model aligns with the organization&rsquo;s objectives and supports effective decision-making.</p>
<h3 id="hierarchical-data-modeling-network-data-modeling-and-object-role-modeling">Hierarchical Data Modeling, Network Data Modeling and Object-Role Modeling?</h3>
<p>While searching for other approaches, I came across hierarchical, network, and object-role data modeling.</p>
<figure>
<a target="_blank" href="/blog/data-modeling-for-data-engineering-approaches-techniques/images/other-data-modeling.png" title="/blog/data-modeling-for-data-engineering-approaches-techniques/images/other-data-modeling.png">

</a><figcaption class="image-caption">Different Data Models | Image from <a href="https://en.wikipedia.org/wiki/Data_model" target="_blank" rel="noopener noreffer">Wikipedia</a></figcaption>
</figure>
<p><strong>Hierarchical data modeling</strong> organizes data in a tree-like structure, with parent-child relationships between entities. It is suitable for representing hierarchical data or nested relationships, such as organizational structures or file systems.</p>
<p><strong>Network data modeling</strong> models data as interconnected nodes in a graph, allowing for complex relationships between entities. It helps represent many-to-many relationships and networks, such as social networks, transportation networks, or recommendation systems.</p>
<p><strong>Object-role modeling</strong> is an attribute-free, fact-based data modeling method that ensures a correct system and enables the derivation of ERD, UML, and semantic models while inherently achieving database normalization.</p>
<p>Some <a href="https://en.wikipedia.org/wiki/Data_model#Types" target="_blank" rel="noopener noreffer">more</a>, such as the flat model and the object–relational model, are listed on Wikipedia under data models. A small sketch of the hierarchical pattern follows below.</p>
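<p>As a minimal, hypothetical illustration of the hierarchical approach, a tree can be stored relationally as a simple parent-child (adjacency-list) table; the table and column names are made up for the example.</p>
<pre class=""><code>-- Hypothetical adjacency-list table for an organizational hierarchy.
create table org_unit (
    org_unit_id  int primary key,
    name         varchar(200) not null,
    parent_id    int references org_unit (org_unit_id)  -- null for the root node
);
</code></pre>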
<h2 id="data-modeling-techniques">Data Modeling Techniques</h2>
<p>In <a href="/blog/data-modeling-for-data-engineering-introduction/" rel="">Part 1</a>, we introduced the <a href="/blog/data-modeling-unsung-hero-data-engineering-introduction#different-levels-of-data-modeling" rel="">various levels</a> of data modeling, including generation or source database design, data integration, ETL processes, data warehouse schema creation, data lake structuring, BI tool presentation layer design, and machine learning or AI feature engineering. We also discussed different approaches to data modeling in the previous chapter. This chapter will delve deeper into the practical techniques used in the data modeling process.</p>
<p>These techniques are primarily employed in batch-related processes and cater to the design and modeling of <a href="https://glossary.airbyte.com/term/data-warehouse" target="_blank" rel="noopener noreffer">Data Warehouses</a>, <a href="https://glossary.airbyte.com/term/data-lake/" target="_blank" rel="noopener noreffer">Lakes</a>, or <a href="https://glossary.airbyte.com/term/data-lakehouse/" target="_blank" rel="noopener noreffer">Lakehouses</a>. We will explore each technique’s unique benefits and applications in modern data engineering.</p>
<h3 id="dimensional-modeling">Dimensional Modeling</h3>
<p>There are many different techniques, but <a href="https://glossary.airbyte.com/term/dimensional-modeling" target="_blank" rel="noopener noreffer">Dimensional Modeling</a> is probably the <strong>most famous</strong> and the one that has stood out the longest. Its birth came with the inception of the data warehouse and the release of the iconic <a href="https://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional-ebook/dp/B00DRZX6XS" target="_blank" rel="noopener noreffer">The Datawarehouse Toolkit</a> book in 1996.</p>
<figure>
<a target="_blank" href="/blog/data-modeling-for-data-engineering-approaches-techniques/images/books.png" title="/blog/data-modeling-for-data-engineering-approaches-techniques/images/books.png">

</a><figcaption class="image-caption">The History of the Data Warehouse Toolkit book in Perspective to Cloud Data Warehouse | Image by Josh and Sydney from above mentioned talk about <a href="https://docs.google.com/presentation/d/1HLP1FfCNZJUIF7JgT1ote5LaG2tDvsNRfrA9vIkf2n4/edit#slide=id.g15a36e3d63c_0_385" target="_blank" rel="noopener noreffer">Babies and bathwater</a></figcaption>
</figure>
<p>The data space has changed a lot since then, so the question arises, &ldquo;Is dimensional modeling still needed within data engineering compared to its popularity way back?&rdquo; Let’s find out in the following chapters.</p>
<p>A quick reminder of how data modeling looked for a long time:</p>
<figure>
<a target="_blank" href="/blog/data-modeling-for-data-engineering-approaches-techniques/images/simple-startschema.png" title="/blog/data-modeling-for-data-engineering-approaches-techniques/images/simple-startschema.png">

</a><figcaption class="image-caption">The retail sales star schema, example from Kimball | Image by <a href="https://www.researchgate.net/figure/The-retail-sales-star-schema-example-from-Kimball-02_fig2_277060637" target="_blank" rel="noopener noreffer">Research Gate</a>.</figcaption>
</figure>
<h4 id="data-modeling-vs-dimensional-modeling">Data Modeling vs. Dimensional Modeling</h4>
<p>Let’s start with the difference between dimensional and data modeling to understand why we even discuss it.</p>
<ul>
<li><strong>Data modeling</strong> is the broad term that encompasses various techniques and methodologies for representing and modeling data across a company.</li>
<li><strong>Dimensional modeling</strong> is a specific approach to data modeling that is particularly suited for data warehousing, business intelligence (BI) applications, and newer data engineering data models.</li>
</ul>
<h4 id="what-is-dimensional-modeling">What is Dimensional Modeling</h4>
<p>So what, then, is dimensional modeling? Dimensional modeling focuses on creating a simplified, intuitive structure for data by <strong>organizing data into facts and dimensions</strong>, making it easier for end-users to query and analyze the data.</p>
<p>In dimensional modeling, data is typically stored in a star schema or snowflake schema (more later), where a central fact table contains the quantitative data, and it is connected to multiple dimension tables, each representing a specific aspect of the data’s context. This structure enables efficient querying and aggregation of data for analytical purposes.</p>
<div class="details admonition example open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-example"></i>Context<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content">Context on facts and dimensions: facts represent quantitative or measurable data (e.g., sales, revenue, etc.), and dimensions represent the context or descriptive attributes (e.g., customer, product, time, etc.).</div>
        </div>
    </div>
<p>The dimensional modeling approach focuses on identifying the key business entities and modeling these in an easy-to-understand way for consumers.</p>
<p>With the <a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dw-bi-lifecycle-method/" target="_blank" rel="noopener noreffer">DW/BI Lifecycle Methodology</a>, created later in the 90s, Kimball’s core ideas still apply to this very day:</p>
<ul>
<li>Focus on adding <em>business</em> value across the enterprise.</li>
<li><em>Dimensionally</em> structure the data that are delivered to the business.</li>
<li>Iteratively develop in manageable <em>lifecycle</em> increments rather than attempting a Big Bang approach.</li>
</ul>
<p><strong>Key concepts</strong> of dimensional modeling could fill an article of their own, and plenty of content already exists, so I leave you here with some links to learn more. For example, around dimensions with <a href="https://www.kimballgroup.com/2011/06/design-tip-135-conformed-dimensions-as-the-foundation-for-agile-data-warehousing/" target="_blank" rel="noopener noreffer">Conformed Dimensions</a>, <a href="https://www.kimballgroup.com/2009/06/design-tip-113-creating-using-and-maintaining-junk-dimensions/" target="_blank" rel="noopener noreffer">Junk Dimension</a> or <a href="https://en.wikipedia.org/wiki/Slowly_changing_dimension" target="_blank" rel="noopener noreffer">Slowly Changing Dimension</a>, <a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/additive-semi-additive-non-additive-fact/" target="_blank" rel="noopener noreffer">Additive, semi-additive, and non-additive facts</a>, and many more; check out <a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/" target="_blank" rel="noopener noreffer">Dimensional Modeling Techniques</a> for more.</p>
<h4 id="why-is-dimensional-modeling-still-relevant-today">Why is Dimensional Modeling Still Relevant Today?</h4>
<p>But is dimensional modeling and all its associated concepts still relevant today? The answer is a resounding yes—perhaps even more so than before. As you&rsquo;ve read in the preceding chapters, dimensional modeling aims to achieve a focus on business value. In today&rsquo;s rapidly evolving world, this crucial aspect is sometimes overlooked.</p>
<p>By incorporating a robust dimensional model at the core of every data project, data engineers are compelled to <strong>consider critical questions related to granularity, entities, metrics</strong>, and more. Addressing these essential aspects upfront and working towards them is invaluable for achieving business goals and driving project success.</p>
<h4 id="star-schema-vs-snowflake-schema-kimball-and-inmon">Star Schema vs. Snowflake Schema: Kimball and Inmon</h4>
<p>In data warehousing, the star and snowflake schemas are standard data modeling techniques that are highly related to dimensional modeling.</p>
<p>The <strong>star schema</strong>, typically associated with Kimball&rsquo;s approach, has a central fact table connected to dimension tables, emphasizing staging and <strong>denormalized core</strong> data. This bottom-up approach allows for quicker access to denormalized data for analysis. Conversely, the <strong>snowflake schema</strong> further normalizes dimension tables, creating a more complex structure. This schema aligns with Inmon&rsquo;s approach, emphasizing a <strong>highly normalized core</strong> close to the source system, suitable for large-scale data warehousing projects.</p>
<p>The nuances between the star and snowflake schemas and their corresponding approaches can guide data professionals in designing and implementing data warehouses, but won&rsquo;t dictate success or failure. Both schemas have their merits, and their choice depends on personal preference or specific project requirements. Understanding these differences allows data professionals to make informed decisions tailored to their unique projects.</p>
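<p>To make the star schema side of this comparison concrete, here is a minimal, hypothetical sketch in SQL DDL: one central fact table keyed to a few denormalized dimension tables. All table and column names are illustrative assumptions.</p>
<pre class=""><code>-- Hypothetical star schema: denormalized dimensions around one fact table.
create table dim_date     (date_sk int primary key, full_date date, month int, year int);
create table dim_product  (product_sk int primary key, product_name varchar(200), category varchar(100));
create table dim_customer (customer_sk int primary key, customer_name varchar(200), city varchar(100));

create table fct_sales (
    date_sk       int references dim_date (date_sk),
    product_sk    int references dim_product (product_sk),
    customer_sk   int references dim_customer (customer_sk),
    quantity      int,             -- additive fact
    sales_amount  numeric(12, 2)   -- additive fact
);
</code></pre>
<p>A snowflake schema would normalize the dimensions further, for example splitting <code>category</code> out of <code>dim_product</code> into its own table.</p>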
<h3 id="data-vault-modeling-a-flexible-and-dynamic-approach">Data Vault Modeling: A Flexible and Dynamic Approach</h3>
<p>Compared to dimensional modeling, <a href="https://en.wikipedia.org/wiki/Data_vault_modeling" target="_blank" rel="noopener noreffer">Data Vault</a> modeling is a method that addresses the challenges of modern data warehousing, mainly when <strong>dealing with big data and fast-changing data</strong> connection points. This hybrid data modeling approach combines the best aspects of 3NF (Third Normal Form) and Star Schema methodologies, resulting in a scalable, flexible, and agile solution for building data warehouses and data marts.</p>
<p>The primary components of a Data Vault model are Hubs, Satellites, and Links. Hubs represent business keys, Satellites store descriptive attributes, and Links define relationships between Hubs. This unique structure allows for rapid ingestion of new data sources, supports historical data tracking, and is well-suited for large-scale data integration and warehousing projects.</p>
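<p>A minimal, hypothetical sketch of these three building blocks in SQL follows; the hash keys, table names, and columns are assumptions for illustration and are simplified compared to a full Data Vault implementation.</p>
<pre class=""><code>-- Hypothetical Data Vault sketch: hub (business key), satellite (attributes), link (relationship).
create table hub_customer (
    customer_hk    char(32) primary key,     -- hash of the business key
    customer_bk    varchar(50) not null,     -- business key from the source
    load_date      timestamp not null,
    record_source  varchar(100) not null
);

create table sat_customer_details (
    customer_hk    char(32) references hub_customer (customer_hk),
    load_date      timestamp not null,
    customer_name  varchar(200),
    customer_city  varchar(100),
    record_source  varchar(100) not null,
    primary key (customer_hk, load_date)      -- descriptive attributes are versioned over time
);

create table link_customer_order (
    link_hk        char(32) primary key,
    customer_hk    char(32) references hub_customer (customer_hk),
    order_hk       char(32),                  -- would reference a hub_order table
    load_date      timestamp not null,
    record_source  varchar(100) not null
);
</code></pre>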
<p>Moreover, Data Vault modeling has been increasingly utilized as a governed Data Lake due to its ability to adapt to changing business environments, support massive data sets, simplify data warehouse design complexities, and increase usability for business users. By modeling after the business domain, Data Vault ensures that new data sources can be added without impacting the existing design. As a result, Data Vault modeling, in conjunction with Data Warehouse Automation, is proving to be a highly effective and efficient approach to data management in contemporary data-driven business landscapes.</p>
<h3 id="anchor-modeling">Anchor Modeling</h3>
<p><a href="https://www.anchormodeling.com/" target="_blank" rel="noopener noreffer">Anchor Modeling</a> is an agile data modeling technique designed, like data vault modeling, to handle evolving data structures. It is built around storing each attribute in a separate table, allowing for more flexibility when dealing with schema changes. This approach is beneficial when the data model must frequently evolve and adapt to new requirements. Anchor Modeling is known for efficiently handling schema changes and reducing data redundancy.</p>
<p>Compared to data vault modeling, anchor modeling focuses on the change in information both in structure and content. It separates identities (anchors), context (attributes), relationships (ties), and finite value domains (knots). The focus is on the flexibility and temporal capabilities of the data model, capturing changes in information over time.</p>
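<p>As a minimal, hypothetical sketch of the anchor modeling idea (leaving out ties and knots for brevity), the anchor holds only the identity while each attribute lives in its own historized table; all names are made up for the example.</p>
<pre class=""><code>-- Hypothetical anchor modeling sketch: one anchor, one table per attribute.
create table customer_anchor (
    customer_id  bigint primary key
);

create table customer_name_attribute (
    customer_id    bigint references customer_anchor (customer_id),
    customer_name  varchar(200) not null,
    changed_at     timestamp not null,       -- each change adds a new row
    primary key (customer_id, changed_at)
);

create table customer_city_attribute (
    customer_id    bigint references customer_anchor (customer_id),
    customer_city  varchar(100) not null,
    changed_at     timestamp not null,
    primary key (customer_id, changed_at)
);
</code></pre>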
<h3 id="bitemporal-modeling-a-comprehensive-approach-to-handling-historical-data">Bitemporal Modeling: A Comprehensive Approach to Handling Historical Data</h3>
<p>A more niche but still valid modeling technique is <a href="https://roelantvos.com/blog/a-gentle-introduction-to-bitemporal-data-challenges/" target="_blank" rel="noopener noreffer">Bitemporal Modeling</a>.</p>
<p>Bitemporal modeling is a specialized technique that handles historical data along two distinct timelines. This approach enables organizations to access data from <strong>different vantage points in time.</strong> It allows for the recreation of past reports as they appeared and how they should have appeared, given any corrections made to the data after its creation. Bitemporal modeling is beneficial in sectors like financial reporting, where maintaining accurate historical records is critical.</p>
<p>Focusing on the completeness and accuracy of data, bitemporal modeling allows for creating comprehensive audit trails. Using bitemporal structures as the fundamental components, this modeling technique results in databases with consistent temporality for all data. All data becomes immutable, enabling queries to return the most accurate data possible: data as it was known at any point in time, plus information about when and why the most accurate data changed.</p>
<p>Bitemporal modeling can be implemented using relational and graph databases, and while it is different from dimensional modeling, it complements database normalization. The <a href="https://en.wikipedia.org/wiki/SQL:2011" target="_blank" rel="noopener noreffer">SQL:2011</a> standard includes language constructs for working with bitemporal data. Read more on <a href="https://roelantvos.com/blog/a-gentle-introduction-to-bitemporal-data-challenges/" target="_blank" rel="noopener noreffer">a gentle introduction to bitemporal data challenges</a>.</p>
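<p>Here is a minimal sketch of the bitemporal idea, using plain columns instead of the dedicated SQL:2011 period syntax; the table, its columns, and the example query are hypothetical.</p>
<pre class=""><code>-- Hypothetical bitemporal table: every row carries two timelines,
-- when the fact was true in the real world (valid time) and
-- when the database knew about it (system/transaction time).
create table account_balance_bitemporal (
    account_id   varchar(50) not null,
    balance      numeric(14, 2) not null,
    valid_from   date not null,        -- real-world effective start
    valid_to     date not null,        -- real-world effective end
    system_from  timestamp not null,   -- when this version was recorded
    system_to    timestamp not null,   -- when it was superseded (far-future date if current)
    primary key (account_id, valid_from, system_from)
);

-- "What did we believe on 2023-01-15 about the balance valid on 2023-01-01?"
select balance
from account_balance_bitemporal
where account_id = 'A-1'
  and date '2023-01-01' &gt;= valid_from
  and date '2023-01-01' &lt;  valid_to
  and timestamp '2023-01-15 00:00:00' &gt;= system_from
  and timestamp '2023-01-15 00:00:00' &lt;  system_to;
</code></pre>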
<h3 id="entity-centric-data-modeling-ecm">Entity-Centric Data Modeling (ECM)</h3>
<p>A relatively new modeling technique is <a href="https://preset.io/blog/introducing-entity-centric-data-modeling-for-analytics/" target="_blank" rel="noopener noreffer">Entity-Centric Data Modeling (ECM)</a> introduced by <a href="https://glossary.airbyte.com/term/maxime-beauchemin/" target="_blank" rel="noopener noreffer">Maxime Beauchemin</a>. Entity-centric data modeling (ECM) elevates the core idea of an &ldquo;entity&rdquo; (i.e., user, customer, product, business unit, ad campaign, etc.) at the very top for analytics data modeling.</p>
<p>It&rsquo;s interesting because, at its core, it builds on precisely the strengths of dimensional modeling discussed above. But as dimensional modeling is old, it also lacks some features that today&rsquo;s world needs. That&rsquo;s why Max updated it and merged it with <a href="https://en.wikipedia.org/wiki/Feature_engineering" target="_blank" rel="noopener noreffer">Feature Engineering</a>, as used in ML projects. You can find more in his <a href="https://preset.io/blog/introducing-entity-centric-data-modeling-for-analytics/" target="_blank" rel="noopener noreffer">latest article</a> or my comments in the <a href="https://airbyte.com/content-hub/blog/datanews-filter-navigating-entity-centric-modeling" target="_blank" rel="noopener noreffer">DataNews.filter() newsletter</a> 📰.</p>
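<p>A minimal, hypothetical sketch of what an entity-centric table could look like: one row per entity per period, with metrics and ML-style features attached directly to the entity. The table and columns are assumptions for illustration, not taken from Max&rsquo;s article.</p>
<pre class=""><code>-- Hypothetical entity-centric table at a daily customer grain.
create table customer_daily_summary (
    customer_id         bigint not null,
    as_of_date          date   not null,
    orders_last_7d      int,               -- metric / feature
    revenue_last_28d    numeric(12, 2),    -- metric / feature
    is_active_last_30d  boolean,           -- metric / feature
    primary key (customer_id, as_of_date)
);
</code></pre>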
<h2 id="common-problems-with-data-modeling">Common Problems with Data Modeling</h2>
<p>Data modeling is easy to neglect; <strong>assessing the consequences can take time and effort</strong>. The image below illustrates that if you initially ignore poor data modeling and architecture decisions, you&rsquo;ll likely notice problems in the last mile, thinking they might be due to the tools or insights. However, the fundamental issues primarily originate in the first part of the data analytics cycle.</p>
<figure>
<a target="_blank" href="/blog/data-modeling-for-data-engineering-approaches-techniques/images/pain-points-data-modeling.jpg" title="/blog/data-modeling-for-data-engineering-approaches-techniques/images/pain-points-data-modeling.jpg">

</a><figcaption class="image-caption">I love this image by <a href="https://twitter.com/mattarderne/status/1604528546784870402" target="_blank" rel="noopener noreffer">Matt Arderne on Twitter</a>. Image originally from <a href="https://www.forbes.com/sites/brentdykes/2022/01/12/data-analytics-marathon-why-your-organization-must-focus-on-the-finish/" target="_blank" rel="noopener noreffer">Forbes</a></figcaption>
</figure>
<p>Here are some critical data modeling problems.</p>
<p><a href="https://en.wikipedia.org/wiki/Business_rule" target="_blank" rel="noopener noreffer"><strong>Business rules</strong></a>, which are specific to how things are done in a particular place, are often embedded in the structure of a data model. This leads to a problem where small changes in business processes result in significant changes in computer systems and interfaces. To address this issue, business rules should be implemented flexibly, avoiding complex dependencies and allowing the data model to adapt efficiently to changes in business processes.</p>
<p>Another common issue is that <strong>entity types are often not identified or incorrectly identified</strong>. This can cause data replication, duplicated data structures, and functionality. These duplications increase the costs of development and maintenance. Data definitions should be explicit and easy to understand to prevent this problem, minimizing misinterpretation and duplication.</p>
<p>Data models for different systems can <strong>vary significantly</strong>, creating a need for complex interfaces between systems that share data. These interfaces can account for between 25 and 70% of the cost of current systems. To address this, the required interfaces should be considered as part of data model design from the start, since data models are rarely used in isolation but rather through interfaces between different systems.</p>
<p>Lastly, data cannot be shared electronically with customers and suppliers because <strong>the structure and meaning of data still need to be standardized</strong>. To maximize the value of an implemented data model, it is crucial to define standards that ensure data models meet business needs and maintain consistency. This standardization will enable efficient data sharing between various stakeholders.</p>
<p>To mitigate these issues, <strong>tight integration into the overall data architecture and patterns</strong> can reduce friction. We&rsquo;ll explore these in the next part, Part 3.</p>
<h2 id="whats-next-in-the-last-part">What&rsquo;s Next in the Last Part?</h2>
<p>Throughout Part 2, we have explored the various data modeling approaches and techniques that serve as the backbone of data engineering. From top-down and bottom-up approaches to conceptual, logical, and physical data models, understanding these methods is crucial for effective data modeling. Techniques like dimensional, data vault, and bitemporal modeling offer unique benefits and cater to a wide range of use cases in modern data engineering. As we have seen, addressing common problems in data modeling and ensuring tight integration into the overall data architecture is essential for success.</p>
<p>In the next Part of this series, &ldquo;<a href="/blog/data-modeling-for-data-engineering-architecture-pattern-tools-future/" rel="">Data Architecture Patterns, Tools, and The Future—Part 3</a>&rdquo;, we will delve into data architecture patterns, tools, and the future of data modeling. Stay tuned as we explore the fascinating world of data modeling within data engineering and its impact on the future of data-driven decision-making.</p>
<hr>
<pre class=""><em>Originally published at <a href="https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-approaches-and-techniques/" target="_blank" rel="noopener noreferrer">Airbyte.com</a></em></pre>
]]></description>
</item>
</channel>
</rss>
