dbt (data build tool) lets you define macros and integrate functionality beyond plain SQL's capabilities for advanced use cases. Macros in Jinja are reusable pieces of code, comparable to functions in other programming languages.
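A minimal sketch of such a macro (the macro name, file path, and column names here are illustrative, not from an actual project):

```sql
-- macros/cents_to_dollars.sql (hypothetical macro)
-- Defines a reusable snippet that converts an integer cents column to dollars.
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)
{% endmacro %}

-- In a model, the macro is then called like a function:
-- select {{ cents_to_dollars('amount_cents') }} as amount_usd from payments
```

At compile time dbt expands the macro call into plain SQL, so the database only ever sees the rendered query.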
dbt is a pivotal component of the Modern Data Stack: it handles the Transformation aspect of ETL exclusively with SQL. For Python-based tasks, integration with a data orchestrator such as Dagster is necessary.
# Why dbt? Data Modeling with SQL
dbt is a small toolset around your database that has gained immense popularity and has become the de facto standard for working with SQL. Why, you might ask? Besides Python, SQL is the most used language among data engineers: it is declarative, its basics are easy to learn, and many business analysts or people working with Excel and similar tools already know a little of it.
The declarative approach is handy because you only define the what: you determine which columns you want in the SELECT and which table to query in the FROM clause. You can do more advanced things with WHERE, GROUP BY, and so on, but you do not need to care about the how. You do not need to know on which partition, segment, or storage the data sits, or whether using an index makes sense. All of that is handled by the query optimizer of Postgres (or any other database supporting SQL).
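For example, a query like the following (table and column names are made up) only states the what; the optimizer decides how to scan, filter, and aggregate:

```sql
-- Declarative: describe the result you want, not the execution plan.
select customer_id,
       count(*) as order_count
from orders
where status = 'completed'
group by customer_id
```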
# Downside of SQL
But let’s face it: SQL also has its downsides. If you have worked extensively with SQL, you know the spaghetti code that usually results. A big issue is reusability: there are no variables you can set and reuse within plain SQL. If you are familiar with them, CTEs (common table expressions) give you better structure by letting you define a named query block and reuse it later. But this only works within one single query, and it is mainly handy when the query is already long.
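A sketch of the CTE structure described above (names are made up): the named block can be referenced later, but only within this one statement:

```sql
-- The CTE acts as a reusable block, scoped to this single query only.
with completed_orders as (
    select *
    from orders
    where status = 'completed'
)
select customer_id,
       sum(amount) as revenue
from completed_orders
group by customer_id
```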
But what if you’d like to define your facts and dimensions as separate queries and reuse them in other queries? You’d need to decouple the query from its storage: persist the result to disk and use that table in the FROM clause of the following query. But if you then change something in the query, or even rename it, you won’t notice the breakage in the dependent queries. And you’d have to figure out manually which queries depend on each other: there is no data lineage or dependency graph.
Staying organized with plain SQL takes a lot of work. The database itself offers little support here, since SQL statements are declarative: you need to work out yourself how to store them in Git and how to run them.
# The power of dbt
That’s where dbt comes into play. dbt lets you declare these dependencies within SQL. You can declaratively build one query on top of another, and you’ll get errors if one changes but a dependent one doesn’t. You get a lineage graph, unit tests, and more. It’s as if you had an assistant helping you do your job: software engineering practice stitched on top of SQL.
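A sketch of how such a dependency is expressed (model names are made up): instead of hard-coding a table on disk, one model references another via `ref()`, and dbt derives the dependency graph and lineage from these calls:

```sql
-- models/fct_orders.sql (hypothetical model)
-- ref() resolves to the materialized stg_orders model and
-- registers fct_orders as dependent on it in the DAG.
select
    order_id,
    customer_id,
    amount
from {{ ref('stg_orders') }}
```

If `stg_orders` is renamed or removed, compiling the project fails instead of silently breaking the downstream query.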
The danger to be aware of, since it becomes so easy to build models, is ending up with thousands of tables. Even though dbt catches many errors when pre-compiling your project, good data modeling techniques are essential to succeed.
dbt started at a small consultancy firm. It originated at RJMetrics in 2016 but really took off later, around 2018, when Fishtown Analytics promoted it more publicly and the market was readier for it. Fishtown Analytics later renamed itself to dbt Labs and raised a $150M Series C.
# Technical Features
- Utilizes Jinja template macros: a star macro, such as `star.sql`, effectively performs a
- Automatically discerns whether to merge with an existing table or create a new one.
- A notable, minor feature allowing Rust integration: GitHub - fal-ai/dbt-datafusion
Detailed in dbt Roadmap 2022.
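A star macro of this kind ships, for instance, with the dbt_utils package; a sketch of how it is typically used (the model and column names are made up):

```sql
-- models/stg_orders_slim.sql (hypothetical model)
-- dbt_utils.star expands to the full column list of the referenced
-- model, optionally excluding some columns -- like SELECT *,
-- but resolved explicitly at compile time.
select
    {{ dbt_utils.star(from=ref('stg_orders'), except=['_loaded_at']) }}
from {{ ref('stg_orders') }}
```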