Declarative Data Stack
Declarative Data Stack is a term introduced in the article The Rise of the Declarative Data Stack by Mike Driscoll and myself.
# What Is a Declarative Data Stack?
A declarative data stack is a set of tools whose configuration, taken together, can be thought of as a single function, such as run_stack(serve(transform(ingest()))),
that can recreate the entire data stack.
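As a rough sketch of that single-function idea (the stage names and data shapes below are illustrative, not any specific tool's API):

```python
# Minimal sketch of a declarative stack as function composition.
# Each stage is a pure function; the whole stack is one expression.

def ingest() -> list[dict]:
    # Pull raw records from a source (hard-coded here for illustration).
    return [{"user": "a", "amount": 10}, {"user": "b", "amount": 5}]

def transform(rows: list[dict]) -> list[dict]:
    # Pure transformation: no side effects, same input -> same output.
    return [{**r, "amount_usd": r["amount"] * 2} for r in rows]

def serve(rows: list[dict]) -> dict:
    # Expose an aggregate that a dashboard could read.
    return {"total_usd": sum(r["amount_usd"] for r in rows)}

def run_stack(result: dict) -> dict:
    # Outermost call; in a real stack this would deploy or serve the result.
    return result

print(run_stack(serve(transform(ingest()))))  # {'total_usd': 30}
```

Because every stage is a pure function of its inputs, running the expression again recreates the same stack state.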
Instead of having one framework for one piece, we want a combination of multiple tools combined into a single declarative data stack. Like the Modern Data Stack, but integrated the way Kubernetes integrates all infrastructure into a single deployment, like YAML.
We focus on the end-to-end Data Engineering Lifecycle, from ingestion to visualization. But what does the combination with declarative mean? Think of Functional Data Engineering, which leaves us in a place of confident reproducibility with few side effects (ideally none) and uses idempotency to restart functions in order to recover and reinstate a particular state with conviction, or to roll back to a specific version.
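A toy illustration of that idempotency property (the partition store below is hypothetical, standing in for a warehouse table): re-running the same build for the same partition overwrites it rather than appending, so recovery is just running it again.

```python
# Sketch of idempotent, functional-style partition writes.
# `store` stands in for a warehouse; keys are partition dates.

def build_partition(day: str, raw: list[int]) -> list[int]:
    # Pure function of its inputs: same (day, raw) -> same output.
    return sorted(x * 2 for x in raw)

def write_partition(store: dict, day: str, rows: list[int]) -> None:
    # Overwrite, never append: re-runs converge to the same state.
    store[day] = rows

store: dict[str, list[int]] = {}
raw = [3, 1, 2]
for _ in range(3):  # simulate retries after failures
    write_partition(store, "2024-10-17", build_partition("2024-10-17", raw))

print(store)  # {'2024-10-17': [2, 4, 6]} no matter how many times it ran
```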
More on The Rise of the Declarative Data Stack.
Other Naming
Tobiko calls it an Integrated Data Stack.
Dagster talks about Impedance Mismatch, and Data Asset oriented orchestration: The Data Engineering Impedance Mismatch | Dagster Blog
# Tools
# Close-Sourced DDSEs
Let’s start with closed source. One key point to note: most of what we’ve discussed here is something that most closed-source tools have implemented in one way or another. Because they’ve built one big monolith, this is relatively straightforward and the natural thing to do.
This can be more challenging and not immediately obvious with an open-source approach and numerous integration tools. Let’s now look at tools that have successfully implemented such features.
- Ascend: The platform automates up to 90% of repetitive data tasks using their DataAware Automation Engine.
- Palantir Foundry: One of the first lakehouse implementations before the term was coined. Enables real-time collaboration between data, analytics, and operational teams through a common logic data lake layer.
- Find more on Closed-Source Data Platforms and a fantastic read on composable data stacks on a new frontier by Voltron Data.
- Y42
- Starlake by Hayssam Saleh
Usually, the problem with closed-source software is that it is structured as a monolith, combining transformation logic with persisted database tables while keeping the underlying code unknown.
# Open-Source DDSEs
But even more interesting are the open-source tools I found[^1] - they are fantastic and built in the open. Not all of them may be truly declarative data stacks by our definition, but they all build on top of other tools and integrate them declaratively.
- DataForge: Write functional transformation pipelines by leveraging software engineering principles. It focuses on transformation and does not include a visualization tool[^3].
- Dashtool: A Lakehouse build tool that builds Iceberg tables from declarative SQL statements and generates Kubernetes workflows to keep these tables up-to-date. It handles Ingestion, Transformation, and Orchestration. Written in Rust and uses Datafusion.
- BoilingData: A local-first data processing native application designed for rapid data pipeline development. Enables data engineers to build and test pipelines quickly using tools like DuckDB, dbt, and dlt.
- HelloDATA BE: An enterprise data platform built on open-source tools based on the modern data stack. It uses state-of-the-art tools such as dbt for data modeling with SQL, Airflow to run and orchestrate tasks, Superset to visualize the BI dashboards, and JupyterHub for data science tasks. It includes multi-tenancy and full authentication and authorization, handled through a single web portal.
- SDF: Similar to DataForge, built on Rust and Datafusion. Tries to be the Typescript for SQL, creating faster development cycles and reliable results with a powerful compiler.
- SQLMesh: An efficient data transformation and modeling framework that has compiler capabilities built-in with SQLGlot, a Python SQL parser and a transpiler.
- GitHub Actions: This is the simplest version of building a declarative data stack. A deploy.yaml script could be a simple DDS config. GitHub also has an engine that converts and runs it on Docker runners. So, in a way, it’s another engine implementation, and maybe we could base some configs on that.
- Datacoves: The platform helps enterprises solve data analytics challenges with managed dbt, Airflow, and VS Code, adopting best practices. This approach avoids negotiating multiple SaaS contracts and reduces consulting costs without compromising data security.
- Datadex: Serverless and local-first Open Data Platform.
- BigFunctions: A framework to build a governed catalog of powerful BigQuery functions, SQL first approach. Ingesting, advanced data transforms, and serving data on a data app with a single SQL query.
- Hoptimator
- Bruin
- IaSQL: Cloud infrastructure as data in PostgreSQL
SQL Compilers
SQLGlot would be a good integration to parse SQL without executing it, similar to SDF’s integration with Datafusion.
Existing Templating
Beyond dedicated tools, templating engines can solve part of this: Jinja templates, GoLang’s template package, biGENIUS template modules, Apache Velocity, Liquid, and many others.
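As a minimal, stdlib-only illustration of the templating idea (using Python's `string.Template` rather than any of the engines above; the spec fields are made up): a small declarative spec plus a template is enough to generate runnable SQL.

```python
from string import Template

# A declarative spec (the "what") plus a template (the "how to render")
# yields SQL without hand-writing each query.
spec = {"table": "orders", "metric": "amount", "grain": "order_date"}

sql_template = Template(
    "SELECT $grain, SUM($metric) AS total_$metric\n"
    "FROM $table\n"
    "GROUP BY $grain"
)

print(sql_template.substitute(spec))
# SELECT order_date, SUM(amount) AS total_amount
# FROM orders
# GROUP BY order_date
```

Swapping the spec regenerates the query for a different table or metric, which is the core move behind most of the templating engines listed above.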
# Alternatives
# What’s the difference between a Composable Data System
Composable Data Stacks (or Systems) sound very similar, as do Multi-Engine Data Stacks. Are they all the same thing, just with different wording?
# Testing
We need Deterministic Simulation Testing for a DDS, like libSQL does for its SQLite rewrite.
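A toy sketch of the idea (not libSQL's actual harness): drive the system and every injected failure from a single seeded PRNG, so any failing run can be replayed exactly from its seed.

```python
import random

def run_pipeline(seed: int) -> list[str]:
    # All nondeterminism flows from one seeded PRNG, so a failure
    # found with seed N can be replayed exactly by passing seed N again.
    rng = random.Random(seed)
    log = []
    for step in ["ingest", "transform", "serve"]:
        if rng.random() < 0.2:  # injected simulated fault
            log.append(f"{step}:retry")
        log.append(f"{step}:ok")  # step eventually succeeds
    return log

# The same seed always produces the identical trace.
assert run_pipeline(42) == run_pipeline(42)
```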
# Declarative Data Stack ENGINE
The engine is an important part, which I go into in more detail in Designing a Declarative Data Stack: From Theory to Practice | ssp.sh. It’s similar to how Markdown can be the code, while HackMD, GDocs, and others are engines that run it.
Key distinctions: part-3-example-implementation-declarative-data-stack
Docker and Kubernetes are other engines. There are many more, see part-3-example-implementation-declarative-data-stack.
# Further Reading
- My Series
- From Data Engineer to YAML Engineer
- with Modal: Building a cost-effective analytics stack with Modal, dlt, and dbt | Modal Blog
Origin:
Rill | The Rise of the Declarative Data Stack
Created 2024-10-17