
Declarative Data Pipelines

Last updated Oct 18, 2024

A declarative data pipeline is characterized by not specifying the execution order. Instead, it allows each step or task to autonomously determine the optimal time and method for execution.

The essence of a declarative approach is describing what a program should accomplish rather than dictating the specific control flow. This approach is a hallmark of Functional Data Engineering, a declarative programming paradigm that contrasts with imperative programming paradigms.

See more in Use declarative pipelining instead of imperative and my recent article on Rill | The Rise of the Declarative Data Stack.

# Functional Programming and Its Relation

Functional programming represents a specific subset of declarative paradigms. Its key property is idempotency: each function can be re-run (restarted) without side effects.
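
To make this concrete, here is a minimal, self-contained sketch (the dict stands in for a partitioned warehouse table, and all names are illustrative): the load overwrites its target partition, so re-running it leaves the data in exactly the same state.

```python
from datetime import date

# A dict keyed by date stands in for a partitioned warehouse table.
warehouse: dict[str, list[dict]] = {}

def load_orders(run_date: date, orders: list[dict]) -> None:
    """Idempotent load: overwrite the day's partition instead of appending.

    Running this twice for the same date leaves the warehouse in the same
    state, which is what makes restarts and backfills safe.
    """
    warehouse[run_date.isoformat()] = list(orders)

load_orders(date(2024, 10, 18), [{"id": 1}, {"id": 2}])
load_orders(date(2024, 10, 18), [{"id": 1}, {"id": 2}])  # re-run: same state
assert warehouse["2024-10-18"] == [{"id": 1}, {"id": 2}]
```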

This idea is highlighted in Functional Data Engineering.

In the context of Data Orchestration Trends - The Shift From Data Pipelines to Data Products:

The role of abstractions in defining Data Products is pivotal. Utilizing higher-level abstractions leads to more explicit and declarative design.

In The Rise of the Data Engineer, Maxime Beauchemin argues that a dedicated function is the optimal higher-level abstraction for defining software: it brings automation, testability, well-defined practices, and openness. In practice, this means declaring upstream dependencies inline within an open, Pythonic API and underpinning each asset with a Python function.
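
For illustration, here is a minimal sketch in the style of Dagster's software-defined assets, one possible implementation of this pattern (asset names and data are made up): upstream dependencies are declared inline simply by naming them as function parameters, and each asset is underpinned by a Python function.

```python
from dagster import asset

@asset
def raw_orders() -> list[dict]:
    # Source asset; in practice this would pull from an API or a database.
    return [{"id": 1, "amount": 120}, {"id": 2, "amount": 80}]

@asset
def order_totals(raw_orders: list[dict]) -> int:
    # Naming `raw_orders` as a parameter is the inline dependency
    # declaration: the orchestrator infers the graph from it.
    return sum(order["amount"] for order in raw_orders)
```

Nothing here specifies when or in which order these functions run; the scheduler derives that from the declared dependencies.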

An imperative pipeline, by contrast, spells out each procedural step and dictates exactly how to proceed. A declarative data pipeline refrains from dictating execution order, leaving each component to determine for itself when and how it should run.
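
Written imperatively, the same toy pipeline hard-codes that control flow (again just a sketch with placeholder logic):

```python
def extract() -> list[dict]:
    return [{"id": 1, "amount": 120}, {"id": 2, "amount": 80}]

def transform(rows: list[dict]) -> int:
    return sum(row["amount"] for row in rows)

def load(total: int) -> None:
    print(f"order total: {total}")

def run_pipeline() -> None:
    # The author dictates the exact sequence; the orchestrator can only
    # replay these steps and knows nothing about the data they produce.
    rows = extract()
    total = transform(rows)
    load(total)

run_pipeline()
```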

# Declarative vs. Imperative

See more on Declarative vs Imperative.

# Some Examples

The declarative trend continues, with more tools from the modern data stack betting on it. For example, Kestra is a full-fledged YAML data orchestrator, Rill Developer is BI-as-code, and dlt is data integration as code; many more follow a similar model, and interestingly, many of them use DuckDB under the hood.
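
As a quick, hedged example of data integration as code, a dlt pipeline in the quick-start style declares its destination and dataset and lets the library handle schema inference and loading into DuckDB (pipeline, dataset, and table names here are illustrative, and the exact API may vary between versions):

```python
import dlt

# Declare where the data should end up; dlt infers the schema and loads it.
pipeline = dlt.pipeline(
    pipeline_name="orders_pipeline",
    destination="duckdb",        # DuckDB as the local destination
    dataset_name="orders_data",
)

rows = [{"id": 1, "amount": 120}, {"id": 2, "amount": 80}]
load_info = pipeline.run(rows, table_name="orders")
print(load_info)
```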

# History of Similar Shifts

Frontend engineering has moved from jQuery, an imperative library that makes it easy to manipulate webpages, to React, which allows writing JavaScript that declares the components you want on the screen.

This concept is closely linked with DevOps, where tools like Kubernetes and Descriptive Configs - YAML have revolutionized deployment practices under the name Infrastructure as Code. The shift went from imperative shell scripts that spin servers up and down to frameworks like Terraform, which let you simply declare the servers you want to exist.

Rill Developer is doing the same for Business Intelligence and dashboards, where all dashboards are simple YAML files.

A similar transformation is needed in data pipelines to enhance stability, accelerate development cycles, and ensure scalability.

Also, Markdown, HTML, and SQL are declarative, which raises the question: is that the reason they are so successful?

# Scope of Concerns

From Introduction to the DataForge Declarative Transformation Framework (DataForge):

Instead of tables in/out, they think in terms of columns in/out.

It builds on pure functions (see Functional Data Engineering).

Matthew Kosovec also made a YouTube video on it.

How does the columns-in/columns-out model compare to a Software-Defined Asset?
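
As a rough illustration of the difference (not DataForge's actual API, just plain Python): a table-level step takes and returns a whole table, while a column-level rule is a pure function from input columns to one output column that a framework can compose, reuse, and track per column.

```python
# Table in/out: one opaque step that rewrites the whole table.
def enrich_orders(orders: list[dict]) -> list[dict]:
    return [{**o, "total": o["price"] * o["qty"]} for o in orders]

# Columns in/out: a pure function per derived column. A framework can
# compose such rules, track lineage per column, and reuse them elsewhere.
def total(price: float, qty: int) -> float:
    return price * qty

orders = [{"price": 10.0, "qty": 3}, {"price": 4.5, "qty": 2}]
enriched = [{**o, "total": total(o["price"], o["qty"])} for o in orders]
```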

# Declarative Programming in Code

Is it possible to use a general-purpose programming language instead of YAML or configuration files? This approach is often referred to as “declarative programming in code”: you use a language like Python, TypeScript, or even a DSL to express the desired state and transformations of your data stack.

For example, in Python, you could define your data stack and infrastructure with a framework like Pulumi or CDK (Cloud Development Kit). These tools allow you to write code that declaratively specifies the resources, pipelines, and transformations while still giving you the flexibility of a full programming language.
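
A minimal Pulumi program in Python, for instance, declares the resources that should exist rather than the steps to create them (this assumes a Pulumi project with the pulumi and pulumi-aws packages and AWS credentials configured; the bucket name is illustrative):

```python
import pulumi
from pulumi_aws import s3

# Declare the desired state; Pulumi diffs it against what already exists
# and plans the create/update/delete steps itself.
raw_zone = s3.Bucket("raw-zone")

pulumi.export("raw_zone_bucket", raw_zone.id)
```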

So, while YAML/configs are simpler and often preferred for purely declarative approaches, a programming language can offer more power and flexibility.

Origin: Benthos.

# More Resources

Questions

What is declarative programming, and how does it relate to declarative pipelines, Functional Programming, or even Functional Data Engineering?


References: Infrastructure as Code
Created 2023-01-25