
Declarative Data Pipelines

Last updated Oct 18, 2024

A declarative data pipeline is characterized by not specifying the execution order. Instead, it allows each step or task to autonomously determine the optimal time and method for execution.

The essence of a declarative approach is describing what a program should accomplish rather than dictating the specific control flow. This approach is a hallmark of Functional Data Engineering, a declarative programming paradigm that contrasts with imperative programming paradigms.

See more in Use declarative pipelining instead of imperative and my recent article on Rill | The Rise of the Declarative Data Stack.

# Functional Programming and Its Relation

Functional programming represents a specific subset of declarative paradigms. Its key property is idempotency: each function can be re-run (restarted) without side effects.
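
To make this concrete, here is a minimal, self-contained sketch (the dict stands in for a partitioned warehouse table, and all names are illustrative): the load overwrites its target partition, so re-running it leaves the data in exactly the same state.

```python
from datetime import date

# A dict keyed by date stands in for a partitioned warehouse table.
warehouse: dict[str, list[dict]] = {}

def load_orders(run_date: date, orders: list[dict]) -> None:
    """Idempotent load: overwrite the day's partition instead of appending.

    Running this twice for the same date leaves the warehouse in the same
    state, which is what makes restarts and backfills safe.
    """
    warehouse[run_date.isoformat()] = list(orders)

load_orders(date(2024, 10, 18), [{"id": 1}, {"id": 2}])
load_orders(date(2024, 10, 18), [{"id": 1}, {"id": 2}])  # re-run: same state
assert warehouse["2024-10-18"] == [{"id": 1}, {"id": 2}]
```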

This idea is highlighted in Functional Data Engineering.

In the context of Data Orchestration Trends - The Shift From Data Pipelines to Data Products:

The role of abstractions in defining Data Products is pivotal. Utilizing higher-level abstractions leads to more explicit and declarative design.

In The Rise of the Data Engineer, Maxime Beauchemin argues that a dedicated function is the optimal higher-level abstraction for defining software: it brings automation, testability, well-defined practices, and openness. In practice, this means declaring upstream dependencies inline within an open, Pythonic API and underpinning each asset with a Python function.
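
For illustration, here is a minimal sketch in the style of Dagster's software-defined assets, one possible implementation of this pattern (asset names and data are made up): upstream dependencies are declared inline simply by naming them as function parameters, and each asset is underpinned by a Python function.

```python
from dagster import asset

@asset
def raw_orders() -> list[dict]:
    # Source asset; in practice this would pull from an API or a database.
    return [{"id": 1, "amount": 120}, {"id": 2, "amount": 80}]

@asset
def order_totals(raw_orders: list[dict]) -> int:
    # Naming `raw_orders` as a parameter is the inline dependency
    # declaration: the orchestrator infers the graph from it.
    return sum(order["amount"] for order in raw_orders)
```

Nothing here specifies when or in which order these functions run; the scheduler derives that from the declared dependencies.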

An imperative pipeline, by contrast, spells out each procedural step and dictates exactly how to proceed. A declarative data pipeline refrains from dictating execution order, leaving each component to determine for itself when and how it should run.
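
Written imperatively, the same toy pipeline hard-codes that control flow (again just a sketch with placeholder logic):

```python
def extract() -> list[dict]:
    return [{"id": 1, "amount": 120}, {"id": 2, "amount": 80}]

def transform(rows: list[dict]) -> int:
    return sum(row["amount"] for row in rows)

def load(total: int) -> None:
    print(f"order total: {total}")

def run_pipeline() -> None:
    # The author dictates the exact sequence; the orchestrator can only
    # replay these steps and knows nothing about the data they produce.
    rows = extract()
    total = transform(rows)
    load(total)

run_pipeline()
```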

# Declarative vs. Imperative

See more on Declarative vs Imperative.

# Some Examples

The declarative trend continues, with more tools from the modern data stack betting on it. For example, Kestra is a full-fledged YAML data orchestrator, Rill Developer is BI-as-code, and dlt is data integration as code; many more follow a similar model, and interestingly, many of them use DuckDB under the hood.
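
As a quick, hedged example of data integration as code, a dlt pipeline in the quick-start style declares its destination and dataset and lets the library handle schema inference and loading into DuckDB (pipeline, dataset, and table names here are illustrative, and the exact API may vary between versions):

```python
import dlt

# Declare where the data should end up; dlt infers the schema and loads it.
pipeline = dlt.pipeline(
    pipeline_name="orders_pipeline",
    destination="duckdb",        # DuckDB as the local destination
    dataset_name="orders_data",
)

rows = [{"id": 1, "amount": 120}, {"id": 2, "amount": 80}]
load_info = pipeline.run(rows, table_name="orders")
print(load_info)
```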

# History of Similar Shifts

Frontend engineering has moved from jQuery, an imperative library that makes it easy to manipulate webpages, to React, which allows writing JavaScript that declares the components you want on the screen.

This concept is closely linked with DevOps, where tools like Kubernetes and Descriptive Configs - YAML have revolutionized deployment practices under the name Infrastructure as Code. The shift went from imperative shell scripts that spin servers up and down to frameworks like Terraform, which let you simply declare the servers you want to exist.

Rill Developer is doing the same for Business Intelligence and dashboards, where all dashboards are simple YAML files.

A similar transformation is needed in data pipelines to enhance stability, accelerate development cycles, and ensure scalability.

Also, Markdown, HTML, and SQL are declarative, which raises the question: is that the reason they are so successful?

# Scope of Concerns

From Introduction to the DataForge Declarative Transformation Framework (DataForge):

Instead of tables in/out, they think in terms of columns in/out.

It builds on pure functions (see Functional Data Engineering).

Matthew Kosovec also made a YouTube video on it.

How does the columns-in/columns-out model compare to a Software-Defined Asset?
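
As a rough illustration of the difference (not DataForge's actual API, just plain Python): a table-level step takes and returns a whole table, while a column-level rule is a pure function from input columns to one output column that a framework can compose, reuse, and track per column.

```python
# Table in/out: one opaque step that rewrites the whole table.
def enrich_orders(orders: list[dict]) -> list[dict]:
    return [{**o, "total": o["price"] * o["qty"]} for o in orders]

# Columns in/out: a pure function per derived column. A framework can
# compose such rules, track lineage per column, and reuse them elsewhere.
def total(price: float, qty: int) -> float:
    return price * qty

orders = [{"price": 10.0, "qty": 3}, {"price": 4.5, "qty": 2}]
enriched = [{**o, "total": total(o["price"], o["qty"])} for o in orders]
```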

# Declarative Programming in Code

Is it possible to use a general-purpose programming language instead of YAML or configuration files? This approach is often referred to as “declarative programming in code”: you use a language like Python, TypeScript, or even a DSL to express the desired state and transformations of your data stack.

For example, in Python, you could define your data stack and infrastructure with a framework like Pulumi or CDK (Cloud Development Kit). These tools allow you to write code that declaratively specifies the resources, pipelines, and transformations while still giving you the flexibility of a full programming language.
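
A minimal Pulumi program in Python, for instance, declares the resources that should exist rather than the steps to create them (this assumes a Pulumi project with the pulumi and pulumi-aws packages and AWS credentials configured; the bucket name is illustrative):

```python
import pulumi
from pulumi_aws import s3

# Declare the desired state; Pulumi diffs it against what already exists
# and plans the create/update/delete steps itself.
raw_zone = s3.Bucket("raw-zone")

pulumi.export("raw_zone_bucket", raw_zone.id)
```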

So, while YAML/configs are simpler and often preferred for purely declarative approaches, a programming language can offer more power and flexibility.

Origin: Benthos.

# More Resources

Questions

What is declarative programming, and how does it relate to declarative pipelines, Functional Programming, or even Functional Data Engineering?


References: Infrastructure as Code
Created 2023-01-25