The State of DevOps in Data Engineering

Last updated by Simon Späti

Is DevOps to data engineering what data engineering was to data science? What do I mean? In the old days, you set out to do data science but spent 80% of your time on data engineering.

Nowadays, DevOps is often underrated and neglected in data engineering projects, yet you usually end up spending a significant amount of time on it.

DevOps, or GitOps, also known as Infrastructure as Code, is the practice of deploying your data platform in an iterative and code-first way. The data stack gets tested as part of a CI/CD pipeline, monitored through Prometheus metrics, and visualized via dashboards to track system health. All of these are considered standard practices, yet when you start implementing them, they consume the majority of your time.
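
As a minimal, hedged sketch of the monitoring piece (it assumes a Prometheus Pushgateway reachable at localhost:9091 and the prometheus_client Python package; the finance-pipeline job name is illustrative), a pipeline run could push its duration and last-success time so a dashboard can track system health:

```python
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# A dedicated registry so only this pipeline's metrics get pushed.
registry = CollectorRegistry()
duration = Gauge(
    "pipeline_duration_seconds", "Runtime of the last pipeline run", registry=registry
)
last_success = Gauge(
    "pipeline_last_success_unixtime", "Timestamp of the last successful run", registry=registry
)

start = time.time()
# ... run the actual pipeline steps here ...
duration.set(time.time() - start)
last_success.set_to_current_time()

# Push the metrics to the gateway so Prometheus can scrape them and a
# dashboard can visualize pipeline health.
push_to_gateway("localhost:9091", job="finance-pipeline", registry=registry)
```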

I recall 10 years ago when we patiently waited for the data scientist to arrive and solve all our problems. Today, it’s almost the same, except now we wait for a DevOps person who knows how to deploy, manage, and release new versions in a non-disruptive way.

# Learnings

I spent some time thinking and conceptualizing this topic, and here are 8 points I learned:

  1. Separation of concerns is crucial - keeping infrastructure, platform services, and business logic (pipelines) in distinct sections makes maintenance and collaboration easier.
  2. Standardized deployment patterns like the base/overlay structure with Kustomize allow for reusable configurations across environments with minimal environment-specific overrides.
  3. Versioned artifacts with timestamps (e.g., finance-pipeline-20250512123045.tar.gz) create a reliable release process that enables rollbacks and audit trails (see the sketch after this list).
  4. Database migration automation tools, such as Liquibase, can handle schema changes programmatically across environments, thereby reducing manual errors.
  5. Test early, test often - validate data pipelines, infrastructure code, and database migrations separately before integration testing in an isolated environment.
  6. Workspaces separation from infrastructure code enables domain experts, such as data scientists and analysts, to focus on their core competencies while maintaining deployment standards.
  7. CI visibility through lineage diagrams and documented processes enables teams to understand the deployment flow and troubleshoot issues more efficiently.
  8. GitOps, as the single source of truth, means changes occur through Git commits, creating an automatic audit trail and enabling pull request reviews.
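
To make point 3 concrete, here is a minimal sketch of producing a timestamped release artifact; the pipelines/finance directory, the finance-pipeline name, and the dist/ output folder are illustrative, not taken from any of the repositories referenced below:

```python
# Minimal sketch: package a pipeline directory into a timestamped tarball
# so every release is a versioned, immutable artifact.
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def build_artifact(pipeline_dir: str, name: str, out_dir: str = "dist") -> Path:
    """Create e.g. dist/finance-pipeline-20250512123045.tar.gz from pipeline_dir."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    artifact = out_path / f"{name}-{timestamp}.tar.gz"
    with tarfile.open(artifact, "w:gz") as tar:
        # Store the pipeline under a stable top-level folder inside the archive.
        tar.add(pipeline_dir, arcname=name)
    return artifact


if __name__ == "__main__":
    print(build_artifact("pipelines/finance", "finance-pipeline"))
```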

# After a While, it Stabilizes

Most of this effort is concentrated at the start. Once you have settled on a set of tools, you will have your deployment scripts in place, and DevOps is essentially finished, except for version upgrades, until you want to add new tools.

# Starting GitOps is Hard

A question on Bsky:

Selling GitOps to new data projects is hard if they haven’t already been burned. And cleaning up the mess in retrospect is difficult and thankless.

True, and that’s why it’s worth having someone, or a central team, that specializes in this and can complete the work in days rather than weeks, while data engineers and other people focus on their core workload.

# Reference Example for Data Engineering

  • Overall blueprint with best practices, more as an overview: kubernetes-gitops-deployment-blueprint. It is a data engineering reference architecture for Kubernetes-based data platforms using GitOps workflow patterns, and includes infrastructure configs, tenant isolation, database migrations, and observability templates for production deployments.
  • A GitOps infrastructure example using Flux CD, Kestra workflows, and Liquibase migrations, with a complete CI/CD pipeline implementation: gitops-flux-pipeline-showcase
  • Or my data-engineering-devops repo, which I used four years back for my data engineering project: a full-stack data engineering tools and infrastructure setup with druid, kubernetes, minio-s3, notebooks, spark, and superset.

# Infrastructure as Code

Infrastructure as Code (IaC) has evolved beyond bare provisioning to include policy as code, security as code, and compliance as code. It helps to understand tools like Terraform, Pulumi, Helm Charts, and Kubernetes, and how they integrate with specialized data infrastructure.
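
As a hedged illustration of the code-first style (not tied to any specific repo above), the following Pulumi Python sketch declares the namespaces of a small data platform; it assumes the pulumi and pulumi_kubernetes packages plus a configured kubeconfig, and the namespace names are illustrative:

```python
import pulumi
import pulumi_kubernetes as k8s

# One namespace per concern, mirroring the separation of infrastructure,
# platform services, and business logic (pipelines) described above.
NAMESPACES = ["data-infra", "data-platform", "data-pipelines"]

for name in NAMESPACES:
    k8s.core.v1.Namespace(
        name,
        metadata=k8s.meta.v1.ObjectMetaArgs(
            name=name,
            labels={"managed-by": "pulumi", "team": "data-engineering"},
        ),
    )

# Export the namespace names so other stacks or CI checks can reference them.
pulumi.export("namespaces", NAMESPACES)
```

Because the definition lives in Git, changes to it arrive as pull requests and get applied with `pulumi up`, which keeps the audit trail and review flow described in the learnings above.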

# Alternatives

So, with these downsides, what are the alternatives?

You could choose a vendor or Hyperscalers with Closed-Source Data Platforms that include all your tools. However, you are then locked in and can’t extend features on top of closed-source tools, so it’s always a trade-off.

Also, after a while, the DevOps deployment stabilizes, and you need to invest less time.

# Further Reads

  • Declarative Data Stack: the shift toward defining end-to-end stacks in a single YAML file.
  • DataOps combines DevOps practices with data analytics, focusing on improving the quality and reducing the cycle time of data analytics.
  • Shift Left: Implementing security controls earlier in the development lifecycle (“shifting left”) is becoming essential. This includes integrating security scanning into CI/CD pipelines, implementing data access governance as code, and employing techniques such as data masking in non-production environments.
  • Developer Experience: GitOps can increase data teams’ productivity.
  • Dagster git on Azure DevOps

Origin: GitOps
References: Codespaces, Devcontainers (devcontainer.json)
Created 2025-05-14