Data Engineering: Trends and Predictions (2022-2026) 🔮

By Simon Späti

Here I’ll list my predictions and trends, mainly for myself and to keep a record of what’s happened in the past.

It helps me, and potentially you, to see that the higher-level changes are not that big. We still need Data Engineering Fundamentals, and most things predicted or hyped for next year may swing back a year or two later. So don’t chase the Hype Cycle of Data Engineering; observe the market and decide based on the examination you’ve made for yourself and your company. The Data Engineering Toolkit I wrote is still relevant for most data work, as it covers fundamentals more than just shiny new tools.

For example, in 2024 I am predicting many of the same things as the year before; I guess things are just moving more slowly. Being slower can also save a lot of money, as you don’t need to upgrade with every hype cycle.

Preliminary thoughts below

Again, the notes below are somewhat preliminary and serve as random updates for myself. Also, what are your predictions?

# 2026 Predictions

Preliminary thoughts on the field as of 2025-11-27:
I believe that in 2025, AI was more slop than anything else. Obviously, we need to distinguish: tools like Claude Code are great helpers, but every layer above them is mostly just a wrapper. For 2026, I predict, and hope, that we get less hype and more actual value. Treating AI as a tool rather than the selling criterion, instead of putting it into every product name and sales pitch, will help the whole industry. People will realise a little more that it’s not the silver bullet everyone thought it was.

On that note, we’ll see more guardrails and ways to manage AI agents, as in the old days of Master Data Management, where people approved and stewarded data before it was allowed into production. The problem is that agentic generation is so easy and fast that the verification process can hardly keep up.

# Interview About State and Future

This is an interview-style take on the state of data engineering that answers many questions about current and future trends. It relates heavily to Data Engineering: Trends and Predictions.

The answers are from 2026-01-27.

How did you get into Data Engineering and what do you like about it?

I started with classic Business Intelligence and DWH Developer work in 2003, right after my apprenticeship. It then evolved towards Data Engineering - Python and more programming - and away from SSIS and classic tools. When I lived in Copenhagen and worked at Airbus (Satair), I went to Toulouse for a hackathon to do something with Flightradar24 data with people from around the world. That was kind of the start.

The currently trending data engineering tools are all SaaS products that rely on scalable cloud resources. Innovation is being pushed there, and a lot of capital is flowing into these solutions. You regularly write about the “Open Data Stack”, based on open-source tools, which sits somewhat at the other end of the spectrum. Why are you so passionate about this topic?

Most SaaS products in the data engineering space have an open-source offering. But I especially like the open-source part because I got somewhat burned by GUI drag-and-drop tools, where you could hardly automate anything without investing hours of mouse clicks. Open-source tools are mostly programmed in Python and allow a lot of automation through code, which really appealed to me. If you take dbt, you get it “for free” and can basically replace SSIS, plus you can automate even more. That’s a bit exaggerated, of course; it’s not all as rosy as it sounds. But the fact that you can so quickly stand up an entire stack that you previously had to buy expensively from Oracle, SAP, or Microsoft fascinated me a lot, and even more so today. Although the trend is currently swinging back to the other side.

What significance do the tools from the Open Data Stack have compared to the big cloud solutions? How do you see the future development here? Will everything soon be just Lakehouse / Databricks?

Lakehouse is a bit of a buzzword. I think everything is moving back towards the Data Warehouse, because everyone needs joins, and on plain Data Lakes the speed is mostly too slow. But there are exciting solutions that let you put an OLAP engine on top of S3 data, or others that solve these problems.
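To make that concrete, here is a minimal sketch of the “OLAP engine on S3 data” idea, using DuckDB to query Parquet files in place. The bucket, region, and schema are hypothetical, and it assumes `pip install duckdb`:

```python
import duckdb

con = duckdb.connect()

# httpfs lets DuckDB read directly from S3 over HTTP, no load step needed.
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region = 'eu-central-1';")  # hypothetical region

# Aggregate straight over the Parquet files in the lake.
top_customers = con.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM read_parquet('s3://my-bucket/sales/*.parquet')
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""").fetchall()
print(top_customers)
```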

But the trend is definitely towards consolidation, especially with the Fivetran + dbt merger. I think that’s partly technical, but mainly because the customer otherwise has to talk and negotiate with X vendors, and the integration between them can be difficult or constantly changing.

What I find most exciting behind the whole Lakehouse architecture are the Open Table Formats like Delta, Iceberg, and Hudi. These store the data in an open format that is accessible to everyone. So there is no lock-in, not only for the compute, since you can use DuckDB, Spark, or whatever, but also for the data itself: it is not stored in MongoDB or some other proprietary format that only that database can access. This has many advantages, but also brings disadvantages in speed and access management.
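As a hedged example of that openness: the same Delta table on S3 could be read by Spark, Polars, or, as sketched below, DuckDB via its delta extension. The table path is hypothetical:

```python
import duckdb

con = duckdb.connect()

# The delta extension reads Delta Lake tables: open Parquet files plus a
# transaction log, so no single engine "owns" the data.
con.execute("INSTALL delta; LOAD delta;")
con.execute("INSTALL httpfs; LOAD httpfs;")

events = con.sql("""
    SELECT event_type, COUNT(*) AS cnt
    FROM delta_scan('s3://my-bucket/events_delta')  -- hypothetical table
    GROUP BY event_type
""").fetchall()
print(events)
```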

Have LLMs and Vibe Coding put the Open Data Stack at a disadvantage?

Rather no. What’s important, though, is that you use an Open Data Stack that is based on configuration files. I call these Declarative Data Stacks. Now you can simply automate the entire Data Engineering Lifecycle with AI agents: entire transformations, BI dashboards, ingestions, etc., since these are just config files.
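A minimal sketch of what I mean by a declarative data stack, with an illustrative, made-up spec rather than any real tool’s schema; the point is that an AI agent only has to edit config, not code (assumes PyYAML is installed):

```python
import yaml

# Illustrative pipeline spec; field names are made up for this sketch.
PIPELINE_SPEC = """
pipeline: daily_sales
ingest:
  source: postgres
  table: public.orders
transform:
  - model: stg_orders
    sql: SELECT * FROM raw.orders WHERE NOT is_deleted
serve:
  dashboard: sales_overview
"""

spec = yaml.safe_load(PIPELINE_SPEC)

def run(spec: dict) -> None:
    # A generic runner interprets the spec; swapping tools means editing
    # configuration only, which is exactly what agents are good at generating.
    print(f"Ingesting {spec['ingest']['table']} from {spec['ingest']['source']}")
    for step in spec["transform"]:
        print(f"Building model {step['model']}")
    print(f"Publishing dashboard {spec['serve']['dashboard']}")

run(spec)
```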

If a company introduces some form of the Open Data Stack, it has to operate 3-5 solutions, which are also partly deployed in different ways. Is this still maintainable for small and medium-sized companies?

That’s a good question. I think DevOps should not be underestimated; on my blog, I ask whether DevOps is to data engineering what data engineering was to data science.

In enterprises there are now mostly DevOps teams or Kubernetes experts, so that certainly helps. But for smaller companies without this know-how, especially if the data engineer doesn’t have it, I would go for a managed solution very quickly. Or, if it’s not critical, deploy something simple without Kubernetes.

New trends, architectures, and “in” tools appear with high regularity. Should Data Engineers build their stack modularly and regularly swap out components? Or is it worth staying conservative, since certain concepts that existed before will come back anyway?

Modular is certainly good. You have to do that anyway, because usually more than one tool is in use. You typically end up with an architecture where you use DE Workspaces to decouple the business logic, and some of the dependencies, from the infrastructure and deployment logic.

My advice is always: start with 2-3 tools, find out which one fits best, and then stick with it. Don’t take the newest, but also not the oldest. For example, for the orchestrator, many take Airflow because it’s certainly the most widespread, but I think there are now much better ones like Prefect, Dagster, Kestra, etc.

And yes, the concepts and requirements don’t change. An old-school data modeling session before you jump straight into programming helps enormously, something that is increasingly forgotten nowadays.

Some Open Source solutions are maintained by companies that cover their costs either with a freemium model or a cloud offering. As a customer, it’s good to know that there is not only a community behind a solution but also a company. What dangers do you see in this? Is there a concern that the features you actually need will then be added as proprietary ones, and customers will end up in a dependency despite the freemium model after all?

Yes, that certainly needs to be carefully considered. I always think like this: “Are today’s features of the tool enough for me, or do I depend on all the new features to come?” Meaning, if you choose a tool because it has good features today, not for its future roadmap, I think the danger is small. Even if there’s a license change or other unforeseen circumstances, what was once open source usually stays open source. And if it’s no longer maintained, you at least have a good tool plus the code, meaning you can keep making updates yourself. That is not the case with a purchased tool, whose vendor can also make strategic changes.

So a certain risk always remains, of course, but I think it’s smaller than making yourself completely dependent on a vendor and implementing everything in a proprietary way.

You also contribute to the solution kanton-bern.github.io/hellodata-be/. How did you come to be involved in this initiative?

I was at Bedag Solutions AG for a year because HelloDATA convinced me; it removes exactly the disadvantages mentioned above. It consolidates the most important tools, like dbt, Airflow, Superset, Jupyter Notebook, and many more, into a unified web portal with unified access management.

Does the integrated approach there solve the “Maintainability” problem?

If it’s open source, yes: on the one hand a company, Bedag, maintains it, and on the other hand anyone, meaning the community, can report bugs via Issues or contribute fixes via Pull Requests on GitHub.

But it’s clear that such projects are always complex, and the crucial point is deployment, which takes a lot of time. That’s why it can make a lot of sense for small companies to have this done for them, and possibly build up know-how along the way.

Where do you see this initiative in its lifecycle and where is the journey going?

Of course, AI is changing everything right now, at least the “perception”. I think quite a bit is changing in the background, but maybe less than you’d think. The Data Engineering Lifecycle stays the same: we need to integrate data into a Data Warehouse, aggregate it for fast analyses, and present insights quickly and cleanly, so that we don’t burn 1000 Excel files and hours of our own people’s time :)

But yes, I assume we’ll see many assistants that will support us very strongly, and that projects will increasingly rely on a declarative approach (configs, markdown, open data), because then the AI agents have much more context and can do much more autonomously. It will certainly also need local models first, so that all secrets aren’t uploaded somewhere.

When I see HelloDATA’s approach, it reminds me of https://www.opendesk.eu/de for collaboration or https://www.openstack.org/ for cloud infrastructure. There, established Open Source solutions are also bundled into an overall solution. That seems to be professionally set up, is state-funded, and is meeting with great interest in the context of the current “Digital Sovereignty” trend. Couldn’t something similar emerge from HelloDATA for an open, integrated data platform?

HelloDATA is exactly that; in the Canton of Bern, it is the official tool for working with data. I think these political initiatives that increasingly use Open Source, and even must, are very good. And also that the software is then made Open Source, so that others can benefit from it. In a perfect world, we would all build together on one tool instead of 1000 tools that do the same thing. I agree with you 100% on that.

What would it take to increase its adoption and give it a global reach?

On the one hand, it must become easier to deploy such solutions. It’s very complex, and in case of an error you have to debug through multiple layers. You need a lot of know-how across many areas, like DevOps and data engineering. You have to know every tool and its peculiarities, plus update this software every month, etc.

On the other hand, there also needs to be communication about what these tools can do. Education. I think when someone sees the possibilities and functions, and that it’s open source and managed by Bedag, I can hardly imagine why you would go for something else. But yes, it does require enormous know-how to even understand what HelloDATA and other platforms are.

# 2025 Observations

A couple of preliminary thoughts on the field as of 2025-11-27: one change I saw is more DevOps; see more in The State of DevOps in Data Engineering.

Another is Small Data Stacks powered by CLIs like Rill, dbc, dlt, and DuckDB: small, tiny binaries and CLIs that can plug and play into any stack are super powerful, and a theme I saw more of in 2025 on one side, whereas we also see more unified data platforms on the other side. A small sketch of the plug-and-play idea follows below.
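As a sketch of that idea (assuming `pip install dlt duckdb`; pipeline and dataset names are illustrative), dlt can load plain Python data into a local DuckDB file that any other tool in the stack can then query:

```python
import dlt

rows = [
    {"id": 1, "city": "Bern"},
    {"id": 2, "city": "Copenhagen"},
]

# The "duckdb" destination writes to a local database file next to the script.
pipeline = dlt.pipeline(
    pipeline_name="tiny_stack",
    destination="duckdb",
    dataset_name="demo",
)
load_info = pipeline.run(rows, table_name="cities")
print(load_info)
```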

The pendulum swung back again between bundling vs. unbundling: with the recent data engineering acquisitions, especially around Fivetran, Databricks, and Snowflake, which seem to buy everything they can, the market is shifting back towards consolidated data platforms. With DevOps trends, the desire for a simpler, out-of-the-box working data platform has risen as well: instead of talking to 4-5 vendors, you get it all from one, like back in the days when we used SAP, Oracle, and Microsoft.

To be continued...

# 2024 Random thoughts

Data engineering is still going strong. Stronger than ever, I’d say, especially since every industry focuses on data. AI won’t take our jobs; quite the opposite, as there will be more chaos, and people who know how to model the data and its flow, understand the business requirements, and can deliver high-quality insights will always be needed.

What will change, though, is how we learn and how we use the data. Once we’ve centralized, cleaned, and automated the data, we can do cool stuff with advanced technology. The presentation will be more “fancy,” hopefully more insightful, and easier to understand. Throughout my career, a key task has always been to present the data understandably. Because no matter how fancy your pipeline, your tooling, or even your profound insights, if the presentation is not up to it or the data quality is terrible, no one cares.


On the other hand, I’m super stoked about how far we’ve come tooling-wise. I still vividly remember the times when I was creating the same ETL pipeline, either with PL-/T-SQL or sometimes in bash, at every company all over again. Sometimes, I still feel we’re stuck in the same loop and building the same things to this day. But zooming out, it’s clear that open source has come a long way.

I can bring in Airbyte and have a full-blown ingestion tool, I can use Dagster for orchestration with all the necessary functions baked in (and much better than any one of us alone could build; see the small Dagster sketch below), and I can choose any open-source BI tool to visualize the data. All for “free”.

This shortcut is mind-blowing to me. The fun part for us engineers starts when you want to bring these tools together. But if you are mindful, this is a much better starting point than starting from scratch at every company, where the integration would have to be built anyway. This time, we can use modern languages like Python or Rust instead of bash :).
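To illustrate, here is a minimal Dagster sketch with two software-defined assets; the asset names and data are made up, but the dependency-by-parameter-name wiring is how Dagster works:

```python
from dagster import Definitions, asset

@asset
def raw_orders() -> list[dict]:
    # In a real pipeline, this would ingest from a source system (e.g. Airbyte).
    return [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 13.5}]

@asset
def order_totals(raw_orders: list[dict]) -> float:
    # Dagster infers the dependency on raw_orders from the parameter name.
    return sum(row["amount"] for row in raw_orders)

defs = Definitions(assets=[raw_orders, order_totals])
```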

LinkedIn Post, Tweet

# 2024 Predictions

Predictions as of 2024-01-19.

  • We’ll be back to the fundamentals and patterns we have used over the years in data engineering, whether data modeling or other topics around the Data Engineering Lifecycle such as security and data governance (including the fancy new name: data contracts). Software engineering practices applied to data engineering will increase even more.
  • The Open/Modern Data Stack will be more widely used, especially in Europe and non-US countries. But more parts of the stack will die due to economic challenges.
  • Data Lake Table Formats will be used more heavily as prominent vendors integrate them into their platforms (Snowflake with Iceberg Tables, and Microsoft with the Delta format powering Microsoft Fabric), creating open standards for the ecosystem.
  • Many people will use more data due to the AI hype, as they notice they need regularly updated, good-quality data to make valuable predictions or run any generative AI use case.
  • The declarative trend continues, with more tools from the modern data stack betting on it. For example, Kestra is a full-fledged YAML orchestrator, Rill Developer is a BI tool as code, dlt is data integration as code, and many more are introducing similar models; interestingly, many of them use DuckDB under the hood.
  • GDPR is still ongoing, especially with the Google Analytics 4 change. Many companies have switched to fully anonymized analytics, which is good. I have used GoatCounter for years.
  • The Rust hype in data engineering is less loud, but many rewrites are underway, and the completed ones dominate the market: ruff is a linter that is 10-100x faster than established Python linters, Polars is one of the fastest DataFrame libraries out there, many times faster than pandas, and so on (see the small Polars sketch after this list).
  • The Semantic Layer has yet to find its place in the business intelligence world, but the concept will spread further thanks to driving forces like Cube and dbt.
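The promised Polars sketch, with made-up data, to give a feel for the expression API that the Rust engine executes in parallel:

```python
import polars as pl

df = pl.DataFrame({
    "store": ["a", "a", "b", "b"],
    "amount": [10.0, 20.0, 5.0, 7.5],
})

# Expressions are composed declaratively and run on the multithreaded Rust core.
totals = (
    df.group_by("store")
    .agg(pl.col("amount").sum().alias("total"))
    .sort("total", descending=True)
)
print(totals)
```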

A recent poll by Eckerson Group showcased that 85% of respondents believe we need more data engineers in 2024, not less. Data democratization and AI initiatives make data engineering one of the busiest jobs in tech - even as GenAI boosts productivity. Oliver Molander, LinkedIn

# 2023 Predictions and Anticipations

Anticipations:

  • Data modeling comes back: with the explosion of the MDS, people started to create a mess, and data modeling, and modeling in general, will help on all levels.
  • Also, people in enterprises cannot grasp how to use the MDS.
  • AI and generative AI with ChatGPT: it still needs to find its way into data, but many are focusing on making sense of it (more hype at the moment than anything else).
  • The year of bundling the MDS: startups seeing layoffs and being bundled (Transform into dbt, layoffs across the MDS stack).

Some more predictions/anticipations of mine 🔮:

  • I agree that Rust is becoming more mainstream as a data engineering language
  • Spark will get competition from the Rust-based Ballista / Arrow / DataFusion
  • The Modern Data Stack will be renamed and become better known outside of the data bubble
  • Declarative orchestration will be an acknowledged key component
  • Semantic Layers will gain adoption with dbt and cube
  • DuckDB will be the new standard when working with data
  • Open standards will be a key to all of the above and will consolidate MDS into a few key components

Predictions of Zach Wilson on “My bold 5 year predictions about #dataengineering”:

  • Streaming data eng jobs account for 15-20% of all data eng jobs, but pay the most
  • Rust becomes a mainstream data engineering language
  • Spark starts looking like Hive does now
  • Data engineers will need to grow broadly into either #dataanalytics or #softwareengineering to stay competitive
  • We’ll start seeing blockchain data engineer roles which require a firm understanding of smart contracts, distributed compute, and #machinelearning.

As of 2022-10-16: Twitter / LinkedIn:

  • Declarative approach everywhere: from Kubernetes, where we have infrastructure as code, to orchestration as code and integration as code with low-code tools; it reaches into every discipline.
  • The same underlying approach drives the rise of the Semantic Layer (basically a declarative approach for metrics); see the small sketch after this list.
  • Metadata trends are constantly growing, with tools for data cataloging, data lineage, and data discovery.
  • Rust will be the future of performance-intense applications in data. It will most probably be used the way Spark is today.
  • Vectorized analytics engines such as DuckDB are here for small data, and actual vector databases like Pinecone and Qdrant are supporting the AI wave behind the curtains. AI is always a data game, and not the other way around.
  • With regulations like GDPR and CCPA, privacy and governance are on the rise in companies of every size.
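Since the Semantic Layer shows up in several of these lists, here is a tiny illustrative sketch of the concept (the structure is made up, not Cube’s or dbt’s actual spec): a metric is declared once as data and compiled to SQL on demand, instead of being rewritten in every dashboard:

```python
# Metrics declared once, as data; names and fields are illustrative only.
METRICS = {
    "revenue": {"sql": "SUM(amount)", "table": "orders"},
    "order_count": {"sql": "COUNT(*)", "table": "orders"},
}

def compile_metric(name: str, group_by: str) -> str:
    # Every BI tool asking for "revenue by country" gets the same SQL.
    m = METRICS[name]
    return (
        f"SELECT {group_by}, {m['sql']} AS {name} "
        f"FROM {m['table']} GROUP BY {group_by}"
    )

print(compile_metric("revenue", "country"))
# SELECT country, SUM(amount) AS revenue FROM orders GROUP BY country
```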

As of 2023-05-19. Tweet, LinkedIn and Reddit Discussion.

# Further Reads


Origin: Benjamin Rogojan on LinkedIn: FACT! 2022 is coming to an end. What is the state of data infra?
References: 12 Things You Need to Know to Become a Better Data Engineer in 2023 | Airbyte