Data Engineering: Trends and Predictions for 2023 / 2024 🔮
Here I list the trends I see and my predictions. It's also interesting to look back at what I predicted the year before.
For example, as of now, in 2024, I am predicting many of the same things as the previous year; I guess things are just moving slower than expected. Then again, the fact that I still say similar things may mean I was on the right track, but I'll let you be the judge :).
# 2024 Predictions
Predictions as of 2024-01-19.
- We'll go back to the fundamentals and patterns we have used over the years in data engineering, whether that is Data Modeling or other topics around the Data Engineering Lifecycle, such as security and data governance (including its fancy new name, data contracts). Software engineering practices applied to data engineering will increase even more.
- The Open/Modern Data Stack will be more widely used, especially in Europe and other non-US countries. But more parts of the stack will die due to economic challenges.
- Data Lake Table Formats will be used more heavily as prominent vendors integrate them into their platforms (Snowflake with Iceberg Tables, and Microsoft with the Delta format powering Microsoft Fabric), creating Open Standards for the ecosystem.
- Many people will use more data due to the AI hype, as they notice they need regularly updated, good-quality data to make valuable predictions or to support any generative AI use case.
- The declarative trend continues, with more tools from the modern data stack betting on it. For example, Kestra is a full-fledged YAML orchestrator, Rill Developer is BI as code, dlt is data integration as code, and many more introduce models; interestingly, many of them use DuckDB under the hood.
- GDPR remains an ongoing topic, especially with the Google Analytics 4 change. Many companies have switched to fully anonymized analytics, which is good. I have used GoatCounter for years.
- The Rust hype in data engineering is less loud, but many rewrites are underway, and the completed ones dominate the market. ruff is a linter that advertises being 10-100x faster than existing Python linters, Polars is one of the fastest DataFrame libraries out there, often many times faster than Pandas, and so on.
- The Semantic Layer has yet to find its place in the business intelligence world, but the concept will spread further thanks to its leading forces, such as Cube and dbt.
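To make the declarative trend from the list above a bit more concrete, here is a minimal Python sketch. It is not how Kestra, Rill, or dlt work internally; it only illustrates the idea that the pipeline is described as plain data (what should run and what depends on what), while a small generic runner decides the order. The task names and the `run_pipeline` helper are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter

# Declarative part: the pipeline is plain data, not control flow.
# Task names and dependencies are made up for illustration.
PIPELINE = {
    "extract_orders": {"depends_on": []},
    "extract_customers": {"depends_on": []},
    "join_orders_customers": {"depends_on": ["extract_orders", "extract_customers"]},
    "publish_report": {"depends_on": ["join_orders_customers"]},
}


def run_pipeline(spec, run_task):
    """Run every task after its dependencies and return the execution order."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(task["depends_on"]) for name, task in spec.items()}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        run_task(name)
    return order


if __name__ == "__main__":
    run_pipeline(PIPELINE, run_task=lambda name: print(f"running {name}"))
```

The point of the design is that adding or reordering a task only changes the `PIPELINE` data, never the runner.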
A recent poll by Eckerson Group showcased that 85% of respondents believe we need more data engineers in 2024, not less. Data democratization and AI initiatives make data engineering one of the busiest jobs in tech - even as GenAI boosts productivity. Oliver Molander, LinkedIn
# 2023
# Trends
As of 2023-05-19. Tweet, LinkedIn and Reddit Discussion.
2022:
- A declarative approach everywhere (starting with Kubernetes and infrastructure as code, we now have orchestration as code and integration as code with our low-code, and it is spreading into every discipline).
- The same underlying approach shows in the rise of the Semantic Layer (basically a declarative approach for metrics).
- Metadata trends are constantly growing, with tools for data cataloging, data lineage, and data discovery.
- Rust will be the future of performance-intensive applications in data. It will most probably be used the way Spark is today.
- Vectorized engines such as DuckDB are here for small data. And newer vector databases such as Pinecone, Qdrant, etc. are supporting the AI wave behind the curtains. AI is always a data game, and not the other way around.
- With regulations like GDPR and CCPA, privacy and governance are gaining importance in companies of every size.
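The idea of a semantic layer as "a declarative approach for metrics" from the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the way Cube or dbt implement it: the metric definition, the `orders` table, its columns, and the helper functions are all hypothetical, and the sample rows are made up. SQLite from the standard library stands in for a real warehouse.

```python
import sqlite3

# Declarative metric definition: what the metric is, not how to query it.
# The table, columns, and metric name are hypothetical.
REVENUE = {
    "name": "revenue",
    "table": "orders",
    "aggregation": "SUM",
    "column": "amount",
}


def compile_metric(metric, group_by):
    """Turn a metric definition plus one dimension into a SQL string.

    Identifiers are interpolated directly, so this is only acceptable
    for trusted, hand-written definitions like the one above.
    """
    return (
        f"SELECT {group_by}, {metric['aggregation']}({metric['column']}) "
        f"AS {metric['name']} "
        f"FROM {metric['table']} GROUP BY {group_by} ORDER BY {group_by}"
    )


def query_metric(conn, metric, group_by):
    """Run the compiled SQL and return the rows as a list of tuples."""
    return conn.execute(compile_metric(metric, group_by)).fetchall()


def demo_connection():
    """Create an in-memory SQLite database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (country TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("CH", 10.0), ("CH", 5.0), ("DE", 7.5)],
    )
    return conn


if __name__ == "__main__":
    print(query_metric(demo_connection(), REVENUE, group_by="country"))
```

Every consumer that asks for `revenue` gets the same compiled SQL, which is the consistency argument usually made for semantic layers.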
2023:
- Data modeling comes back with the exposure of the MDS: people started to create a mess, and data modeling, and modeling in general, will help on all levels.
- Also, people in enterprises struggle to grasp how to use the MDS.
- AI and generative AI with ChatGPT. It still needs to find its way into data, but many are focusing on it to make sense of it (more hype at the moment than anything else).
- The year of bundling of the MDS. Startups are facing layoffs and getting bundled (Transform into dbt, layoffs across the MDS stack).
# Predictions
Some predictions/anticipations of mine 🔮:
- I agree that Rust is becoming more mainstream as a data engineering language
- Spark will compete with the Rust-based Ballista/Arrow/DataFusion
- The Modern Data Stack will be renamed and become better known outside of the data bubble
- Declarative orchestration will be an acknowledged key component
- Semantic Layers will gain adoption with dbt and Cube
- DuckDB will be the new standard when working with data
- Open standards will be key to all of the above and will consolidate the MDS into a few key components
Predictions of Zach Wilson in “My bold 5 year predictions about #dataengineering”:
- Streaming data eng jobs account for 15-20% of all data eng jobs, but pay the most
- Rust becomes a mainstream data engineering language
- Spark starts looking like Hive does now
- Data engineers will need to grow broadly into either #dataanalytics or #softwareengineering to stay competitive
- We’ll start seeing blockchain data engineer roles which require a firm understanding of smart contracts, distributed compute, and #machinelearning.
See also The State of Data Engineering.
Origin:
Benjamin Rogojan on LinkedIn: FACT! 2022 is coming to an end. What is the state of data infra?
References:
12 Things You Need to Know to Become a Better Data Engineer in 2023 | Airbyte
Created 2022-11-16