SELECT Insights - What the Heck Is the Open Data Stack?
This week’s SELECT Topic is the Open Data Stack. It’s something I’ve written about before, but I’d like to share my view on the differences between the Modern Data Stack and what I see behind this term.
# SELECT Open Data Stack - Shaping the Data Engineering Landscape with Open Standards [4 min read]
I kick things off with a thought-provoking topic; this issue ranges from the intricacies of data engineering to the charms of everyday life.
As data engineering evolves, it’s crucial to understand the potential of open-source solutions that adhere to open standards, providing an integral and adaptable framework known as the Open Data Stack. This approach outshines the term ‘Modern Data Stack,’ which has garnered some negativity and created confusion within the industry. So, what exactly is the Open Data Stack, and why should we focus on it?
The Open Data Stack addresses all aspects of the Data Engineering Lifecycle. While it shares the same goal as the Modern Data Stack, the Open Data Stack provides better tool integration due to its open nature, resulting in greater usability for data practitioners. The key word here is open, which implies that the tools or frameworks employed are either open-source or compliant with Open Standards. This openness facilitates tools like Dremio, a data lakehouse platform. While Dremio itself isn’t open source, it operates based on open standards like Apache Iceberg and Apache Arrow, allowing larger organizations seamless integration without vendor lock-in.
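The integration-through-open-standards idea can be shown in miniature. Below is a toy sketch (all names hypothetical, with newline-delimited JSON standing in for real open formats like Apache Arrow or Iceberg): two independently built “tools” interoperate because they agree on a format, not on a vendor API.

```python
import io
import json

# Toy stand-in for an open interchange format (real stacks use formats
# like Apache Arrow or Iceberg): newline-delimited JSON, shared schema.
def ingest_tool(buffer):
    """A hypothetical ingestion tool that writes the open format."""
    for row in [{"user_id": 1, "event": "signup"},
                {"user_id": 2, "event": "login"}]:
        buffer.write(json.dumps(row) + "\n")

def bi_tool(buffer):
    """A hypothetical, separately built BI tool. It can read the data
    because the format -- not a vendor API -- is the contract."""
    counts = {}
    for line in buffer.getvalue().splitlines():
        event = json.loads(line)["event"]
        counts[event] = counts.get(event, 0) + 1
    return counts

lake = io.StringIO()   # the shared "lake" both tools agree on
ingest_tool(lake)
print(bi_tool(lake))   # -> {'signup': 1, 'login': 1}
```

Swap either side for another standards-compliant tool and the other keeps working unchanged; that is the lock-in-free integration the open stack promises.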
Some data practitioners propose alternative names such as ‘ngods (new generation open-source data stack)’, ‘DataStack 2.0’, or ‘DAD Stack’. Nonetheless, the essence remains the same: better, more integrated tooling, used by more people inside every company, that genuinely understands the data it operates on. This contemporary data stack will differ significantly from its predecessors.
It’s important to note the core distinctions at play: ‘old data stack vs. modern data stack’, ‘monolith vs. microservices’, and ‘orchestration vs. choreography’. These terms illustrate the ongoing shift from monolithic, bundled data solutions to a more microservices-driven, unbundled approach. This shift enables data pipelines to operate as ‘microservices on steroids’, improving scalability and alignment across services.
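To make the bundled-vs-unbundled contrast concrete, here is a minimal sketch under stated assumptions (hypothetical functions, plain Python, no real tools): a monolithic pipeline owns extract, transform, and load as one opaque unit, while the unbundled version exposes each lifecycle step so any one of them can be swapped for a dedicated tool.

```python
# Bundled: one opaque function owns the whole pipeline end to end.
def monolith_pipeline(raw_rows):
    cleaned = [r.strip().lower() for r in raw_rows if r.strip()]
    return {"rows_loaded": len(cleaned)}

# Unbundled: each lifecycle step is its own small, swappable unit.
def extract(raw_rows):
    return [r for r in raw_rows if r.strip()]   # keep non-empty rows

def transform(rows):
    return [r.strip().lower() for r in rows]    # normalize values

def load(rows):
    return {"rows_loaded": len(rows)}           # pretend warehouse write

raw = ["  Alice ", "", "BOB"]
# Same result, but each step can now evolve or scale independently.
print(monolith_pipeline(raw) == load(transform(extract(raw))))  # -> True
```

In the unbundled version, `extract` could become Airbyte, `transform` dbt, and the whole composition a Dagster job, without rewriting the neighboring steps.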
The Open Data Stack is accessible and maintained by all its users, fostering an environment where companies can leverage existing, tested solutions instead of re-implementing critical pieces of the stack themselves. The ‘open’ aspect makes the stack embeddable alongside other tools, unlike closed-source services. This level of integration allows you to easily incorporate tools like Airbyte, dbt, Dagster, Superset, and more into your services.
You might wonder which ones to adopt in a world with over 100 tools. This guide introduces the Open Data Stack, emphasizing the advantages of reusing and building upon existing solutions. With this approach, you no longer need to write custom code for every step of the data engineering lifecycle; instead, you can quickly address common challenges, from data extraction and visualization to monitoring and scaling. You’re not reinventing the wheel but building on proven foundations to expedite and streamline your data engineering processes.
Whether you’re just starting to delve into the field or seeking to enhance your existing knowledge, the Open Data Stack provides robust, flexible solutions to help you master data engineering. Check out the GitHub repo Open-Data-Stack to see the Open Data Stack in action and start your journey toward a more effective and efficient data-driven operation.
From the keynote “The End of the Road for the Modern Data Stack You Know” by dbt Labs:
> Better, more integrated tooling, used by more humans inside of every company, that actually understands the data that it is operating on.
>
> This modern data stack—if we still want to call it that!—will be unrecognizable to its former self.
# UPDATE Engineering - Latest Updates in Data Engineering Tools and Techniques [5 min read]
In this bustling realm of data engineering, let’s take a look at the recent updates that caught my attention:
- Rill Data introduced Rill Cloud, promising a blazing fast journey from your Data Lake to Dashboard.
- Ever wondered if you could bypass the Modern Data Stack? Hightouch has an interesting perspective on this.
- Cube continues to make strides in syncing metrics across platforms with their latest Semantic Layer Sync, turning the dream of seamlessly syncing metrics to Superset, Metabase, PowerBI, and others into a reality. And it doesn’t stop there; they’ve also introduced an Orchestration API.
- In the realm of data orchestration, Dagster shared their ambitious master plan for becoming the standard over time. Engage in the ongoing discussion on Reddit.
- Dagster also revisited the concept of the ‘Poor Man’s Data Lake’ with MotherDuck.
- I came across a live example of how Carrefour manages their modern data platform. You can unlock more about Modern Data Product Production and Platform Engineering at Carrefour here.
- Rust lovers, there’s another book out there just for you. Check out Rust For Data.
- Decube shared their journey of migrating from Apache Airflow to Dagster. Find out why they made the shift here.
- On another note, MotherDuck was announced: a hybrid execution model that scales DuckDB from your laptop into the cloud. Check out their announcement, or if you prefer video, here’s a quick summary in 100 seconds.
- Data Lake / Lakehouse enthusiasts, we have discussions about Parquet File Format here, the victory of Iceberg in the table format war here, and a recap on data lake and lakehouse foundations here.
- Delta Lake v3.0 has been announced, with support for the Apache Iceberg and Apache Hudi table formats.
- The journey of Data & Data Engineering — past, present, and future, wonderfully articulated by Zach Wilson here.
- A fascinating three-part series around Write-Audit-Publish (WAP) here.
- Explore the new way of data pipelining with Imperative vs Declarative here.
- For those curious about Microsoft Fabric’s “Direct Lake”, this article clears some misconceptions related to PowerBI Direct Lake.
- Learn about the origins of Apache Arrow and its fit in today’s data landscape in this insightful article by Dremio.
- Databricks is stepping up its game with the introduction of Materialized Views and Streaming Tables for Databricks SQL.
- An awesome example of ROAPI, a Semantic Layer on top of Delta Lake, can be found here.
- As Whatnot’s data stack grows with its rapid expansion, they’ve placed their bet on data contracts. Learn more about their journey here.
- dbt has released a new Semantic Layer specification, which they believe will be the DNA for their vision. Find out more about their vision here.
- For those who are new to data engineering or those who want to solidify their understanding of various terms, Dagster has created a comprehensive Data Engineering Glossary. For example, get the explanation of the term ‘Fan-Out’ here.
# AI Specific
- Explore the rise of the AI Engineer here.
- AI enthusiasts, we have a comprehensive overview of emerging architectures for AI and LLM Applications here.
# JOIN Perspectives [2 min read]
Where nerdy pursuits like blogging, Neovim, dotfiles, and coding intersect with life’s subtle nuances and diverse worldviews.
- The power of writing: A fascinating article discusses why writing is such a powerful tool, suggesting that writing well can feel like a superpower. It inspired me to explore the reasons why I write too.
- I delved into some intriguing ideas on Continuous Notes and Types of Content Online, thinking about what the Future of Blogging might look like.
- In my constant pursuit of improving my Neovim setup, I found a couple of exciting plugins:
- Have you ever undone something and wished you could see what it was? The tzachar/highlight-undo.nvim plugin is a nifty addition that highlights your changes.
- For faster in-file search, flash.nvim is a game-changer, echoing the function of the Vimium Chrome extension: hit f, and every match in your file is within a single key combination’s reach. Check out this YouTube video for a quick rundown.
- Finally, Josh Medeski hosted his Dev Workflow Guide’s Launch Party!, where he opened the curtain on some of his personal workflows. Get a sneak peek into his methods with his Dev Workflow Intro Blog.
# FETCH Socials - Conversations Stirring Up The Digital World [1 min read]
This is the space where I share intriguing conversations, trending topics, and powerful ideas from around the social media landscape.
In this month’s social fetch, here are some posts that caught my attention:
- Marc Garcia showcased the elegance and speed of Polars pipelines, along with the smooth integration with Plotly. You can check out his post here.
- Glossaries can be fantastic tools for understanding and learning about a specific field. In that vein, Dagster has recently published a Data Engineering Glossary that’s a treasure trove of useful terminology. If you’re interested, I’ve also compiled five more data glossaries at Data Engineering Glossaries. Feel free to join the discussion on LinkedIn or Twitter.
- For those following the Data & AI Summit 2023 US, there’s a new update available. You can get all the details in this Tweet. Also, waitingforcode.com provides a neat article on What’s new in Apache Spark 3.4.0, which can be a useful read for those interested in Spark.
# SCAN Books - Through The Lens of Written Papers [1 min read]
Every book opens up a new world of insights and perspectives. Here, I’ll share some of my recent reads across a spectrum of topics. Let’s explore these new horizons together!
- The Good Enough Job (Simone Stolzoff): A compelling book that challenges our conventional thinking about work. Instead of idolizing our jobs or incessantly chasing a better one, Stolzoff advocates for finding satisfaction in a “good enough” job. His ideas offer a refreshing contrast to the pervasive Instagram-era narrative that equates career success with personal fulfillment. A highly recommended read for anyone feeling pressured by the modern-day cult of work.
- James Serra’s upcoming book on Data Architecture: James Serra, an influential figure in the field of data architecture, recently announced that he’s working on a new book titled “Deciphering Data Architecture”. I’m excited to see what he brings to the table. You can follow his progress here.
- The Daily Dad by Ryan Holiday: This book is a sequel to one of my all-time favorite reads, The Daily Stoic (366 Meditations) by Ryan Holiday. In the same tradition of offering daily philosophical advice, this book focuses on the challenges and rewards of parenthood. I haven’t finished it yet, but it’s already proving to be a valuable resource.
If you’ve made it this far, thank you for reading! Your thoughts and feedback are invaluable to me, so don’t hesitate to let me know what you’d like to see more or less of. I’m always open to suggestions on how to improve both the topics covered and the style of this newsletter.
Please note, for this particular edition of the newsletter, I’ve activated analytics. This is to ensure that my emails are reaching your inbox, especially since I’ve recently made changes to the name and sender. Rest assured, this is a one-time measure, and I will disable analytics for subsequent newsletters. Your privacy is important to me.
Until next time, happy reading and exploring!