SELECT Insights - Bundling with Microsoft Fabric and Orchestration (#2)
A quick note before we start: I changed the name of this newsletter to “SELECT Insights” and changed the structure slightly. I plan to follow this structure for the upcoming editions. If you want to know more, I have more details on newsletter.ssp.sh. On another exciting side note, I also moved the domain from sspaeti.com to ssp.sh. I hope you like it; more on the history in this Tweet. But now, let’s get into this month’s newsletter.
This edition’s SELECT Topic is Data Orchestrators. Data Orchestration is a topic we must recognize, given the recent events and trends in the data ecosystem.
# SELECT Orchestration - Dead or Alive? [4 min read]
I kick things off with a thought-provoking topic, ranging from the intricacies of data engineering to the fascinating charm of everyday life experiences.
Two recent events make this topic interesting. First, Dagster, through its parent company Elementl, raised $33 million in Series B funding. Second, Microsoft announced its new product, Microsoft Fabric, with a big bang. Why is this interesting, you might ask? Because Microsoft is bundling the whole Azure data stack into a single SaaS service, essentially bundling it into Power BI.
Dagster, on the other hand, is bundling the Open Data Stack into a single control plane.
# The Data Orchestration Challenge
Data orchestration is increasingly vital as the hub of the Data Engineering Lifecycle. It combines the core aspects of data integration, transformation, and analytics in a cohesive, manageable workflow. The demand for more sophisticated orchestration tools is growing, as highlighted in the great summary of data orchestration articles I recently read.
However, I’ve observed a concerning trend: the unbundling of the orchestrator across various tools. Instead of allowing the orchestrator to do its job effectively, we are drifting away from this idea and fragmenting the process across different tools and platforms. This is not the direction we should be heading.
There is a lot of hype for now, but there is good news too. Microsoft, for instance, offers excellent no-code, closed-source data platform solutions to kickstart data engineering tasks. They are betting strongly on open standards, with the open-source Data Lake Table Format Delta.io and Spark for computation.
What is still missing, though, is an open-source orchestrator. They have some closed-source solutions, but these remind me more of the bad old days of SSIS, with many inefficiencies.
On the flip side, the Open Data Stack, which integrates the core needs of the data engineering lifecycle, could be our trustworthy, bundled solution.
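To make the idea of the orchestrator as the hub a bit more concrete, here is a minimal sketch in plain Python. It is not how Dagster, Fabric, or any other tool mentioned here works internally; the task names (`ingest`, `transform`, `analyze`) and their toy return values are made up for illustration. It only shows the core job: knowing which step depends on which, and running them in the right order from one place.

```python
from graphlib import TopologicalSorter


# Hypothetical pipeline steps; names and return values are illustrative only.
def ingest(results):
    # Pretend to pull raw records from a source system.
    return [3, 1, 2]


def transform(results):
    # Clean up the ingested records (here: just sort them).
    return sorted(results["ingest"])


def analyze(results):
    # Produce a simple metric from the transformed records.
    return sum(results["transform"])


TASKS = {"ingest": ingest, "transform": transform, "analyze": analyze}

# Each task maps to the set of upstream tasks it depends on.
DEPENDENCIES = {
    "ingest": set(),
    "transform": {"ingest"},
    "analyze": {"transform"},
}


def run_pipeline(tasks, dependencies):
    """Run every task once, after all of its upstream tasks have finished."""
    results = {}
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = tasks[name](results)
        order.append(name)
    return order, results


order, results = run_pipeline(TASKS, DEPENDENCIES)
print(order)
print(results["analyze"])
```

Real orchestrators add scheduling, retries, observability, and integrations on top, but the dependency graph in one central place is the part that gets lost when orchestration is spread across many tools.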
# Is the Orchestrator Dead or Alive?
This brings me to the Symposium: Is the orchestrator dead or alive? This engaging series, initiated by Stephen Bailey, invites authors from different fields to discuss the role of the orchestrator in today’s data ecosystem. Some takeaways:
- The need for speed and simplicity: Current data orchestrators should be faster. They need to onboard use cases more quickly and justify displacing managed services or running them through the orchestrator.
- Data ingestion: This is a process that an orchestrator must own.
- Actions, not meta-narratives: GitHub Actions thrives by focusing on running things rather than on what it should be.
- Integration: The most helpful orchestrator is one plugged into everything.
- Control vs. chaos: The orchestrator embraces chaos, but of a certain kind: ordered chaos, not wasteful chaos.
Finally, I want to address the contention that Data Engineering is a transitional job. Data engineering is not a job but a field. It’s too broad to be confined to one role. Collaboration between BI engineers, DBAs, DevOps, data scientists, and more is necessary to leverage data effectively. After all, as we know, everyone needs data.
Feel free to dive deeper into the data orchestration discussion at Stephen Bailey’s Symposium. Also, check out the latest orchestration comparison by Christophe Blefari - Airflow alternatives Mage and Kestra, and Prefect and Dagster.
# UPDATE Engineering - Latest Updates in Data Engineering Tools and Techniques [5 min read]
In this bustling realm of data engineering, let’s take a look at the recent updates that caught my attention:
- State of the Modern Data Engineering and the Future
- LakeFS presented a comprehensive report, The State of Data Engineering 2023. I highly recommend it to stay on top of the current data engineering landscape. Read more
- Andreessen Horowitz’s overview of The Modern Transactional Stack is an excellent guide to understanding the evolution and potential of transactional systems. Read more
- Airbyte announced the biggest data engineering survey in state-of-data. Read more
- Pedram Navid shares thoughts on The Future of Data. Also, consider checking out some Reddit thoughts on the evolution and trends of data engineering in 2022/23. Read more, Reddit discussion
- Evolution and Trends of Data Engineering 2022/23. Read more
- Data Modeling, the unsung hero of data engineering, got a spotlight in the recent piece by Airbyte. Don’t miss this read if you want to appreciate the fine art of designing your database schema. Read more
- Curious about “Casual data engineering”? Tobias Müller’s take on building a “poor man’s Data Lake in the cloud” using Delta Lake with DuckDB is an intriguing read. Read more
- A leaked Google document surfaced with some daring words: “We Have No Moat, And Neither Does OpenAI”. Simon Willison shared this treasure trove on Twitter, suggesting that it’s worth every bit of our attention. Read more
- The dbt Developer Blog showcased how to build a Kimball dimensional model with dbt. It’s a fundamental tutorial for anyone looking to embrace the Kimball methodology. Read more
- The tech community has been raving about Microsoft Fabric. To fully grasp the potential and implications of this release, here are some insightful articles:
- ChatGPT just became a data scientist’s ally. With support for Postgres and Supabase, it’s taking another leap into our daily data routines. A must-try for data enthusiasts! Read more
- Super Tables: The road to building reliable and discoverable data products. It’s an older concept that competes with materialized views and OBT, and is even said to enable data mesh. Read more, Discuss
- Dive into the future of data analysis with Pandas AI, an innovative approach to data manipulation. Also, explore ‘Sketch’, an initiative to integrate an LLM into Pandas. Read more, Sketch Github
- Christopher White shares insights about scaling Postgres in his blog titled More Memory, More Problems. The blog gives a detailed look at memory-related issues that come with scaling Postgres. Read more
- Kai Waehner’s take on why Kappa Architecture is becoming mainstream and replacing Lambda is worth a read. Read more
- Are you interested in a cost-effective solution for Change Data Capture that provides near real-time data pipeline management? Check out how Kestra and Debezium are integrated to capture database changes without Kafka Connect. Read more
# JOIN Perspectives [2 min read]
Where nerdy pursuits like blogging, neovim, dotfiles, and coding intersect with life’s subtle nuances and diverse worldviews.
- I am exploring the Future of Blogging and leaning heavily towards Markdown. With platforms like Substack and Plaintext Files, the landscape of blogging is undoubtedly changing, bringing simplicity and control back to the writer. See the difference between Markdown vs Rich Text. In case you want to build your own blogging site, check out Public Second Brain with Quartz.
- A thought-provoking article from Basecamp suggested that group chat might not be the most effective communication tool for teams. This got me pondering. Could this actually be the best way to totally stress out your team? Take a look and form your opinion: Group Chat: The Best Way to Totally Stress Out Your Team.
- A cool feature of Arc Browser called Zap lets you remove annoying parts of a website. I showcased before and after for YouTube, Reddit, and Twitter.
- I stumbled upon this enlightening perspective, “Write to Yourself and Yourself Alone”. There’s something invigorating about the idea of writing for oneself, instead of a perceived audience. It’s like a breath of fresh air, reminding us to stay genuine and authentic in our writing. Read more here.
- Do you have a soft spot for Markdown too? If yes, there’s something exciting to share! Evidence has brought together the best of both worlds – SQL with Markdown. Business Intelligence is getting a Markdown makeover! Follow the buzz on Hacker News or explore more in their latest newsletter.
# FETCH Socials - Conversations Stirring Up The Digital World [1 min read]
This is the space where I share intriguing conversations, trending topics, and powerful ideas from around the social media landscape.
- A fascinating discussion on LinkedIn with Andy Grove about Rust, Ballista, RaySQL, and DataFusion is a great place to gain insights into these trending technologies. LinkedIn
- Are you interested in reading and writing Delta without Spark? Here’s how Delta-RS and DuckDB have made it possible. A significant advancement in data engineering, making data handling more efficient. Read more, Discuss on Twitter
- Interesting Data Engineering Podcasts. Reddit
- DBT lays off 15% of their staff. Reddit
- So watched a few videos about Fabric, and started to cry a little… Reddit
- Why I left Rust. Some drama in the Rust community. Hacker News
# SCAN Books - Through The Lens of Written Papers [1 min read]
Every book opens up a new world of insights and perspectives. Here, I’ll share some of my recent reads across a spectrum of topics. Let’s explore these new horizons together!
The Extended Mind by Annie Murphy Paul – An exploration of the intriguing ways our environment influences our thinking processes.
- Discover the surprising ways in which experts think beyond their brains, how harder thinking often leads to fewer results and the controversy over brain-training games and smart pills.
- Explore how we can use tools beyond the brain, such as the Body Scan technique and meditation, to tap into our intuition and sensations, and learn about the significance of the amygdala in our responses to stress.
- Question the nature of intuition and its relationship to extensive training, all in the context of the System 1 and System 2 concepts from Thinking, Fast and Slow (Daniel Kahneman).
All the Wrong Moves by Sasha Chapin – A captivating narrative where the world of chess serves as a backdrop for introspection and self-discovery.
- A story that shows how the love of chess can fully dominate one’s life, being the most beautiful and the worst thing in life at the same time.
If you are still reading, thanks so much. I hope you enjoyed this update! It turned out longer than I expected (as always). Let me know what you want more or less of; I’m happy to take feedback to improve the topics and style to your liking.
Until next time, happy reading and exploring!