<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
    <channel>
        <title>Data Engineering Blog</title>
        <link>https://www.ssp.sh/</link>
        <description>Genuine News About the Data Ecosystem</description>
        <generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>hello@sspaeti.com (Simon Späti)</managingEditor>
            <webMaster>hello@sspaeti.com (Simon Späti)</webMaster><copyright>All rights reserved. Sharing of excerpts with proper attribution is encouraged for non-commercial purposes. For commercial use or republication, please contact hello@sspaeti.com.</copyright><lastBuildDate>Tue, 02 Jun 2026 08:00:08 &#43;0200</lastBuildDate>
            <atom:link href="https://www.ssp.sh/index.xml" rel="self" type="application/rss+xml" />
        <item>
    <title>The Dagster Almanack: From Complexity to Composability</title>
    <link>https://www.ssp.sh/blog/dagster-almanack-open-data-platform/</link>
    <pubDate>Tue, 26 May 2026 08:00:08 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/dagster-almanack-open-data-platform/</guid><enclosure url="https://www.ssp.sh/blog/dagster-almanack-open-data-platform/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>I have read the &ldquo;Poor Charlie&rsquo;s Almanack&rdquo; by Charlie Munger and thought about what it would take to write one for Dagster. A complete guide with all the insights, tips, and some predictions for the data platform engineer, just like an Almanack provides, with practical information for daily life.</p>
<p>My goal is to offer a collection of wisdom, insights, and principles gathered over the years. Giving you an outside view from someone who has used Dagster since back in 2019, used it at enterprise scale but also for my hobby projects (e.g. <a href="https://github.com/ssp-data/practical-data-engineering" target="_blank" rel="noopener noreffer">real-estate project</a>). The piece should give you a holistic view of Dagster&rsquo;s place in the data ecosystem, how to deal with the complexity of data architecture and enterprises, and scaling your data jobs.</p>
<p>This article shows you how orchestrators such as Dagster are built for an open data platform that integrates the full data ecosystem, with the shift to data assets instead of DAGs, reducing complexity and applying data engineering best practices.</p>
<blockquote>
<p>[!note] Definition of Almanack (also spelled &ldquo;almanac&rdquo;)<br>
The term refers to a publication containing a variety of information on a dedicated topic. The modern usage of Almanack, particularly in the context of books like those by Charlie Munger or Naval Ravikant, is often metaphorical. It suggests a collection of wisdom, insights, or principles gathered over time.</p>
</blockquote>
<h2 id="what-is-dagster">What is Dagster</h2>
<p>In late 2018, on a co-working and co-living sabbatical in Bali, I was searching for something to bring the data warehouse out of the drag-and-drop world of SSIS and Oracle reporting and into a code-first, developer-friendly workflow. I looked at <a href="https://github.com/OptimalBI/optimal-data-engine-mssql" target="_blank" rel="noopener noreffer">ODE</a>, BiGenius, TimeXtender, and WhereScape, but found that none of them quite fit my open source and programmatic preferences, so I tried to build something myself but didn&rsquo;t succeed. A year later, back at my 9-to-5 in Copenhagen, I heard Nick Schrock on the <a href="https://www.dataengineeringpodcast.com/dagster-data-applications-episode-104/" target="_blank" rel="noopener noreffer">Data Engineering Podcast</a> describing the motivation and story behind a Python framework called Dagster that did exactly that. I was hooked, and have used Dagster ever since.</p>
<h3 id="early-focus-on-developer-friendliness">Early Focus on Developer Friendliness</h3>
<p>To understand the context of 2019, you must understand that back then, most ETL jobs were triggered with cron or bash scripts, and if there was an error, the only option was to re-run in the next nightly window where production wasn&rsquo;t touched. Dagster, as explained by Nick in the podcast, focused on developer-friendliness, in particular for ETL developers back then, and that focus hasn&rsquo;t changed today for data engineers.</p>
<p>So what is Dagster? The original idea, started in 2018 during a sabbatical after Nick worked at Facebook, came with this definition:</p>
<blockquote>
<p>One of the goals of Dagster has been to provide a tool that <strong>removes the barrier between pipeline development</strong> and pipeline operation, but during this journey, he came to <strong>link the world of data processing with business processes</strong>.</p>
</blockquote>
<p>Today the definition hasn&rsquo;t changed much and reads like this from the <a href="https://docs.dagster.io/" target="_blank" rel="noopener noreffer">Dagster Docs</a>:</p>
<blockquote>
<p>Dagster is a data orchestrator <strong>built for data engineers</strong>, with <strong>integrated</strong> lineage, observability, a declarative programming model, and best-in-class <strong>testability</strong>.</p>
</blockquote>
<p>The initial definition to &ldquo;link data processing with business&rdquo; was the key reason that brought me to it, along with the quality of how the components were implemented. Even more compelling was Nick&rsquo;s visionary outline for 3-5 years ahead: to make the work of data engineers similar to software engineers, and make their daily life easier.</p>
<h2 id="biggest-shift-early-on">Biggest Shift Early On</h2>
<p>This vision led to many new concepts Dagster originally created, which we take for granted in today&rsquo;s data work, and shifted the work into a more reliable and useful toolset for data engineers.</p>
<h3 id="data-aware-orchestration-shift">Data-aware Orchestration Shift</h3>
<p>One of the biggest shifts compared to previous tools and orchestrators was that orchestration was fully data-aware from the very beginning. It tried to understand the heterogeneous complexity that exists at every small to large enterprise company, and thrive in it, supporting the full data engineering lifecycle with its platform and data pipeline capabilities built in.</p>
<p>This gave me a toolkit for building reliable data pipelines out of the box early on, with battle-tested features through its users (open-source) and a quality and thoughtfulness I hadn&rsquo;t seen before. This was personified by Nick and could be vividly felt in the early interviews, but also in the code that the team produced openly on the repo.</p>
<p>For example, backfilling, restartability, or Spark integrations were open on GitHub, to use, and to adapt to your needs. It was a bit like dbt, but instead of modeling your SQL queries, you&rsquo;d model your data pipelines and integrate complex data architecture.</p>
<p>This also meant that moving datasets and integrating dependencies is strongly supported, not an afterthought. In Dagster you can use <a href="https://docs.dagster.io/dagster-basics-tutorial/resources" target="_blank" rel="noopener noreffer">resources</a> to work with Polars, Pandas, Arrow, DuckDB, or anything else to pass datasets, and reference data assets declaratively, even ones that don&rsquo;t exist yet, since Dagster knows how to create them. Compared to Airflow, where for the longest time you could only pass small amounts of data via XCom<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>, this makes code simpler to understand and maintain.</p>
<p>Resources also decouple storage from compute. You could use Apache Spark locally with a single JAR file and in test use a full-blown cluster, or use MinIO as an S3 interface with simple bucket configuration and in production an S3 server from Amazon. Both are <strong>interchangeable without changing pipeline logic, by pure configuration</strong> — that&rsquo;s the beauty of declarative data systems, and Dagster embraces this to this day.</p>
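<p>To make the idea of swapping backends by pure configuration concrete, here is a minimal sketch in plain Python. It deliberately does not use Dagster&rsquo;s resource API; <code>ObjectStore</code>, <code>InMemoryStore</code>, <code>ENVIRONMENTS</code>, <code>build_store</code>, and <code>publish_report</code> are all hypothetical names, and the in-memory store merely stands in for a MinIO or S3 client.</p>

```python
from dataclasses import dataclass, field
from typing import Protocol


class ObjectStore(Protocol):
    """Anything that can store and return bytes under a key."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


@dataclass
class InMemoryStore:
    """Stand-in for a bucket client (hypothetical, for illustration only)."""

    bucket: str
    _objects: dict = field(default_factory=dict)

    def put(self, key: str, data: bytes) -> None:
        self._objects[f"{self.bucket}/{key}"] = data

    def get(self, key: str) -> bytes:
        return self._objects[f"{self.bucket}/{key}"]


# One entry per environment; only this mapping differs between dev and prod.
ENVIRONMENTS = {
    "dev": {"bucket": "local-minio-bucket"},
    "prod": {"bucket": "company-s3-bucket"},
}


def build_store(env: str) -> ObjectStore:
    # Assumption: a real setup would return an S3-backed client for "prod".
    return InMemoryStore(**ENVIRONMENTS[env])


def publish_report(store: ObjectStore, rows: list) -> str:
    """Pipeline logic: knows nothing about which backend it writes to."""
    key = "reports/latest.csv"
    store.put(key, "\n".join(rows).encode("utf-8"))
    return key


store = build_store("dev")
key = publish_report(store, ["id,price", "1,950000"])
print(store.get(key).decode("utf-8"))
```

<p>The point of the sketch is that <code>publish_report</code> never changes: moving from dev to prod only means passing a different environment name to <code>build_store</code>.</p>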
<h3 id="shift-to-data-asset-based-orchestration">Shift to Data-Asset Based Orchestration</h3>
<p>Dagster was very early in shifting from DAGs and operational, task-based orchestration to asset-based orchestration, centered on BI dashboards, tables, reports, and ML models: the artifacts users actually care about. Everything shifted from an imperative approach to a declarative one, as in Kubernetes or React, where you define what the dataset must or should have, and Dagster takes care of the implementation logic, <strong>mapping it to the configured filesystem, compute engine or cluster</strong>.</p>
<p>Now you could quickly describe each pipeline with declarative notations like <code>update: daily</code> or change to <code>update: monthly</code>, or if more advanced, you could define a small <a href="https://docs.dagster.io/guides/automate/sensors" target="_blank" rel="noopener noreffer">sensor</a> logic that checks S3 for updates. Instead of implementing all logic in a data pipeline, we <strong>apply the logic directly to the asset</strong> and closer to the data, which makes it more transparent and integrated into the full <a href="https://dagster.io/glossary/data-lineage" target="_blank" rel="noopener noreffer">data lineage</a>. When updates happen, we have the full graph, but also a leaner and easier-to-maintain setup.</p>
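<p>The small sensor logic mentioned above could look roughly like the following sketch. It is plain Python rather than Dagster&rsquo;s sensor API: <code>SensorResult</code> and <code>check_for_new_objects</code> are hypothetical names, the list of keys stands in for a real S3 bucket listing, and the sketch assumes object keys sort in arrival order.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SensorResult:
    new_keys: list          # objects that should trigger a run
    cursor: Optional[str]   # value to remember for the next evaluation


def check_for_new_objects(listed_keys: list, cursor: Optional[str]) -> SensorResult:
    """Return the keys that sort after the cursor, plus the advanced cursor.

    Assumes keys are named so that lexical order equals arrival order,
    e.g. 'listings/2026-05-26.parquet'.
    """
    ordered = sorted(listed_keys)
    if cursor is None:
        fresh = ordered
    else:
        fresh = [key for key in ordered if key > cursor]
    new_cursor = fresh[-1] if fresh else cursor
    return SensorResult(new_keys=fresh, cursor=new_cursor)


# First evaluation sees one file; the second only reports what arrived since.
first = check_for_new_objects(["listings/2026-05-25.parquet"], cursor=None)
second = check_for_new_objects(
    ["listings/2026-05-25.parquet", "listings/2026-05-26.parquet"],
    cursor=first.cursor,
)
print(second.new_keys)  # ['listings/2026-05-26.parquet']
```

<p>Keeping the cursor outside the function makes each evaluation a pure function of the bucket listing and the last known position, which is what makes such a check easy to test.</p>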
<figure><a target="_blank" href="/blog/dagster-almanack-open-data-platform/dag-to-assets.webp" title="">

</a><figcaption class="image-caption">Going from typical DAG and task-based-oriented (first line) to asset-based DAG (exploded view) | More at <a href="https://www.youtube.com/watch?v=YYeTQJYvqjU&amp;t=408s" target="_blank" rel="noopener noreffer">Declarative Orchestration</a></figcaption>
</figure>
<p>Or zoomed out - you see the focus on the assets, the tables themselves:<br>
<figure><a target="_blank" href="/blog/dagster-almanack-open-data-platform/assets-view.webp" title="">

</a><figcaption class="image-caption">Global asset lineage, zoomed out. You see how the task-based view goes from function to function, from download to serve (each of which potentially hides multiple tables), where assets go from dataset to dataset, giving you much more information. | More at <a href="https://youtu.be/L5kTxCM-tOk?si=4Fh_zc0oTRHckrs8&amp;t=133" target="_blank" rel="noopener noreffer">Dagster Data Orchestration walkthrough</a></figcaption>
</figure></p>
<p>This led later to &ldquo;<strong><a href="https://dagster.io/blog/software-defined-assets" target="_blank" rel="noopener noreffer">Software-Defined Asset</a></strong>&rdquo; and its mental model where you can define an asset pre-runtime and declare connections to upstream or downstream data assets (e.g. real-time housing prices that we fetch from a webpage that does not exist beforehand). Now we can already build and implement our graph and data lineage without having to physically create the dataset first. Software-defined assets use code to define the data assets and are version-controlled through git and inspectable via tooling. This transparency allows anyone in your organization to <strong>understand the canonical set of data assets</strong> and reproduce them at any time, and also lays the groundwork for asset-based orchestration.</p>
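<p>As a rough illustration of that mental model, the toy sketch below declares assets as functions and derives the lineage graph from their parameter names before anything is computed. The <code>asset</code> decorator, <code>ASSETS</code> registry, <code>lineage</code>, and <code>materialize_all</code> are hypothetical and written from scratch here; they are not Dagster&rsquo;s implementation, and the housing prices are hard-coded instead of fetched.</p>

```python
import inspect
from graphlib import TopologicalSorter

ASSETS = {}  # asset name -> function that computes it


def asset(fn):
    """Register a function as an asset; its parameters name upstream assets."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def housing_prices():
    # Placeholder for a web fetch; hard-coded so the sketch runs offline.
    return [950_000, 1_200_000, 780_000]


@asset
def average_price(housing_prices):
    return sum(housing_prices) / len(housing_prices)


@asset
def price_report(average_price):
    return f"Average listing price: {average_price:,.0f}"


def lineage():
    """Dependency graph, available before any asset has been computed."""
    return {name: set(inspect.signature(fn).parameters) for name, fn in ASSETS.items()}


def materialize_all():
    """Compute every asset in dependency order and return the results."""
    graph = lineage()
    results = {}
    for name in TopologicalSorter(graph).static_order():
        upstream = {dep: results[dep] for dep in graph[name]}
        results[name] = ASSETS[name](**upstream)
    return results


print(lineage())
print(materialize_all()["price_report"])  # Average listing price: 976,667
```

<p>Note that <code>lineage()</code> works without running a single asset function, which is the property that lets the graph exist before the datasets do.</p>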
<blockquote>
<p>As Rich Hickey <a href="https://www.youtube.com/watch?v=SxdOUGdseq4" target="_blank" rel="noopener noreffer">said</a>, the aesthetics of a programming language do not matter, only the outcome. Assets play into that fact. In data engineering terms, it&rsquo;s not code but data pipelines and DAGs, but what everyone cares about are their outcomes: the data assets.</p>
</blockquote>
<p>That led to the shift from working with tasks and DAGs to data assets. That&rsquo;s the developer-friendliness built in from day one: develop locally and deploy to test and production, with infrastructure and technical implementation decoupled from business logic.</p>
<p>It made Dagster a bit more complex to start with — you need to know more upfront — but since every enterprise hits these data engineering challenges eventually, it&rsquo;s better to embrace the fact and build for it. The result was an improved developer velocity, but what I noticed, too, was the joy of building reliable data pipelines, equipped with tools that helped me deal with errors, infrastructure, multi-tenancy, data science, big data, and everything thrown at me back then.</p>
<h3 id="not-only-single-purpose">Not Only Single Purpose</h3>
<p>When I first introduced Dagster at my previous company, all of a sudden other teams started to take notice and also wanted Dagster for other work such as provisioning infrastructure with one-click deployment. Especially the cloud platform team needed a tool to automate its scripts to deploy on Kubernetes, OpenShift, and everywhere else, but other teams also had needs to automate. With Python as the programming language for Dagster, reading from an external FTP server, transforming the data, and uploading it somewhere via API were not multi-month projects across different teams. Dagster&rsquo;s flexibility and <a href="https://docs.dagster.io/integrations/libraries" target="_blank" rel="noopener noreffer">integration</a> into other tools and systems were a key strength for most teams.</p>
<p>Observability and monitoring were another addition. Every run was logged in the UI and everyone could see the rich metadata of each pipeline run. And because it was open source and had a rapidly growing community, support and ideas didn&rsquo;t run out.</p>
<h2 id="dealing-with-the-complexity-of-enterprise-systems">Dealing with the Complexity of Enterprise Systems</h2>
<p>If you have worked at any company larger than 10 people, you have noticed pretty fast that you are dealing with multiple source systems, different CRMs, different ERPs, multiple cloud platforms. Most enterprises have all major cloud platforms running in production, whether it is Amazon services, Google GCP, Azure, or any other major platform. You as the data engineer are the one making sure to integrate them, and basically <strong>deal with the complexity that comes with it</strong>.</p>
<h3 id="how-to-reduce-complexity">How to Reduce Complexity?</h3>
<p>First, acknowledge it: heterogeneous data complexity is a fact of the enterprise data lifecycle. Second, lean on tooling with technical integrations and written code that implements each vendor&rsquo;s API, so we don&rsquo;t build everything from scratch repeatedly. Third, work around the actual data assets the users want, not DAGs. With assets we declare outcomes, tests, and dependencies, and the system handles the rest. That cuts dependency hell, an unproductive <a href="https://aws.amazon.com/what-is/sdlc/" target="_blank" rel="noopener noreffer">Software Development Lifecycle</a>, and the fear of change.</p>
<h3 id="composable-is-making-systems-simpler">Composable is Making Systems Simpler</h3>
<p>A more holistic framing comes from <a href="https://www.youtube.com/watch?v=SxdOUGdseq4" target="_blank" rel="noopener noreffer">Simple Made Easy</a> by Rich Hickey, creator of the Clojure functional programming language, where he debates what makes systems complex: state and objects, lots of vars, syntax, inconsistency. His conclusion is that <strong>composable</strong> is what makes systems simpler (like in music for a composer, which is what he created Clojure for): the ability to <strong>assemble, reassemble, and swap individual components</strong> into a flexible whole.</p>
<p>Dagster has exactly that ability, too. It integrates the core principles of <a href="https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a" target="_blank" rel="noopener noreffer">Functional Data Engineering</a> — pure and idempotent tasks, immutable partitions, reproducibility, versioning — directly into the framework.</p>
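<p>A minimal sketch of two of those principles, pure transformations and idempotent partition writes, is shown below in plain Python. The function names, the <code>date=...</code> directory layout, and the JSON file format are assumptions made for illustration, not something Dagster prescribes.</p>

```python
import json
import tempfile
from pathlib import Path


def transform(raw_rows):
    """Pure: output depends only on the input, no clock, no global state."""
    return sorted(
        ({"id": r["id"], "price": int(r["price"])} for r in raw_rows),
        key=lambda r: r["id"],
    )


def write_partition(root: Path, partition_date: str, raw_rows) -> Path:
    """Overwrite the whole partition, so re-runs and backfills are idempotent."""
    target = root / f"date={partition_date}" / "data.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(transform(raw_rows)))
    tmp.replace(target)  # swap the finished file into place in one step
    return target


with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    rows = [{"id": 2, "price": "780000"}, {"id": 1, "price": "950000"}]
    first = write_partition(root, "2026-05-26", rows).read_text()
    second = write_partition(root, "2026-05-26", rows).read_text()
    partition_dirs = sorted(p.name for p in root.iterdir())
    print(first == second)   # True: re-running the partition changes nothing
    print(partition_dirs)    # ['date=2026-05-26']
```

<p>Because a run replaces its partition rather than appending to it, a backfill of one date can be repeated safely without touching any other date.</p>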
<p>State, Hickey points out, is never simple. Unfortunately for us, data engineering is <em>all</em> state: every datum is tied to a timestamp of when it was created, processed, or backfilled. Fortunately, helping us manage that state is what Dagster does: data assets, <a href="https://docs.dagster.io/guides/build/assets/virtual-assets" target="_blank" rel="noopener noreffer">virtual assets</a>, partitions, incremental materialization, and <a href="https://dagster.io/blog/dagster-1-13-octopuss-garden" target="_blank" rel="noopener noreffer">partitioned asset checks</a> that evaluate a specific partition of an upstream asset instead of the whole dataset.</p>
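<p>The idea of checking one partition instead of the whole dataset can be sketched as follows. This is plain Python with a hypothetical <code>check_partition</code> helper and made-up sample data; it does not reproduce Dagster&rsquo;s asset check API.</p>

```python
def check_partition(partitions: dict, partition_key: str) -> dict:
    """Validate one partition only, leaving all other partitions untouched."""
    rows = partitions.get(partition_key, [])
    missing_price = sum(1 for row in rows if row.get("price") is None)
    return {
        "partition": partition_key,
        "rows": len(rows),
        "missing_price": missing_price,
        "passed": len(rows) > 0 and missing_price == 0,
    }


# Hypothetical daily partitions of an upstream asset.
listings = {
    "2026-05-25": [{"id": 1, "price": 950000}],
    "2026-05-26": [{"id": 2, "price": None}, {"id": 3, "price": 780000}],
}

print(check_partition(listings, "2026-05-25")["passed"])  # True
print(check_partition(listings, "2026-05-26")["passed"])  # False
```

<p>A failing check then points at a single date, which narrows down both the debugging and the re-run.</p>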
<p>Rich concludes &ldquo;<strong>Simplicity is a choice</strong>&rdquo;, echoing Leonardo Da Vinci:</p>
<blockquote>
<p>Simplicity is the ultimate sophistication</p>
</blockquote>
<h2 id="open-data-platform">(Open) Data Platform</h2>
<p>But how did the shift and the data engineering principles evolve, and how can we apply these as a unified solution?</p>
<p>With Dagster&rsquo;s data-aware orchestration, shift to assets, separation of concerns, and multi-use, we are automating harder data and infrastructure deployment problems, and Dagster solves the problem of managing complex data environments more holistically. To me, it feels as though Dagster gave me this peace of mind and the toolset to simplify data engineering in a complex environment early on, and is the right tool for the ultimate sophistication for data work.</p>
<p>Dagster&rsquo;s approach is composing a data orchestrator that integrates into any type of data work: from data integration with dlt, to transformation with dbt or just Python logic, to updating BI dashboards, to deploying on Kubernetes, all into a unified system. A fully <strong>open data platform</strong>, making <strong>orchestrating data and its flow <em>simpler</em></strong>.</p>
<p>All of these features, combined with DevOps deployment strategies, make Dagster a data platform tool that:</p>
<ul>
<li>Provides an <a href="https://docs.dagster.io/guides/operate/webserver#dagster-ui-reference" target="_blank" rel="noopener noreffer">integrated UI</a> and control plane for seeing what&rsquo;s going on, unifying all your tools into a single webpage.</li>
<li>Lets you <a href="https://docs.dagster.io/guides/operate/webserver#assets" target="_blank" rel="noopener noreffer">see your data assets</a> in a list with extensive metadata: it&rsquo;s your data catalog showing all tables, BI dashboards, reports, and other data assets.</li>
<li>Creates and keeps the metadata of every pipeline run across all data systems: with end-to-end access comes end-to-end metadata and data lineage, which helps us understand where the data comes from and, in case of an error, where the bad data is.</li>
<li>Has scheduling, sensors, backfills, and other <a href="https://docs.dagster.io/getting-started/concepts" target="_blank" rel="noopener noreffer">concepts</a> for working with data built in.</li>
<li>Supports multi-team isolation through code locations, so different teams can own different parts of the platform without stepping on each other.</li>
</ul>
<p>As Dagster is open source, it gets promoted from a usual data orchestrator to an <strong>open data platform</strong>, with the great advantage of transparency: easily patching an error or integration if you need to integrate an obscure system that only your company has, working on the cutting edge with the community, or getting features from them.</p>
<blockquote>
<p>[!note] The Venn diagram of Dagster<br>
Obviously you can&rsquo;t optimize a data platform in all directions. If you look at Dagster as a Venn diagram, it has these three circles: the right <strong>abstraction, flexibility, and full automation</strong> through programmability.</p>
</blockquote>
<h3 id="control-plane-center-with-all-metadata">Control Plane: Center with All Metadata</h3>
<p>It integrates multiple different teams such as data engineers, platform and infra teams, with data science and business people who want to run their jobs. Feature-wise, it provides <strong>data catalog</strong> and contract capabilities, lets you see data assets you&rsquo;re responsible for and when they last got updated, and shows their downstream and upstream dependencies, all in one real-time observability and monitoring UI.</p>
<p>This is all done through its <strong><a href="https://youtu.be/rB2nNEEIRBE?si=nhnhaSGqIt7fd6pf&amp;t=743" target="_blank" rel="noopener noreffer">control plane</a></strong>, which centers all metadata and unifies different data tools along the lifecycle into one platform, something usually only closed-source data platforms achieve. The control plane serves everything in a <strong>single view</strong>, showing how all processes are working. It&rsquo;s the operational dashboard for your company.</p>
<p>With data orchestration as the heart of data work, with metadata for any process and access to all source systems we&rsquo;re pulling from or intermediate systems, it&rsquo;s in the perfect place to serve as the central metadata store. Think of INFORMATION_SCHEMA, but for overall data work, not only one single database. Only the orchestrator can understand the system and its status this deeply.</p>
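<p>To illustrate the INFORMATION_SCHEMA analogy, the sketch below records one row per run in a small SQLite table and asks for the latest state of every asset across systems. The table name, columns, and sample events are hypothetical and say nothing about how Dagster stores its metadata internally.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE materializations (
        asset_key   TEXT NOT NULL,
        system      TEXT NOT NULL,
        row_count   INTEGER NOT NULL,
        finished_at TEXT NOT NULL
    )
    """
)

# Hypothetical events an orchestrator could record after each run.
events = [
    ("raw_listings", "s3", 1200, "2026-05-25T02:00:00"),
    ("raw_listings", "s3", 1350, "2026-05-26T02:00:00"),
    ("listings_clean", "duckdb", 1340, "2026-05-26T02:05:00"),
    ("price_dashboard", "bi_tool", 12, "2026-05-26T02:10:00"),
]
conn.executemany("INSERT INTO materializations VALUES (?, ?, ?, ?)", events)

# One row per asset: where it lives and when it was last refreshed.
# SQLite-specific: with a single MAX(), the other selected columns are
# taken from the row that holds the maximum.
latest = conn.execute(
    """
    SELECT asset_key, system, row_count, MAX(finished_at) AS last_updated
    FROM materializations
    GROUP BY asset_key
    ORDER BY asset_key
    """
).fetchall()

for row in latest:
    print(row)
```

<p>The query spans object storage, a database, and a BI tool in one result, which is the kind of cross-system view a single database&rsquo;s catalog cannot provide.</p>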
<h3 id="why-open-data-platform-a-system-that-unifies-open-source">Why Open Data Platform: a System that Unifies Open Source</h3>
<p>The <em>open</em> in open data platform is interesting, as it&rsquo;s really hard to build a unifying layer across different stateful data systems, and it&rsquo;s worth highlighting that Dagster achieved just that.</p>
<figure>
<a target="_blank" href="/blog/dagster-almanack-open-data-platform/concept-dagster-open-platform.png" title="Open Data Platform with Dagster as the integrative data orchestrator into different layers of the data stack — built on open standards, with an open data architecture such as object storage (S3 specs), file formats (Parquet, ORC, Avro), open table formats (Iceberg, Delta, Hudi), and data catalogs. | Legend: Dark blue shaded : Part of Dagster (control plane, orchestration, etc.), light blue: dagster managed metadata, white: external state and systems">

</a><figcaption class="image-caption">Open Data Platform with Dagster as the integrative data orchestrator into different layers of the data stack — built on open standards, with an open data architecture such as object storage (S3 specs), file formats (Parquet, ORC, Avro), open table formats (Iceberg, Delta, Hudi), and data catalogs. | Legend: Dark blue shaded : Part of Dagster (control plane, orchestration, etc.), light blue: dagster managed metadata, white: external state and systems</figcaption>
</figure>
<p>If we look at open data stack architecture: unlike cloud data platforms that have the same goal, such as Fabric, Snowflake, Databricks, etc., Dagster builds on <strong>open standards</strong> and is itself an open standard. It&rsquo;s like a protocol in which we declaratively define our data assets (e.g. Software-Defined Assets, environments, resources) that then get automatically executed with composable computes we define in resources, all interchangeable. Even closed-source engines such as a Databricks Spark cluster work really well.</p>
<p>The hardest part is integrated data governance, lineage, access rights, and compute. An orchestration platform like Dagster doesn&rsquo;t give you everything, but you get most of what you need, in an open and composable way.</p>
<h3 id="composable-data-stacks-possible">Composable Data Stacks Possible</h3>
<p>What this architecture allows is what Wes McKinney calls &ldquo;Composable Data Stacks&rdquo; in <a href="https://open.spotify.com/episode/4yEBsHs75QyxnQqK11ghyC?si=2c7861fde2354a52" target="_blank" rel="noopener noreffer">Monday Morning Data Chat</a>, essentially <a href="https://dagster.io/blog/rebundling-the-data-platform" target="_blank" rel="noopener noreffer">rebundling the data platform</a>.</p>
<p>Composable data stacks depend on compute engines. With this architecture, plus Dagster resources or <a href="https://docs.dagster.io/integrations/external-pipelines" target="_blank" rel="noopener noreffer">external code (Dagster Pipes)</a>, we can easily pick and choose what is best suited for the task at hand, not only for different jobs but also depending on different environments. Although there will never be one singular tool for everything, it&rsquo;s necessary that we have a layer of integration, and there&rsquo;s no better place than the orchestration layer that separates execution and technical logic from business and already deals with multiple data environments.</p>
<p>Pete Hunt, the CEO of Dagster, <a href="https://www.linkedin.com/feed/update/urn:li:activity:7447649923132481536?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7447649923132481536%2C7447978059514925056%29&amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287447978059514925056%2Curn%3Ali%3Aactivity%3A7447649923132481536%29" target="_blank" rel="noopener noreffer">said recently</a>:</p>
<blockquote>
<p>Our goal is to make AI as lightweight, accessible and cheap as we can to drive enterprise value, while enabling our customers with the infrastructure they need underneath - the orchestration and <strong>data platform layer where reliability, support, and recovery</strong> actually matter.</p>
</blockquote>
<p>Pete is reaffirming that the data platform layer is the priority, so AI can build on top of a great foundation. In the eyes of a data engineer, it&rsquo;s a dream come true to have an open platform that is declarative to simplify the overall architecture, but also uses AI for the glue code, based on <strong>best practices enforced through Dagster as the open data platform</strong>.</p>
<h2 id="the-right-abstraction-layer-for-an-open-data-platform">The Right Abstraction Layer for an Open Data Platform</h2>
<p>Charlie Munger&rsquo;s Almanack distilled decades of investing wisdom into timeless principles and mental models that compound over time. Here I tried the same for Dagster.</p>
<p>The principles touched on won&rsquo;t be obsolete next year. Data-aware orchestration, declarative assets over imperative DAGs, separation of business logic from infrastructure, and <strong>composable stacks with a single control plane</strong> are all mental models for building data platforms that hold up whether you&rsquo;re running DuckDB on a laptop or Spark across three cloud providers.</p>
<p>Seven years after discovering Dagster on a podcast during my time in Copenhagen, I&rsquo;m still reaching for it whenever a system gets complex enough to need real orchestration. With extensive built-in <a href="https://docs.dagster.io/guides/operate/configuration/advanced-config-types#union-types" target="_blank" rel="noopener noreffer">data quality checks</a>, <a href="https://docs.dagster.io/examples/best-practices" target="_blank" rel="noopener noreffer">best practices</a> like <a href="https://docs.dagster.io/guides/test/unit-testing-assets-and-ops" target="_blank" rel="noopener noreffer">unit-testing</a>, <a href="https://docs.dagster.io/guides/operate/configuration/using-environment-variables-and-secrets#per-environment-configuration" target="_blank" rel="noopener noreffer">local development to prod</a>, separation of business and technical logic, <a href="https://docs.dagster.io/examples/best-practices/shared-module" target="_blank" rel="noopener noreffer">code locations</a>, <a href="https://dagster.io/blog/dsls-to-the-rescue" target="_blank" rel="noopener noreffer">Domain Specific Languages (DSLs)</a> for non-technical people, <a href="https://docs.dagster.io/integrations/external-pipelines" target="_blank" rel="noopener noreffer">pipes</a> and <a href="https://docs.dagster.io/getting-started/concepts#component" target="_blank" rel="noopener noreffer">components</a> to run something in Rust or Go, and <a href="https://docs.dagster.io/getting-started/concepts" target="_blank" rel="noopener noreffer">many more</a>, the Dagster data platform gives you huge leverage building from strong foundations, with the flexibility to change along the way.</p>
<p>It&rsquo;s the abstraction layer for data engineering to solve hard business problems, an open data platform with opinionated design decisions that compound the longer you build on them.</p>
<p>In the next piece, I&rsquo;ll get into what it actually takes to operate this — architecture, deployment, and governance — as a follow-on to these principles.</p>
<h2 id="next-steps">Next Steps</h2>
<p>Find <a href="https://github.com/dagster-io/skills" target="_blank" rel="noopener noreffer">Dagster&rsquo;s official skills</a> for the most up-to-date way of working with Dagster, ready to feed to your AI agent, or read the <a href="https://dagster.io/blog/evaluating-agent-skills" target="_blank" rel="noopener noreffer">blog post</a> for more information. If you are on Airflow, <a href="https://docs.dagster.io/migration/airflow-to-dagster" target="_blank" rel="noopener noreffer">migrate from Airflow</a>, or use <a href="https://docs.dagster.io/integrations/libraries/airlift" target="_blank" rel="noopener noreffer">Airlift</a> to integrate legacy and critical DAGs that still run in Airflow.</p>
<p>Find <a href="https://github.com/dagster-io/awesome-dagster" target="_blank" rel="noopener noreffer">awesome-dagster</a>, and check out further readings of mine at <a href="https://www.ssp.sh/blog/data-integration-as-code-airbyte-dbt-python-dagster/" target="_blank" rel="noopener noreffer">Data Integration as Code: Configuring Airbyte and dbt with Python (Dagster)</a> or <a href="https://www.ssp.sh/blog/data-orchestration-trends/" target="_blank" rel="noopener noreffer">Data Orchestration Trends: The Shift From Data Pipelines to Data Products</a>.</p>
<p>Want to use all of this stress-free without the deployment burden? Use <a href="https://dagster.io/lp/dagster-plus-trial" target="_blank" rel="noopener noreffer">Dagster+</a>. It is a good tradeoff between cloud and OSS: you keep the OSS Dagster foundation but profit from extra features (GitHub integration, cloning, etc.) without needing to set up a DevOps pipeline or fiddle with Kubernetes.</p>
<hr>
<pre class=""><em>Full article published at <a href="https://dagster.io/blog/the-dagster-almanack-from-complexity-to-composability" target="_blank" rel="noopener noreferrer">Dagster.io</a> - written as part of <a href="/services">my services</a></em></pre>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Since version 3.0, released in April 2025, Airflow also supports a declarative approach, partly inspired by Dagster. Airflow now has data-aware orchestration with <a href="https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/assets.html" target="_blank" rel="noopener noreffer">Asset Definitions</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
</item>
<item>
    <title>Plan Mode All the Time, Substrait over SQL, and the End of the DE Role ft. Chris Riccomini</title>
    <link>https://www.ssp.sh/blog/how-to-use-ai-with-de-chris-riccomini/</link>
    <pubDate>Tue, 26 May 2026 08:00:08 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/how-to-use-ai-with-de-chris-riccomini/</guid><enclosure url="https://www.ssp.sh/blog/how-to-use-ai-with-de-chris-riccomini/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>This series interviews (see <a href="/blog/specs-over-vibes-interview-mark-freeman/" rel="">#1 with Mark Freeman</a>) real practitioners to extract the patterns behind how they actually use AI in their data work today. This is the second interview in &lsquo;How to use AI with DE&rsquo;, and this time we have none other than <a href="https://www.linkedin.com/in/riccomini/" target="_blank" rel="noopener noreffer">Chris Riccomini</a>.</p>
<p>Chris has seen the data stack evolve over the years. He thinks AI will soon handle the majority of data engineering work, provided it has the right tooling and access to CLIs and APIs. He also thinks LLMs might be better off speaking not SQL, but a format that represents data transformations. With so much shifting in the AI space right now, with new models and new workflows every week, Chris&rsquo;s perspective, grounded in long experience in the domain, helps you navigate without overreacting.</p>
<p>The article is structured in four parts: <strong>(1)</strong> correctness when working with financial data, <strong>(2)</strong> the Ralph Loop and why AI might be better off speaking something other than SQL, <strong>(3)</strong> vulnerabilities and the case for &ldquo;Okta for Agents,&rdquo; and <strong>(4)</strong> the future of AI — including why &ldquo;data engineer&rdquo; as a distinct role might not survive.</p>
<h2 id="introducing-the-guest-2-chris-riccomini">Introducing the Guest: #2 Chris Riccomini</h2>
<p>Chris Riccomini is a Software Engineer, Author, <a href="https://materializedview.capital/" target="_blank" rel="noopener noreffer">Investor</a>, and Advisor. He previously worked at WePay, LinkedIn, and PayPal, and is the author of <a href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838" target="_blank" rel="noopener noreffer">The Missing README: A Guide for the New Software Engineer</a> and co-author of the second edition of the iconic <a href="https://www.amazon.com/dp/1098119061" target="_blank" rel="noopener noreffer">Designing Data-Intensive Applications</a> book.</p>
<p>Chris has been working in open source throughout his career. He is the author of <a href="https://github.com/apache/samza" target="_blank" rel="noopener noreffer">Apache Samza</a>, a distributed stream processing framework. His current project is SlateDB, an embedded key-value store built on object storage. He is also on <a href="https://projects.apache.org/committee.html?airflow" target="_blank" rel="noopener noreffer">Apache Airflow&rsquo;s PMC</a>.</p>
<h2 id="correctness-of-data-in-the-financial-sector-how-does-this-work-with-ai">Correctness of Data in the Financial Sector: How Does This Work with AI?</h2>
<p>Chris has worked at financial companies where <strong>data correctness</strong> is essential. My first question was &ldquo;How do you see using AI in data when financial services, or most other places, must be correct? How do you mitigate the small errors AI still makes in such a situation?&rdquo; His response:</p>
<blockquote>
<p>It really depends on where in the stack AI is being deployed.</p>
</blockquote>
<h3 id="use-cases-with-different-risk-profiles">Use Cases with Different Risk Profiles</h3>
<p><strong>Risk, fraud and compliance</strong>. The bar is model explainability: you need to know <em>why</em> the model made the decision it did.</p>
<blockquote>
<p>If AI is involved in decisioning around risk and fraud, compliance and “model explainability” comes into play (why the model made the decision it did). This is one of the reasons we really liked random forest models at WePay: you could explain the actual rules that the model had derived and used in order to make a decision.</p>
</blockquote>
<p>The <strong>data engineering context</strong>, compared to a traditional data modeling situation, is interesting:</p>
<blockquote>
<p>If AI is being used in a data engineering context, it seems to me more like a <strong>traditional data modeling situation</strong>. You should be able to define invariants that must always be true for your data. For example, the ledger should always sum up. This is how we managed our data pipelines. If AI is defining data integration pipelines and moving data, the invariants should still hold. Traditional data verification tools will continue to play a role there.</p>
</blockquote>
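<p>A minimal sketch of such an invariant check, assuming a hypothetical double-entry ledger where the entries of every transaction must net to zero. The <code>LedgerEntry</code> type, its field names, and the sign convention are made up for illustration and are not taken from Chris&rsquo;s pipelines:</p>

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LedgerEntry:
    # Hypothetical shape of a ledger row; real schemas will differ.
    transaction_id: str
    account: str
    amount: Decimal  # assumed convention: positive = debit, negative = credit


def unbalanced_transactions(entries):
    """Return the sorted IDs of transactions whose entries do not sum to zero."""
    totals = defaultdict(Decimal)  # Decimal() starts every total at 0
    for entry in entries:
        totals[entry.transaction_id] += entry.amount
    return sorted(tx for tx, total in totals.items() if total != 0)


entries = [
    LedgerEntry("tx-1", "cash", Decimal("100.00")),
    LedgerEntry("tx-1", "revenue", Decimal("-100.00")),
    LedgerEntry("tx-2", "cash", Decimal("50.00")),
    LedgerEntry("tx-2", "revenue", Decimal("-49.99")),
]
print(unbalanced_transactions(entries))  # prints ['tx-2']
```

<p>The point of the sketch is that the check does not care who wrote the pipeline. Whether a human or an agent moved the data, an empty result is the condition that must hold.</p>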
<p>For <strong>data analytics</strong>, this is where most of the fear lives:</p>
<blockquote>
<p>There is a fear that AI will hallucinate and cause a bad decision to be made. <strong>I think this is a reasonable fear, but it’s also a problem we had before AI.</strong> Data in any organization is messy. Semantics aren’t always clear, contracts get broken, and so on. Every company I’ve worked for has had this problem. It’s <strong>not uncommon to find an incorrect query</strong> that’s been rolled up into a weekly ops review with the CEO, for example. This was true before AI.</p>
</blockquote>
<p>So the question is whether AI makes this worse or better. Chris&rsquo;s own view has shifted recently:</p>
<blockquote>
<p>If you’d asked me two years ago, I would have said it was definitely going to get worse. Now, I think it might actually get better, especially if we <strong>pair AI with a human</strong>. The latest LLMs have gotten really good at spotting bugs, inconsistencies, and so on. My personal experience is that I’m both <strong>more productive and more accurate with an AI</strong>.</p>
</blockquote>
<p>I am having a similar experience. On real data engineering projects, when the task is near-term, meaning the scope is clear and it sits inside a framework or rigid structure, AI has been able to implement a great solution since December 2025, when the models got better. That goes a long way, but it still can&rsquo;t work autonomously or do a full project from scratch. It still needs a lot of hand-holding, as it does not understand the business.</p>
<p>Balancing quantity with quality and keeping up with reviews at the speed of generation is also a challenge, especially since the model usually generates many lines of code. For my writing process, where my personal voice plays a bigger role, I find that AI can&rsquo;t help me much yet with the actual writing. It does help with the surrounding tasks (research and brainstorming, though this is also limited for new topics that are not based on existing ideas).</p>
<h3 id="llm-should-speak-substrait-not-sql">LLM Should Speak Substrait, not SQL</h3>
<p>Chris <a href="https://x.com/criccomini/status/1946674377153786327" target="_blank" rel="noopener noreffer">said recently</a> that: &ldquo;<em>Similar to my belief that LLM should speak substrait, not SQL</em>&rdquo;. I asked him to explain this quote and he said:</p>
<blockquote>
<p>This is more of an intuition than something I’ve demonstrated to be true. But if you look at the way we use SQL, it’s actually used in two different ways: <strong>by humans and by machines</strong>. I think both can benefit from <a href="https://substrait.io/" target="_blank" rel="noopener noreffer">Substrait</a> (or some equivalent).</p>
</blockquote>
<p>Chris continues to explain that &ldquo;<em><strong>Substrait is a format that represents data transformations</strong>. It has many operations that SQL has, but unlike SQL, which is purely logical, <strong>Substrait lets you define physical operations</strong> as well. In SQL, you say JOIN, but in Substrait you can say how to join: merge join or hash join? For those with a compilers background, Substrait can express both abstract and concrete syntax trees–intermediate representations (IRs).</em>&rdquo;</p>
<p>This is valuable for LLMs for two reasons:</p>
<blockquote>1. You should be able to <strong>express SQL with fewer tokens</strong> (provided the serialization format for the logical operations is more efficient than English). This should make LLMs slightly cheaper to use, but more importantly it should <strong>keep them from hallucinating quite as much</strong>. (Granted, hallucinations are less of a problem than they used to be.)<br><br>
2. More importantly, LLMs are pretty smart. They should be able to do query optimization really well. And Substrait <strong>gives them that ability–they can express physical operators</strong> (e.g. merge vs. hash), not just logical ones. This should allow them to do <strong>query optimization on the client side</strong>, and pass a physical query plan directly to the DB for execution (provided they have access to the requisite table statistics).</blockquote>
<p>Substrait, as an emerging standard that provides cross-language serialization for relational algebra, is very interesting and something I want to check out, especially the expressiveness compared to SQL.</p>
<blockquote>
<p>[!note] Downside of Substrait: LLMs are less familiar<br>
Of course, there is a ton of SQL on the internet, so it’s not clear that LLMs will be as amenable to working with lesser known formats like Substrait. I think it’s worth experimenting with, though.</p>
</blockquote>
<h2 id="making-ai-output-more-reliable">Making AI Output More Reliable</h2>
<p>What I learned is that the further a task reaches into the future, the more vague, incorrect, or hallucinated the outcome can be. So the more context and code you can provide, the more accurate the result, which is pretty much in line with the idea behind Substrait.</p>
<p>But how do we best work with LLMs? Do we use <code>god mode</code> in OpenClaw or <code>--dangerously-skip-permissions</code> in Claude Code, with no limits, letting it run indefinitely without much more context? I asked Chris if that&rsquo;s also what he observed, and whether he uses <code>plan mode</code> and a declarative approach or pipelines, as these help with context and with collaborating with the AI on a shared output, usually Markdown.</p>
<blockquote>
<p>I was having coffee with a friend of mine, lamenting about this very problem a month or two ago. I was trying to get Codex to do something complex and it just kept falling on its face. My friend told me that you have to live in plan mode all the time. You can’t just ask it to plan the work, then flip to “Implement this plan.” You <strong>need to have the LLM iterate on the plan</strong> for many iterations. Probe its plan, ask it for details, ask it to expand sections, and so on. You need to get to the point where you feel like there’s no possible way the LLM can’t implement the plan incorrectly.</p>
</blockquote>
<blockquote>
<p>[!info] Spec-driven AI work</p>
<p>On the note of &ldquo;needing to have the LLM iterate on the plan for many iterations&rdquo;, Mark Freeman suggested in the <a href="/blog/specs-over-vibes-interview-mark-freeman/" rel="">previous interview</a> the spec-driven development (SDD) approach with the open-source GitHub <a href="https://github.com/github/spec-kit" target="_blank" rel="noopener noreffer">Spec Kit</a>. Check it out or read the previous interview for more context.</p>
</blockquote>
<h3 id="the-ralph-loop-and-managing-context">The Ralph Loop and Managing Context</h3>
<p>After having a plan at hand, the next step is to keep the LLM&rsquo;s working memory lean:</p>
<blockquote>
<p>Once you have a good plan, you <strong>need to manage context</strong>. In some cases, you will need to take your plan and start with a fresh context in the LLM. In other cases, you’ll need to clear the context periodically throughout the work. I use a <a href="https://ghuntley.com/loop/" target="_blank" rel="noopener noreffer">Ralph Loop</a> for such cases<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>.</p>
</blockquote>
<p>I had the exact same experience when working with smaller code bases: insights gained over the iterations are less effective when you add them bit by bit than when you clear the context and start over with all the new key insights provided at the very beginning, steering the model in a more tailored direction early on.</p>
<p>But with the Ralph Loop, which refers to understanding AI beyond surface-level applications, you get new insights that you can then add to your initial prompt, that you wouldn&rsquo;t have gained otherwise, by exploring deeper programmable patterns.</p>
<p>The loop is an iterative, autonomous AI development technique where a bash loop (or plugin) repeatedly prompts an AI agent with the same goal, forcing it to persistently iterate until tasks pass external tests. It forces the AI to work, fail, and fix errors until success, rather than relying on the AI to decide it is finished.</p>
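<p>A minimal sketch of that loop, written in Python instead of bash. The agent command and prompt file in the usage note are placeholders, not a real CLI, and the iteration budget is my own assumption to keep the loop from running forever:</p>

```python
import subprocess
from typing import Callable, Sequence


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def ralph_loop(
    agent_cmd: Sequence[str],
    test_cmd: Sequence[str],
    max_iterations: int = 20,
    run: Callable[[Sequence[str]], int] = run_command,
) -> int:
    """Re-prompt the agent with the same goal until the external tests pass.

    Returns the number of iterations used, or -1 if the budget ran out.
    """
    for iteration in range(1, max_iterations + 1):
        # Every pass is a fresh agent process with the same prompt,
        # so no context from the previous attempt is carried over.
        run(agent_cmd)
        # The tests decide when the work is done, not the agent.
        if run(test_cmd) == 0:
            return iteration
    return -1
```

<p>A call could look like <code>ralph_loop(["my-agent", "--prompt-file", "PROMPT.md"], ["pytest", "-q"])</code>, where <code>my-agent</code> and its flag stand in for whatever agent CLI you use. The <code>run</code> parameter is only there so the loop can be tried out without a real agent.</p>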
<p>On top of that, Chris says &ldquo;<em>You also need to impose a lot of quality gates. As with plan mode, you need to overdo it. &lsquo;Quality&rsquo; is a bit of a squishy term</em>&rdquo;, and he breaks it into three steps:</p>
<blockquote>1. Define what quality is for your use case.<br>
2. Measure the quality.<br>
3. Enforce thresholds (gates) that your LLM must adhere to.</blockquote>
<blockquote>
<p>[!example] Example by Chris with the three steps for assessing quality</p>
<p><em>For example, part of your definition of quality might be test coverage. So that’s step 1. Then you set up a coverage tool for your codebase–step 2. Then, you put the phrase, “You always run tests before commit and keep test coverage above 90%,” in your <a href="http://CLAUDE.md" target="_blank" rel="noopener noreffer">CLAUDE.md</a>. Finally, you install a git commit hook that enforces this rule.</em></p>
<p><em>This is a very rudimentary example, but you get the idea. There are a ton of different things you can measure and monitor for your work. I enumerate many in the post <a href="https://rng.md/posts/code-quality-for-vibe-coded-projects/" target="_blank" rel="noopener noreffer">Code Quality Gates for Vibe-Coded Projects</a>.</em></p>
</blockquote>
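<p>The enforcement part of this example could be sketched as a small Python script used as a git pre-commit hook. This is an illustration under assumptions, not Chris&rsquo;s setup: it assumes <code>pytest</code> and <code>coverage.py</code> are installed, and the helper names are made up:</p>

```python
import subprocess
import sys

COVERAGE_THRESHOLD = 90  # step 1: the definition of quality from the example

# Steps 2 and 3: measure, then enforce. Each gate is a command that must exit with 0.
GATES = [
    ["coverage", "run", "-m", "pytest", "-q"],
    ["coverage", "report", f"--fail-under={COVERAGE_THRESHOLD}"],
]


def first_failing_gate(gates, run=lambda cmd: subprocess.run(cmd).returncode):
    """Run each gate in order and return the first command that fails, else None."""
    for cmd in gates:
        if run(cmd) != 0:
            return cmd
    return None


def main():
    """Return 1 (which blocks the commit) if any gate fails, else 0."""
    failed = first_failing_gate(GATES)
    if failed is not None:
        print("Commit blocked by quality gate:", " ".join(failed), file=sys.stderr)
        return 1
    return 0
```

<p>Saved as an executable <code>.git/hooks/pre-commit</code> with a Python shebang line and a final <code>sys.exit(main())</code>, a non-zero exit code makes git abort the commit. The sentence in <code>CLAUDE.md</code> tells the agent about the rule, while the hook is what actually enforces it.</p>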
<p>This essentially means we as the Prompt Engineers need to make sure that the workflow is correct, that we understand what we need to do, and accordingly adapt the workflow to get better code quality.</p>
<h3 id="what-about-functional-data-engineering-and-executing-deterministically">What about Functional Data Engineering, and Executing Deterministically?</h3>
<p>In related terms, just as AI might hallucinate, it also might generate different outcomes with the same questions and same context. It&rsquo;s non-deterministic. But data engineering works especially well if it&rsquo;s done reproducibly, so we can backfill our data pipelines reliably and trust they will fill the same way.</p>
<p>This also ties into functional data engineering, where jobs run reproducibly and idempotently. I asked Chris what he thinks about this dilemma.</p>
<blockquote>
<p>I’m not as worried about this as I used to be. A lot of <strong>tooling</strong> has popped up or evolved to help address this. <strong>Durable execution frameworks</strong> try to address some of this by papering over the non-determinism to keep replays deterministic <strong>by skipping the previously-successful</strong> parts of the flow. Ditto for traditional workflow orchestration systems like Airflow, Prefect, and Dagster. (Disclaimer: I have some Prefect shares.)</p>
</blockquote>
<h3 id="moving-to-incremental-loads-for-better-determinism">Moving to Incremental Loads for Better Determinism?</h3>
<p>What I found interesting was Chris&rsquo;s next suggestion: moving to smaller data sizes, and therefore to loading incrementally for a more reproducible outcome.</p>
<blockquote>
<p>We can also move from full batch data processing to <strong>incremental batch</strong> data processing to help eschew some non-determinism.</p>
</blockquote>
<p>A concrete example, splitting load by day:</p>
<blockquote>
<p>Imagine, you have a bulk load job that always loads a full table from PostgreSQL into Snowflake, and that job does some LLM-based processing. Every time you re-run it, you’re going to get non-deterministic output. But if you convert it to an incremental job that runs daily and always loads the previous day’s data, then a re-run will only introduce non-determinism into the last day’s load. And presumably you’re re-running that day because something went wrong. In such a case, non-determinism is likely acceptable.</p>
</blockquote>
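<p>To make the daily split concrete, here is a minimal sketch that uses a plain dictionary as a stand-in for the target table. The <code>enrich</code> callable represents the LLM-based processing step. All names are hypothetical, and a real job would overwrite a date partition in the warehouse instead:</p>

```python
from datetime import date, timedelta


def previous_day(today):
    """The partition an incremental daily run is responsible for."""
    return today - timedelta(days=1)


def load_partition(target, source_rows, day, enrich):
    """Idempotently (re)load a single day's partition.

    Only the rows for `day` are overwritten. Every other partition stays
    untouched, so a non-deterministic `enrich` step can only change that one day.
    """
    target[day] = [enrich(row) for row in source_rows if row["day"] == day]
    return target
```

<p>Re-running <code>load_partition</code> for one day replaces exactly that day&rsquo;s rows. The partitions loaded on earlier days keep the output they had, even if <code>enrich</code> would answer differently today.</p>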
<p>This is great thinking and shows it&rsquo;s all about the use case and the risk appetite. If a re-run only reloads a single day instead of the full table, the risk of non-determinism in that one day might be acceptable if you get great insights from the LLM. The alternative would be doing that work manually, which means either not doing it at all or doing it very late, when the insight is less valuable.</p>
<p>Side note: the engineering effort of implementing incremental loads might be much higher than for a full load, as you need to add clear state management, check what has already run, and manage that state yourself, versus just running everything. But this point almost certainly comes up in any case, whether you use AI or not, so we can factor it out in this scenario.</p>
<h2 id="how-to-prevent-vulnerabilities-and-work-securely-with-ai-agents">How to Prevent Vulnerabilities, and Work Securely with AI Agents?</h2>
<p>Another hot topic with agents is security concerns around vulnerabilities. I asked Chris how he sees that domain in combination with generative AI, and also if we need &ldquo;Okta for Agents&rdquo;, as Maxime Beauchemin <a href="https://www.linkedin.com/posts/maximebeauchemin_i-finally-got-to-around-to-test-driving-clawdbot-activity-7423272818848550912-FSCn?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAABkA2pgBYM4xDO0z2ChYuxFhBfu4h7jp4Lo" target="_blank" rel="noopener noreffer">called</a> it.</p>
<p>His view splits cleanly in two:</p>
<blockquote>
<p>On the one hand, it’s a nightmare to manage these agents in the enterprise. On the other hand, they’re phenomenal at detecting compliance violations: leaked credentials, leaked PII, and so on.</p>
</blockquote>
<p>He&rsquo;d been thinking about an Okta-for-agents independently:</p>
<blockquote>
<p>It’s funny you mention Maxime’s “Okta for agents” comment. I didn’t see it, but I’ve been saying the exact same thing. It seems patently obvious to me. What’s unclear is whether Okta is Okta for agents, or whether another company (or companies) will take its place. Innovator’s dilemma and all. Okta’s certainly giving it a good try–their homepage is covered in it now.</p>
</blockquote>
<h3 id="skills-marketplaces-and-mcps">Skills, Marketplaces and MCPs</h3>
<p>He continues and says that it&rsquo;s the wild west right now. You can load skills and even arbitrary skills from a marketplace and load any kind of text files without knowing if there&rsquo;s a vulnerability.</p>
<p>There are examples where hidden <a href="https://x.com/ZackKorman/status/2018386838101086446" target="_blank" rel="noopener noreffer">code injection</a> is done in a repo:<br>
<figure>
<a target="_blank" href="/blog/how-to-use-ai-with-de-chris-riccomini/security-ingection.png" title="A hidden comment that is commented out below | source">

</a><figcaption class="image-caption">A hidden comment that is commented out below |  <a href="https://x.com/ZackKorman/status/2018386838101086446" target="_blank" rel="noopener noreffer">source</a></figcaption>
</figure></p>
<p>Chris continues on the lack of guardrails:</p>
<blockquote>
<p>But yes, we absolutely <strong>need lineage, auditability, RBAC, ABAC, and so on</strong>. It’s the wild west right now (as far as I know, anyway). This is one of the reasons I was so outspoken about MCP when it first came out. I was very <strong>disappointed in their (lack of) security model</strong>. It’s the most important part, and it was completely lacking. It was rather shocking to me given Anthropic’s focus on the enterprise. More recently, they’ve added better support, though, so credit where credit is due.</p>
</blockquote>
<h2 id="future-with-ai-agents">Future with AI Agents</h2>
<p>On the future of AI, especially for data engineering, we discussed three interesting topics: what agents are doing well today, the role of data engineering itself, and what programming language to use.</p>
<h3 id="what-agents-already-do-well-today">What Agents Already Do Well Today</h3>
<p>I asked if we get self-healing data pipelines, so we do not need to get up at night, meaning AI does not only detect errors, but also analyses, debugs, pushes a commit to the repo and re-runs the pipeline autonomously?</p>
<blockquote>
<p>I’ll be frank: I think AI will do the majority of the data engineering work in the future. I think we’re already at a point where it can; the tooling and practices just haven’t yet adapted.</p>
</blockquote>
<p>This is an interesting point regarding tooling (and practices) not being adapted yet. Jeff Dean, Chief Scientist at Google DeepMind, <a href="https://www.youtube.com/watch?v=g8BuAtM3fp4" target="_blank" rel="noopener noreffer">made the point</a> that Amdahl&rsquo;s Law still applies, and that we need to re-engineer our tools as they were designed for human speed. If AI agents can run 50x faster, but the tools don&rsquo;t, then we do not get an overall improvement.</p>
<p>On the other hand, what agents already do well today:</p>
<blockquote>
<p>Agents are already excellent at inspecting failed Github actions, failed workflows, running SQL queries, writing Python–all the things data engineers do. As they get plugged into monitoring systems and begin to auto-remediate, the grunt work of data engineering will get taken over by AI.</p>
</blockquote>
<p>And building new pipelines, given the right access:</p>
<blockquote>
<p>Agents are also fully capable of adding new data pipelines, provided they have access to infrastructure to do so. If you stand up a fresh Airflow and add connections for all your systems, I’d wager an Agent can set up as many pipelines as you need on it. And if you define the security and compliance policies it should follow, it’ll do so.</p>
</blockquote>
<p>Here, in my opinion, it is key that we use declarative and config-driven stacks, as Kubernetes, React, and most modern tooling do.</p>
<h3 id="data-engineering-role-going-away-or-unified">Data Engineering Role Going Away, or Unified?</h3>
<p>Continuing on the thread of the future of AI, Chris talks about how shifting left is a movement we had for a while, and where this leaves data engineers as a role:</p>
<blockquote>
<p>I’m not sure where that leaves data engineers. The “shift left” movement has been going on for a while. I can imagine a world in <strong>which “data engineer” as a distinct role goes away</strong>, or is folded back into a more generic data role that includes <strong>data engineering, machine learning, data analysis, and so on</strong>.</p>
</blockquote>
<p>He&rsquo;s been pushing this for <a href="https://materializedview.io/p/merge-analytics-and-data-engineers" target="_blank" rel="noopener noreffer">quite some time</a>:</p>
<blockquote>
<p>We over-specialized the data space. It might have been necessary, but it isn’t now. So perhaps we’ll see “data” be a single role that encompasses not just data engineering, but analysis and machine learning/AI as well. I think that would be healthy.</p>
</blockquote>
<h3 id="should-we-let-the-ai-agent-choose-the-language">Should We Let the AI Agent Choose the Language?</h3>
<p>We have heard people (e.g. Wes McKinney) say that they choose programming languages, in this case Go over Python, based on what works best for the AI, not on what the human prefers. He calls it <a href="https://wesmckinney.com/blog/agent-ergonomics/" target="_blank" rel="noopener noreffer">From Human Ergonomics to Agent Ergonomics</a>. It is interesting that Wes, the creator of Pandas and author of Python for Data Analysis (stay tuned, he will be the next guest for this interview series), chose Go. The reasons are its fast compile-test cycles and painless software distribution. Don&rsquo;t worry, Python will not go away<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>.</p>
<p>Similarly, Ladybird is <a href="https://ladybird.org/posts/adopting-rust/" target="_blank" rel="noopener noreffer">rewriting</a> part of its browser from scratch in Rust with agents, in two weeks. So Chris, do you think that choosing the programming language will depend on the ergonomics of the agents in the future (or now already)?</p>
<blockquote>
<p>In a word: yes. I have been pretty enthralled with the <strong>software factory concept</strong> lately. It’s how I do a lot of my development now. <strong>In that world, I just don’t care about the language</strong> my software is written in.</p>
</blockquote>
<p>What he optimises for instead:</p>
<blockquote>
<p>I care more about the characteristics of the output: its <strong>performance, stability, and cost to build</strong> (i.e. tokens). Languages that lend themselves to faster, cheaper, more stable LLM output are going to win.</p>
</blockquote>
<p>These are very interesting thoughts, and I did a project fully vibe coded in Go to experience the <strong>cost-as-tokens</strong> as well. The codebase stayed small (apart from the tests), and therefore I could go much further with the given tokens compared to other projects where I used the same Claude Pro plan and ran out.</p>
<p>Go is a language I don&rsquo;t usually program in, and it is quite astonishing how far you get. But I also noticed a limitation as the lines of code and the size of the project grew, especially when adding new features that would break working features.</p>
<blockquote>
<p>[!note] Side note by Chris: No more proofs for a programming language required</p>
<p>I used to think that this would pull us in the direction of languages that have formal methods properties. Proofs, model checkers, and so on. I no longer think that’s the case, though. I think LLMs have gotten good enough at writing code that proofs are no longer required. We can use normal testing strategies. I wrote more about this in <a href="https://rng.md/posts/the-waymo-rule-for-ai-generated-code/" target="_blank" rel="noopener noreffer">The Waymo Rule for AI-Generated Code</a>.</p>
</blockquote>
<h3 id="does-ai-take-away-the-learnings">Does AI Take Away the Learnings?</h3>
<p>The last question I asked Chris was about the danger of not learning new things, of getting overwhelmed by constant stimulation, and even addicted. In a world where we only prompt, where we don&rsquo;t experience hitting a wall and then figuring it out, does that prevent us from learning new things? Are we just cruising on auto-pilot?</p>
<p>Chris mentions that it depends on how we use it and brings an example:</p>
<blockquote>
<p>One could argue a calculator makes us learn less math; indeed, I keep an eye on that with my middle school-aged kids. But it’s also a tool that lets us do far more complex math without worrying about carrying the one or shifting the decimal, so to speak.</p>
</blockquote>
<p>But you can also learn <em>with</em> AI, he argues:</p>
<blockquote>
<p>I have had instances where I learn a ton from AI. A concrete example: <a href="https://github.com/slatedb/slatedb" target="_blank" rel="noopener noreffer">SlateDB</a>’s language bindings. I built them all from scratch (or rather, AI generated them all from scratch). When I started, I knew nothing about bindings. As I <strong>worked with AI to steer it and iterate on the code, I learned</strong> about cbindgen, UniFFI, foreign function interfaces (FFIs), and so on. It’s a phenomenal tool for picking up something from scratch. I can ask it questions, learn from it, and so on.</p>
</blockquote>
<p>Again, did he actually learn as much (from scratch, with AI) as he would have building it himself?</p>
<blockquote>
<p>Almost certainly not, I think <strong>I would have learned a lot more [without AI]. But I also wouldn’t have done the work</strong>. Writing four bindings (Node, Java, Python, and Go) from scratch is just too much work. I don’t have the time for it. Especially since I have never written a line of Go, and I know next to nothing about the Node ecosystem. So in the real world, I think I came out ahead.</p>
</blockquote>
<h4 id="do-we-learn-fewer-things">Do We Learn Fewer Things?</h4>
<p>Let&rsquo;s finish with a question: Are we learning <em>fewer</em> or just <em>different</em> things? Something I&rsquo;ve wrestled with for a while. Chris&rsquo;s answer is:</p>
<blockquote>
<p>Perhaps the things we are no longer learning don’t really matter anymore. Going back to the calculator example, I couldn’t really tell you in detail how a calculator physically works. If you took it apart and showed me its circuitry, I’d be unable to tell you anything about it, really. Does that matter? I’m not so sure.</p>
</blockquote>
<p>I think we are all in this together, and nobody can really predict the future. I experienced both sides: when I rely too much on the assistant, I get lazier and do less <em>deep thinking</em>. When I course-corrected and only used it for dedicated tasks, I noticed that my abilities were improving again, or rather, my gut feeling got better again, and I had more confidence in the task at hand. But also, as Chris said, if I know it&rsquo;s going to be a hard task, I can do much more because I deliberately use AI for certain parts to actually finish it. So the future will tell.</p>
<h2 id="next-interview">Next Interview</h2>
<p>I hope you enjoyed this interview with Chris. Huge thanks to Chris for taking the time to speak with me and for sharing his experience with all of us. Follow him on <a href="https://www.linkedin.com/in/riccomini/" target="_blank" rel="noopener noreffer">LinkedIn</a>, <a href="https://x.com/criccomini" target="_blank" rel="noopener noreffer">X/Twitter</a> or <a href="https://bsky.app/profile/chris.blue" target="_blank" rel="noopener noreffer">Bluesky</a>, and read <a href="https://www.amazon.com/s?i=stripbooks&amp;rh=p_27%3AChris%2BRiccomini&amp;s=relevancerank&amp;text=Chris&#43;Riccomini" target="_blank" rel="noopener noreffer">his two amazing books</a>. Also follow his newsletter: the new one is at <a href="https://rng.md/" target="_blank" rel="noopener noreffer">rng.md</a>, and his old one, <a href="https://materializedview.io/" target="_blank" rel="noopener noreffer">Materialized View</a>, has a wealth of insights.</p>
<p>There are three more interviews already lined up with great guests, one of them being Wes McKinney, as mentioned. So please share feedback, questions you might want to ask, or just your experience of working with AI in the data space. We&rsquo;re all in this together, figuring it all out. The more we can learn from each other about what&rsquo;s important, and maybe also what&rsquo;s not, the better.</p>
<p>So stay tuned for the next interview.</p>
<hr>
<pre class=""><em>Full article published at <a href="https://motherduck.com/blog/cost-as-tokens-substrait-llm-chris-riccomini/" target="_blank" rel="noopener noreferrer">MotherDuck.com</a> - written as part of <a href="/services">my services</a></em></pre>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Chris hints at his recent <a href="https://rng.md/posts/wiggum-loop/" target="_blank" rel="noopener noreffer">Wiggum Loop</a> post.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>From <a href="https://wesmckinney.com/blog/agent-ergonomics/" target="_blank" rel="noopener noreffer">https://wesmckinney.com/blog/agent-ergonomics/</a>: Python will remain essential as an exploratory computing layer for humans and agents to collaborate on data analysis, research, and data visualization. Notebook layers (Jupyter, Marimo, and so forth) and hybrid IDEs (like Positron, where I’ve been contributing in the last couple of years) will increasingly focus on catering to the human-in-the-loop data scientist or ML engineer, even though the “Python part” may become thinner and thinner as the lower layers of the stack are re-engineered for performance and agentic engineering productivity.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
</item>
<item>
    <title>Internal vs. External Storage? What&#39;s the Limit of External Tables</title>
    <link>https://www.ssp.sh/blog/modern-external-tables-and-evolution/</link>
    <pubDate>Thu, 14 May 2026 00:08:08 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/modern-external-tables-and-evolution/</guid><enclosure url="https://www.ssp.sh/blog/modern-external-tables-and-evolution/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>When I started my career as a data warehouse engineer and business intelligence engineer in 2008, external tables with materialized views were the standard. We used external tables to integrate CSV files and other data not already in Oracle databases. Oracle External Tables have existed since 2001, and that&rsquo;s where I first used them. If the Lindy Effect continues to hold, we&rsquo;ll use external tables even longer. But why have they survived for so long?</p>
<p>The core question is: &ldquo;When should you store data internally in your warehouse versus externally in object storage?&rdquo;. Hot data queried frequently goes inside. Cold archival data stays external, where it&rsquo;s cheaper but slower. Interestingly, Databricks and BigQuery recently added external table features, but why? Not because they&rsquo;re trendy, but because the economics still work.</p>
<p>This article offers an inside look at external tables, their 25-year history, how they evolved from CSV parsers to ACID lakehouse tables, and whether you need to know about them today.</p>
<h2 id="what-are-external-tables">What Are External Tables?</h2>
<p>So what are external tables, and why have we been using them for so long? Why don&rsquo;t we just use the internal storage of a database?</p>
<p>In Oracle, where I first used them in 2008, they allowed you — and still do — to access data in external tables. External tables are defined as <strong>tables that do not reside in the database</strong>, and can be in any format for which an access driver<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> is provided. All of this is provided via <a href="https://en.wikipedia.org/wiki/Data_definition_language" target="_blank" rel="noopener noreffer">DDL</a> (Data Definition Language) of the database, describing an external table with all its columns, data types, etc., exposing the data as if it were residing in a regular database table.</p>
<p>The external data can be queried in parallel and <strong>queried directly using SQL</strong>. Essentially, it&rsquo;s read-only access to data stored outside of our database, making it available in a tabular, easy-to-work-with format to interact with existing tooling and language. In 2008, this was through procedural language such as PL-SQL in Oracle or T-SQL on MSSQL.</p>
<p>Today, external tables have evolved. The biggest change is that they can read more formats including semi-structured data such as Parquet, JSON, Avro, and ORC. While CSV was readable in 2008, the difference today is the columnar formats and nested formats that enable faster analytics. These are available for downstream processes and dashboards, but mostly accessed through SQL queries in one form or another.</p>
<p>A modern definition comes from <a href="https://research.google/pubs/biglake-bigquerys-evolution-toward-a-multi-cloud-lakehouse/" target="_blank" rel="noopener noreffer">BigLake</a>, an evolution of BigQuery toward a multi-cloud lakehouse that tries to solve key customer requirements around the unification of data lake and enterprise data warehousing workloads. External tables were <a href="https://docs.cloud.google.com/bigquery/docs/external-tables" target="_blank" rel="noopener noreffer">introduced</a> in 2015 as part of it<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>:</p>
<blockquote>
<p>External tables are stored outside of BigQuery storage and refer to data that&rsquo;s stored outside of BigQuery. [..] Google Non-BigLake external tables let you query structured data in external data stores. To query a non-BigLake external table, you must have permissions to both the external table and the external data source.</p>
</blockquote>
<p>Snowflake <a href="https://docs.snowflake.com/en/sql-reference/sql/create-external-table" target="_blank" rel="noopener noreffer">defines</a> them as:</p>
<blockquote>
<p>[&hellip;]  When queried, an external table reads data from a set of one or more files in a specified external stage, and then outputs the data in a single VARIANT column. Additional columns can be defined, with each column definition consisting of a name, data type, and optionally whether the column requires a value (NOT NULL) or has any referential integrity constraints.</p>
</blockquote>
<p>External tables were <a href="https://www.snowflake.com/en/blog/external-tables-are-now-generally-available-on-snowflake/" target="_blank" rel="noopener noreffer">added in 2021</a>, and Snowflake described their benefits as follows:</p>
<blockquote>External Tables Address Key Data Lake Challenges:
<ol>
<li>To <strong>augment an existing data lake</strong>. [..] augment their existing data lake, rather than replace it. The External Tables feature enables that use case. Customers can use external tables to query the data in their data lake without ingesting it into Snowflake. (side note: MVs<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>)</li>
<li>Ad-hoc analytics. Customers often use external tables to <strong>run ad-hoc queries directly on raw data before ingesting the data</strong> into Snowflake. Ad-hoc queries help them evaluate data sets and determine further actions.</li>
</ol>
</blockquote>
<div class="mermaid" id="id-7"></div>
<h3 id="just-a-pointer-symlink">Just a Pointer (Symlink)?</h3>
<p>A simple analogy is a <strong>symlink in Linux</strong>, where you point from your current directory to another directory without moving data. You just add a pointer. If you read that file from that symlink, all it does is read it from the location the symlink points to.</p>
<p>An external table is the same, just a <strong>pointer</strong> to external data, bringing that data into the current data warehouse or cloud solution, hence the word external. You define the source format such as XML, CSV, etc., and define their structure, and then you can query that at any time. It&rsquo;s similar to a SQL View in that sense, but pointing to non-internal data.</p>
<p>Running <code>DROP TABLE</code> and deleting an external table is metadata-based only. No data is removed, only the table definition from the internal data catalog. The same is true with a symlink. Almost any relational database today has support for it, even if it&rsquo;s not called an external table. Everyone occasionally needs to read data outside of its warehouse or database.</p>
<h2 id="recap-in-the-history-of-external-tables">Recap in the History of External Tables</h2>
<p>Looking back, external tables have a long history: they&rsquo;ve been a <strong>recurring pattern</strong> across every generation of database technology since the early 2000s, and arguably longer if you count IBM&rsquo;s federated database concepts from the late 1990s.</p>
<div class="mermaid" id="id-8"></div>
<h3 id="the-origin-story-iso-in-2001">The Origin Story: ISO in 2001</h3>
<p>The history starts with <a href="https://www.iso.org/standard/31370.html" target="_blank" rel="noopener noreffer">ISO/IEC 9075-9</a>, published in 2001. Part 9 of the SQL standard defined foreign-data wrappers and datalink types for managing external data from within SQL. The work was completed in late 2000 and published alongside SQL:1999, with full integration in SQL:2003 (it was later <a href="https://www.iso.org/standard/84804.html" target="_blank" rel="noopener noreffer">updated in 2023</a>).</p>
<p>It was the initial definition and extensions to database language SQL to support management of external data <strong>through the use of foreign-data wrappers and datalink types</strong>.</p>
<p>My first encounter was with Oracle external tables, but according to <a href="https://en.wikipedia.org/wiki/Open_Database_Connectivity" target="_blank" rel="noopener noreffer">Wikipedia</a> there were earlier implementations. <strong>Microsoft Access linked tables (~1992)</strong> were the earliest consumer-facing implementation, where users could link dBASE, Paradox, text files, and ODBC sources as if they were Access tables. <strong>ODBC 1.0 (1992)</strong> itself established the first standard for heterogeneous data access across databases, though it didn&rsquo;t create table abstractions.</p>
<p>Further, <strong><a href="https://www.mcpressonline.com/analytics-cognitive/db2/the-as400-and-ibms-db2-datajoiner" target="_blank" rel="noopener noreffer">IBM&rsquo;s DB2 DataJoiner</a> (~1995)</strong> was more ambitious with a middleware product enabling SQL queries across Oracle, Sybase, SQL Server, Informix, Teradata, and even VSAM files through a unified interface. With <strong>SQL Server 7.0&rsquo;s Linked Servers (1998)</strong> we got federated querying to Microsoft&rsquo;s ecosystem via <strong>OLE DB</strong>, supporting cross-database joins with four-part naming conventions.</p>
<p>Most of these implementations shared a common limitation that Oracle (<a href="https://oracle-base.com/articles/9i/sql-new-features-9i" target="_blank" rel="noopener noreffer">9i Release 1 - 9.0.1</a> in 2001) solved: they focused on querying <em>other databases</em> or required middleware. Oracle&rsquo;s abstraction treated local flat files as first-class read-only table objects using the familiar <code>CREATE TABLE ... ORGANIZATION EXTERNAL</code> DDL syntax, providing a simple way to define external files as part of normal table creation and allowing ORACLE_LOADER access to query flat files (CSV, fixed-width, delimited) through DBAs.</p>
<p>It was an early way of separating declaration from compute (the Oracle loaders).</p>
<h2 id="why-external-tables-what-are-their-benefits">Why External Tables? What Are Their Benefits?</h2>
<p>But why use external tables? What makes them so useful that they persisted? Why have they <strong>survived so long</strong>, and why are they getting added to Databricks and other major platforms?</p>
<p>For that, we need to look at external tables&rsquo; benefits. The first reason is that external tables can simplify data access to <strong>avoid developing ETL pipelines</strong>, moving data out of the source, and re-ingesting it in our data warehouse. They make external data accessible easily, defined in a tabular form by a database schema with column types. Typical cloud data warehouses like Snowflake and Azure use them to link existing data from object storage easily without moving data. This makes the object storage files accessible for almost any downstream tool or query language in a simple and cost-effective way.</p>
<p>Other ways of using them are to store some data on <strong>cheaper storage</strong> (e.g., object storage over data warehouse storage) and only link them in. It&rsquo;s slower to fetch, but more affordable to keep. If you have large data sets, cost savings can be immense as this article <a href="https://medium.com/@abhidutty/optimize-data-storage-costs-by-70-using-databricks-snowflake-aws-s3-332f44949e93" target="_blank" rel="noopener noreffer">shows</a>, bringing down Snowflake internal storage cost from ~$23/TB/month to S3 infrequent access with ~$12.50/TB or S3 Glacier Deep Archive with only ~$1/TB.</p>
<p>Another handy side effect as the consumer of external table data is that the <strong>data is always up to date</strong>, because no refresh or update is needed. It goes without saying that this has its own downsides and can be a problem for the owner of the data if it&rsquo;s used in production and the ETL process reads large amounts of data through external tables. This will affect upstream apps running or owning this data.</p>
<p>That&rsquo;s why many use external tables in combination with materialized views (MVs) to truncate and recreate a daily snapshot (or similar) during off-peak (mostly nights) of this data, avoiding affecting production data and even optimizing query performance with added indices for downstream queries.</p>
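<p>As a minimal sketch of that snapshot pattern, assuming Oracle syntax and an already defined external table: the names <code>ext_orders</code>, <code>orders_snapshot_mv</code> and the column list are hypothetical and only serve as illustration. The materialized view holds a local copy that can be indexed, and a scheduled complete refresh re-reads the external files during off-peak hours.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical names: ext_orders (existing external table), orders_snapshot_mv.
-- Build a local snapshot once; afterwards it only changes when refreshed.
CREATE MATERIALIZED VIEW orders_snapshot_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT order_id, customer_id, order_date, amount
FROM   ext_orders;

-- Index the snapshot for downstream queries.
CREATE INDEX orders_snapshot_cust_ix ON orders_snapshot_mv (customer_id);

-- Called from a nightly job: a complete refresh re-reads the external files.
BEGIN
  DBMS_MVIEW.REFRESH('ORDERS_SNAPSHOT_MV', method =&gt; 'C');
END;
/
</code></pre></div>
<p>Downstream queries then read <code>orders_snapshot_mv</code> instead of the external table, so the source files are only touched once per refresh.</p>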
<h3 id="when-internal-and-when-external-data-whats-the-limit-of-external">When Internal and When External Data? What&rsquo;s the Limit of External?</h3>
<p>The tradeoffs to weigh before using external tables come down to how often the data is queried, i.e. the hot versus cold question. The table below shows it in more detail:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th><strong>Internal Storage</strong></th>
          <th><strong>External Tables</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Temperature</td>
          <td><strong>Hot</strong>: recent data, lasts weeks to months</td>
          <td><strong>Cold</strong>: archival or infrequently touched</td>
      </tr>
      <tr>
          <td>Typical use case</td>
          <td>Dashboards, frequent queries, sub-second latency</td>
          <td>Archival, ad-hoc exploration, augmenting a data lake</td>
      </tr>
      <tr>
          <td>Query speed</td>
          <td>Fast, optimized for repeated access</td>
          <td>Slower (a 1.3×–1.7× tax in the below dashboard benchmark)</td>
      </tr>
      <tr>
          <td>Storage cost</td>
          <td>Higher (warehouse-managed, ~$23/TB on Snowflake capacity)</td>
          <td>Lower: up to ca. 20× cheaper on S3 Glacier Deep Archive (~$1/TB)</td>
      </tr>
      <tr>
          <td>Data freshness</td>
          <td>Can go stale between ETL refreshes</td>
          <td>Always up to date, no refresh needed</td>
      </tr>
      <tr>
          <td>Setup effort</td>
          <td>Requires ETL pipelines, scripts or re-ingestion</td>
          <td>Simple DDL-only definition, data stays in place</td>
      </tr>
      <tr>
          <td>Scaling concern</td>
          <td>Disk grows faster than compute needs</td>
          <td>Heavy reads can affect upstream apps owning the source files</td>
      </tr>
      <tr>
          <td>Operational overhead</td>
          <td>Predictable, managed by the warehouse</td>
          <td>Small-file problem and manifest management for tiny or streaming datasets</td>
      </tr>
  </tbody>
</table>
<p>In the era of data lake and lakehouse architectures, this is an important consideration. VSCO <a href="https://eng.vsco.co/querying-s3-data-with-redshift-spectrum/" target="_blank" rel="noopener noreffer">says</a>: &ldquo;disk space was growing more quickly than our compute needs,&rdquo; which is what triggered the adoption of external tables.</p>
<p>It depends on your use case. If you need to do analytics across various sources with joins and augmentation of your data at an enterprise, you probably want to focus on loading data into your database or data warehouse, an architectural pattern that has survived more than 30 years. But if you have data that is external and small and you want to join it with existing data, or you always need fresh data and can live with a slower response time (maybe because it runs during the night), external tables are a good fit.</p>
<p>In any case, external tables are a good approach to keep in mind and a valuable tool to have in your <a href="https://motherduck.com/blog/data-engineering-toolkit-essential-tools/" target="_blank" rel="noopener noreffer">toolkit</a>.</p>
<h3 id="they-work-well-with-existing-tech-and-common-patterns">They Work Well with Existing Tech and Common Patterns</h3>
<p>Obviously, today&rsquo;s external tables are not the same as the earliest ones in Microsoft Access, but the principle of accessing data outside your system is still the same. Nowadays we have more support, new formats besides CSV and JSON. We can do Parquet or open table formats.</p>
<p>As mentioned, they work well with related long-lasting data warehouse patterns and applications such as materialized views and stored procedures. The recurring pattern is to access external data with your data management system, similar to the pattern of materialized views that refresh complex SQL statements and make them fast, and stored procedures that run glue code within your database.</p>
<p>Moreover, there are temporary tables, which are similar but only available during a transaction or session. They all follow the same Lindy effect: Databricks <a href="https://www.databricks.com/blog/introducing-temporary-tables-databricks-sql" target="_blank" rel="noopener noreffer">announced temporary table support</a> on December 9th, 2025, and Databricks SQL stored procedures a <a href="https://www.databricks.com/blog/introducing-sql-stored-procedures-databricks" target="_blank" rel="noopener noreffer">little earlier</a>, on August 14th, 2025, for reusing existing SQL statements.</p>
<p>Again and again, <strong>everything that is old will be new again</strong>. Exactly what the Lindy Effect is all about. We can clearly say that the Lindy effect over the last 33 years applies here. The longer something is in place, the more likely it is to be around for at least that long.</p>
<blockquote>
<p>[!info] External vs. Temporary Table</p>
<p>In contrast: temp table = session-scoped, writable, fast, invisible to others, auto-dropped. External table = persistent metadata, read-only, infinite size, visible to all, optimized for cost.</p>
<p>A common chain in practice is going from: <code>external table → temp/transient table → permanent managed table</code>.</p>
</blockquote>
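<p>The chain from the note above could look roughly like this in Oracle-flavoured SQL. This is a sketch under assumptions, not a reference implementation: <code>ext_orders</code>, <code>stg_orders_tmp</code>, <code>orders</code> and their columns are hypothetical names, and the filter stands in for whatever cleaning your data needs.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical names throughout: ext_orders (external table), stg_orders_tmp, orders.

-- 1) Session-scoped staging table with the same columns as the external table.
CREATE GLOBAL TEMPORARY TABLE stg_orders_tmp
  ON COMMIT PRESERVE ROWS
AS
SELECT * FROM ext_orders WHERE 1 = 0;

-- 2) Read the external files once and keep only the rows worth loading.
INSERT INTO stg_orders_tmp
SELECT * FROM ext_orders
WHERE  order_id IS NOT NULL;

-- 3) Promote the cleaned rows into the permanent, managed table.
INSERT INTO orders (order_id, customer_id, order_date, amount)
SELECT order_id, customer_id, order_date, amount
FROM   stg_orders_tmp;

COMMIT;
</code></pre></div>
<p>The external files are scanned only in step 2; everything after that works on data held inside the database.</p>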
<h3 id="how-a-classical-external-table-works">How a Classical External Table Works</h3>
<p>To understand how traditional external tables work, let&rsquo;s first look at Oracle, which has built an extensive syntax around them and where they still work this way today.</p>
<p>First, we can create a place for external data called <code>DIRECTORIES</code>, which is simply a pointer or alias to a file system location where external files already exist:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">OR</span><span class="w"> </span><span class="k">REPLACE</span><span class="w"> </span><span class="n">DIRECTORY</span><span class="w"> </span><span class="n">admin_dat_dir</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">AS</span><span class="w"> </span><span class="s1">&#39;/flatfiles/data&#39;</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>This directory can point to local file systems, NFS mounts, or even cloud object storage today (with the <code>ORACLE_BIGDATA</code> driver for S3, OCI, Azure). The <code>DIRECTORIES</code> don&rsquo;t require moving data, though you could prepare those files via ETL pipelines or third-party tools, or they can be generated directly by applications.</p>
<p>We can now create an external table based on this directory, e.g., log files, bad data that we store externally, JSON files, and make data accessible inside the <a href="https://en.wikipedia.org/wiki/Information_schema" target="_blank" rel="noopener noreffer">INFORMATION_SCHEMA</a> and with plain SQL, as if it were internal.</p>
<p>Creating an external table:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">admin_ext_employees</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                   </span><span class="p">(</span><span class="n">employee_id</span><span class="w">       </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">4</span><span class="p">),</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">first_name</span><span class="w">        </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">20</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">last_name</span><span class="w">         </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">25</span><span class="p">),</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">job_id</span><span class="w">            </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">manager_id</span><span class="w">        </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">4</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">hire_date</span><span class="w">         </span><span class="nb">DATE</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">salary</span><span class="w">            </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">commission_pct</span><span class="w">    </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">department_id</span><span class="w">     </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">4</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">email</span><span class="w">             </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">25</span><span class="p">)</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">                   </span><span class="p">)</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">     </span><span class="n">ORGANIZATION</span><span class="w"> </span><span class="k">EXTERNAL</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">     </span><span class="p">(</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="k">TYPE</span><span class="w"> </span><span class="n">ORACLE_LOADER</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="k">DEFAULT</span><span class="w"> </span><span class="n">DIRECTORY</span><span class="w"> </span><span class="n">admin_dat_dir</span><span class="w">  </span><span class="c1">--notice this dir with above
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="k">ACCESS</span><span class="w"> </span><span class="k">PARAMETERS</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="p">(</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="n">records</span><span class="w"> </span><span class="n">delimited</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="n">newline</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="n">badfile</span><span class="w"> </span><span class="n">admin_bad_dir</span><span class="p">:</span><span class="s1">&#39;empxt%a_%p.bad&#39;</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="n">logfile</span><span class="w"> </span><span class="n">admin_log_dir</span><span class="p">:</span><span class="s1">&#39;empxt%a_%p.log&#39;</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="n">fields</span><span class="w"> </span><span class="n">terminated</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="s1">&#39;,&#39;</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="n">missing</span><span class="w"> </span><span class="n">field</span><span class="w"> </span><span class="k">values</span><span class="w"> </span><span class="k">are</span><span class="w"> </span><span class="k">null</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="p">(</span><span class="w"> </span><span class="n">employee_id</span><span class="p">,</span><span class="w"> </span><span class="n">first_name</span><span class="p">,</span><span class="w"> </span><span class="n">last_name</span><span class="p">,</span><span class="w"> </span><span class="n">job_id</span><span class="p">,</span><span class="w"> </span><span class="n">manager_id</span><span class="p">,</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">           </span><span class="n">hire_date</span><span class="w"> </span><span class="nb">char</span><span class="w"> </span><span class="n">date_format</span><span class="w"> </span><span class="nb">date</span><span class="w"> </span><span class="n">mask</span><span class="w"> </span><span class="s2">&#34;dd-mon-yyyy&#34;</span><span class="p">,</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">           </span><span class="n">salary</span><span class="p">,</span><span class="w"> </span><span class="n">commission_pct</span><span class="p">,</span><span class="w"> </span><span class="n">department_id</span><span class="p">,</span><span class="w"> </span><span class="n">email</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="p">)</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="p">)</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="k">LOCATION</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;empxt1.dat&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;empxt2.dat&#39;</span><span class="p">)</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">     </span><span class="p">)</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">     </span><span class="n">PARALLEL</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">     </span><span class="n">REJECT</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="n">UNLIMITED</span><span class="p">;</span><span class="w"> 
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>The first and most important choice is <code>TYPE</code>, which determines the access driver and what kind of files you can read: <code>ORACLE_LOADER</code> for plain text files like CSV or logs (read-only), <code>ORACLE_DATAPUMP</code> for Oracle binary dump files, <code>ORACLE_BIGDATA</code> for cloud object stores like S3 or OCI in formats like Parquet or Avro, and <code>ORACLE_HIVE</code> for Hadoop/Hive data. The <code>DEFAULT DIRECTORY</code> points to a server-side path alias, and <code>LOCATION</code> names the actual file(s), with wildcard support (<code>*.dat</code>) so you can load a whole batch at once.</p>
<p>The <code>ACCESS PARAMETERS</code> block is where you control parsing: row and field delimiters, null handling, custom date format masks, and where to write bad rows (<code>badfile</code>) and parse logs (<code>logfile</code>). On top of that, <code>PARALLEL</code> lets Oracle split file reading across multiple processes for large files, and <code>REJECT LIMIT</code> controls fault tolerance. Set it to <code>UNLIMITED</code> to skip bad rows silently, or <code>0</code> to fail immediately on the first error.</p>
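<p>Once defined, the table from the example behaves like any other table for reads, and its external properties can be changed without touching the files. The statements below are a small sketch of that; the file name <code>empxt3.dat</code> is a hypothetical addition and not part of the original example.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Plain SQL over the flat files, parsed at query time by the access driver.
SELECT department_id,
       COUNT(*)    AS employees,
       AVG(salary) AS avg_salary
FROM   admin_ext_employees
GROUP  BY department_id
ORDER  BY department_id;

-- Tighten fault tolerance: fail on the first row that cannot be parsed.
ALTER TABLE admin_ext_employees REJECT LIMIT 0;

-- Point the same definition at a different file (hypothetical file name).
ALTER TABLE admin_ext_employees LOCATION ('empxt3.dat');

-- Metadata-only: removes the definition, the .dat files stay where they are.
DROP TABLE admin_ext_employees;
</code></pre></div>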
<p>You see lots of built-in features that we can use compared to building a full-fledged data pipeline. Instead of exporting and importing CSVs from the source databases or developing a complex CDC pipeline that traditionally looked something like: <code>source OLTP --&gt; CSVs --&gt; IDW (reports on yesterday) --&gt; ingest into DWH for long-term analytics</code>, we can just define a table based on external data and access it as part of our pipeline.</p>
<blockquote>
<p>[!tip] The INFORMATION_SCHEMA analogy</p>
<p>You are probably familiar with the INFORMATION_SCHEMA of a database. It&rsquo;s the <strong>internal data catalog</strong> that most databases provide and it contains a <strong>list of all tables and all metadata</strong> such as columns, data types, etc. The neat thing is that external tables will show up as internal tables once defined.</p>
</blockquote>
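<p>One caveat for the Oracle example: Oracle does not ship a schema literally named <code>INFORMATION_SCHEMA</code>. Its data dictionary views play that role. A minimal sketch of looking up the external table definitions of the current user:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- One row per external table owned by the current user.
SELECT table_name, type_name, default_directory_name, reject_limit
FROM   user_external_tables;

-- The files each external table points to.
SELECT table_name, directory_name, location
FROM   user_external_locations;
</code></pre></div>
<p>In engines that do implement <code>INFORMATION_SCHEMA</code>, the equivalent lookup is usually a query against its table listing, with the exact columns depending on the engine.</p>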
<h2 id="whats-the-modern-version-of-external-tables-today">What&rsquo;s the Modern Version of External Tables Today?</h2>
<p>To preface: the previous Oracle example shows the classical approach, where an external table is created with DDL (<code>CREATE TABLE ... ORGANIZATION EXTERNAL</code> in Oracle, <code>CREATE EXTERNAL TABLE</code> in many other engines) and is a first-class object in the data catalog. What follows in this chapter is the next evolution, where external tables are not necessarily created with DDL, but in another way, achieving the same outcome of querying data in place without loading it. Let&rsquo;s see what these are.</p>
<h3 id="integrated-into-warehouses">Integrated into Warehouses</h3>
<p>Most modern warehouses - Snowflake, Redshift Spectrum, BigQuery, Athena, Synapse - come with a simplified version of <code>CREATE EXTERNAL TABLE</code>. Compared to the Oracle example, the schema is usually inferred from the file format (especially Parquet), S3 or another object store is the default backing location, and the parsing ceremony disappears. The pseudo-code looks roughly like this across engines:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- Pseudo-code: modern external table over Parquet on S3
</span></span></span><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">EXTERNAL</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">sales</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">LOCATION</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;s3://my-bucket/sales/&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">FORMAT</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;PARQUET&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">);</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Object storage like S3, GCS, and Azure Blob has become the first-class citizen for external data. From here, the ecosystem layers on: dbt wraps this in YAML, DuckDB skips the DDL entirely in favor of schema-on-read, and open table formats add transactional guarantees on top.</p>
<h3 id="external-tables-with-dbt">External Tables with dbt?</h3>
<p>On top of this base SQL form, dbt adds a YAML layer through its own package, <a href="https://github.com/dbt-labs/dbt-external-tables" target="_blank" rel="noopener noreffer"><code>dbt-external-tables</code></a>. It&rsquo;s one of the most-used dbt packages, though it seems less actively maintained now.</p>
<p>The external table is defined via YAML, and there are lots of options to set, with the most important being <code>external</code> and its <code>location</code>, but also defining <code>columns</code> in different ways such as inference or the <code>meta</code> tag:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="m">2</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">sources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">snowplow</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">tables</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">event</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="p">&gt;</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">            This source table is actually a set of files in external storage.
</span></span></span><span class="line"><span class="cl"><span class="sd">            The dbt-external-tables package provides handy macros for getting
</span></span></span><span class="line"><span class="cl"><span class="sd">            those files queryable, just in time for modeling.
</span></span></span><span class="line"><span class="cl"><span class="sd">                            </span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">external</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">location:         # required</span><span class="p">:</span><span class="w"> </span><span class="l">S3 file path, GCS file path, Snowflake stage, Synapse data source</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="l">...              </span><span class="w"> </span><span class="c"># database-specific properties of external table</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">partitions</span><span class="p">:</span><span class="w">       </span><span class="c"># optional</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">collector_date</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">              </span><span class="nt">data_type</span><span class="p">:</span><span class="w"> </span><span class="l">date</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">              </span><span class="l">...          </span><span class="w"> </span><span class="c"># database-specific properties</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="c"># Specify ALL column names + datatypes.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="c"># Column order must match for CSVs, column names must match for other formats.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="c"># Some databases support schema inference.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">columns</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">app_id</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">data_type</span><span class="p">:</span><span class="w"> </span><span class="l">varchar(255)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Application ID&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">platform</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">data_type</span><span class="p">:</span><span class="w"> </span><span class="l">varchar(255)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Platform&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="l">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="c"># Use `meta` to pass custom column properties (e.g. alias, expression)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">columns</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">raw_timestamp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">data_type</span><span class="p">:</span><span class="w"> </span><span class="l">timestamp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">config</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">              </span><span class="nt">meta</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="nt">alias</span><span class="p">:</span><span class="w"> </span><span class="l">event_timestamp      </span><span class="w"> </span><span class="c"># rename the column in the external table</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="nt">expression</span><span class="p">:</span><span class="w"> </span><span class="l">TO_TIMESTAMP(...)</span><span class="w"> </span><span class="c"># custom SQL expression instead of default value extraction</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>This is a nice improvement over the ODBC GUI. It&rsquo;s not exactly an apples-to-apples comparison, as dbt itself is not a database. But its supported destinations, such as Redshift (Spectrum), Snowflake, BigQuery, Spark, Synapse, and Azure SQL, show where the external table definition ends up: in the destination itself, which is usually a data warehouse.</p>
<h3 id="duckdb-with-dbt">DuckDB with dbt</h3>
<p>If you use dbt, you can also use DuckDB with dbt via <a href="https://github.com/duckdb/dbt-duckdb" target="_blank" rel="noopener noreffer">dbt-duckdb</a>, which is more actively maintained. But DuckDB doesn&rsquo;t have external tables, right?</p>
<p>Right, DuckDB doesn&rsquo;t have <code>CREATE EXTERNAL TABLE</code> syntax <a href="https://github.com/duckdb/duckdb/discussions/14422" target="_blank" rel="noopener noreffer">yet</a>, mostly because it is an in-process database that reads files directly, but you can achieve the same functionality through other means. DuckDB can be used not only as a database but also as a zero-copy SQL connector (see all categories at <a href="/blog/enterprise-case-duckdb-key-categories/" rel="">5 Key Categories</a>). We can just point it to an external source, as shown above with dbt. The difference is that DuckDB is both a database and a compute engine, which makes ad-hoc reads possible directly without a DDL definition, similar to an external table with Oracle loaders. With dbt, we can declare this neatly in dbt configs.</p>
<p>With DuckDB, you can query &ldquo;external data&rdquo; extremely fast over HTTPS or locally in formats such as Parquet, CSV, and <a href="https://duckdb.org/docs/current/data/data_sources" target="_blank" rel="noopener noreffer">many more</a>, so the need for formal external tables is reduced since DuckDB does <strong>schema on read</strong>.</p>
<p>If you want to define the database schema ahead of time, we&rsquo;d use external tables to do that and effectively have <strong>schema on write</strong> (though we don&rsquo;t write, just define the DDL table structure and data types), which is more of the classical ETL approach.</p>
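<p>The contrast between the two approaches can be sketched in a few lines of Python. This is a toy, stdlib-only illustration, not how DuckDB sniffs or enforces types; the function names and the inline CSV are made up for the example. The first reader derives the column types from the data it sees, the second is handed the types up front and fails if the data does not fit.</p>

```python
import csv
import io

CSV_TEXT = "id,city,price\n1,Bern,10.5\n2,Zurich,7\n"


def infer_type(values):
    """Pick the narrowest type (int, then float, then str) that fits every value."""
    for caster in (int, float):
        try:
            for value in values:
                caster(value)
            return caster
        except ValueError:
            continue
    return str


def read_with_inferred_schema(text):
    """Schema on read: look at the data first, then derive the column types."""
    rows = list(csv.DictReader(io.StringIO(text)))
    schema = {col: infer_type([row[col] for row in rows]) for col in rows[0]}
    return schema, [{col: schema[col](row[col]) for col in row} for row in rows]


def read_with_declared_schema(text, schema):
    """Declared schema: types are fixed up front; mismatching data raises ValueError."""
    rows = csv.DictReader(io.StringIO(text))
    return [{col: cast(row[col]) for col, cast in schema.items()} for row in rows]


inferred_schema, inferred_rows = read_with_inferred_schema(CSV_TEXT)
print({col: t.__name__ for col, t in inferred_schema.items()})
# {'id': 'int', 'city': 'str', 'price': 'float'}

declared_rows = read_with_declared_schema(
    CSV_TEXT, {"id": int, "city": str, "price": float}
)
print(declared_rows[1])  # {'id': 2, 'city': 'Zurich', 'price': 7.0}
```

<p>The trade-off is the usual one: inference is convenient for ad-hoc exploration, while a declared schema surfaces bad data early instead of silently widening a column to text.</p>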
<p>Here&rsquo;s an example with <code>external_location</code> to read external data with dbt:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">sources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">external_source</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">config</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">external_location</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;s3://my-bucket/my-sources/{name}.parquet&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">tables</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">source1</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Read more at <a href="https://duckdb.org/2025/04/04/dbt-duckdb" target="_blank" rel="noopener noreffer">Fully Local Data Transformation with dbt and DuckDB</a>.</p>
<p>Another option is a database view: DuckDB supports <strong><code>CREATE VIEW</code> over <code>read_parquet()</code></strong>. You can ship a <code>.duckdb</code> file to clients with pre-defined views over S3 data, so clients don&rsquo;t need to know about the underlying data, Hive partitioning, or even glob patterns, which is very similar to what a formal <code>CREATE EXTERNAL TABLE</code> would do.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">read_parquet</span><span class="p">(</span><span class="s1">&#39;s3://lake/events/**/*.parquet&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">hive_partitioning</span><span class="o">=</span><span class="k">true</span><span class="p">);</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Or similarly use <code>ATTACH</code> to directly point to Postgres, MySQL, SQLite, S3, and others:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- Postgres (binary wire protocol, predicate + projection pushdown, read+write)
</span></span></span><span class="line"><span class="cl"><span class="n">INSTALL</span><span class="w"> </span><span class="n">postgres</span><span class="p">;</span><span class="w"> </span><span class="k">LOAD</span><span class="w"> </span><span class="n">postgres</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">ATTACH</span><span class="w"> </span><span class="s1">&#39;dbname=postgres user=postgres host=127.0.0.1&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">pg</span><span class="w"> </span><span class="p">(</span><span class="k">TYPE</span><span class="w"> </span><span class="n">postgres</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">ATTACH</span><span class="w"> </span><span class="s1">&#39;postgresql://user@host/db&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">pg_ro</span><span class="w"> </span><span class="p">(</span><span class="k">TYPE</span><span class="w"> </span><span class="n">postgres</span><span class="p">,</span><span class="w"> </span><span class="n">READ_ONLY</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- MySQL (via MariaDB Connector/C; Postgres-style key=value string even for MySQL, an easy trap)
</span></span></span><span class="line"><span class="cl"><span class="n">INSTALL</span><span class="w"> </span><span class="n">mysql</span><span class="p">;</span><span class="w"> </span><span class="k">LOAD</span><span class="w"> </span><span class="n">mysql</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">ATTACH</span><span class="w"> </span><span class="s1">&#39;host=localhost user=root port=0 database=mysql&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">mdb</span><span class="w"> </span><span class="p">(</span><span class="k">TYPE</span><span class="w"> </span><span class="n">mysql</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- SQLite (file opens directly; multi-reader single-writer by SQLite file locks)
</span></span></span><span class="line"><span class="cl"><span class="n">INSTALL</span><span class="w"> </span><span class="n">sqlite</span><span class="p">;</span><span class="w"> </span><span class="k">LOAD</span><span class="w"> </span><span class="n">sqlite</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">ATTACH</span><span class="w"> </span><span class="s1">&#39;sakila.db&#39;</span><span class="w"> </span><span class="p">(</span><span class="k">TYPE</span><span class="w"> </span><span class="n">sqlite</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- Generic remote DuckDB file
</span></span></span><span class="line"><span class="cl"><span class="n">ATTACH</span><span class="w"> </span><span class="s1">&#39;s3://duckdb-blobs/databases/stations.duckdb&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">stations_db</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><h3 id="open-table-formats-and-lakehouse-architecture">Open Table Formats and Lakehouse Architecture</h3>
<p>That raises the question of whether <a href="https://motherduck.com/blog/open-lakehouse-stack-duckdb-table-formats/" target="_blank" rel="noopener noreffer">Open Table Formats</a> are the next evolution, the modern form of external tables. These table formats allow almost any SQL compute engine to use them as external tables, and to read, compute, and aggregate as a database would.</p>
<p>If we look at what table formats consist of, they&rsquo;re built on object storage with a file format like Parquet, plus a manifest file that lists the data files and thereby <strong>unifies multiple single files into a &ldquo;single&rdquo; table</strong> when seen from the outside.</p>
<p>So again, the manifest file is our pointer or fancier symlink, but it lives next to the data, unlike external tables. There&rsquo;s much more going on in table formats, but if we have a <strong>data lake with open table format tables</strong>, we can see how we define tables in DDL and the <strong>pointers are to different files</strong> (Parquet, ORC, Avro), in most cases Parquet.</p>
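<p>The &ldquo;manifest as pointer&rdquo; idea can be sketched with a tiny, hypothetical manifest. The JSON layout, the field names, and <code>files_for_table</code> are assumptions made up for this example; real formats such as Iceberg use richer, multi-level metadata. The point is only that a table name resolves to a list of files, and that keeping older file lists around is what makes reading a previous state possible.</p>

```python
import json

# Hypothetical, heavily simplified manifest that lives next to the data files.
MANIFEST_JSON = """
{
  "table": "events",
  "snapshots": [
    {"id": 1, "files": ["events/part-000.parquet"]},
    {"id": 2, "files": ["events/part-000.parquet", "events/part-001.parquet"]}
  ],
  "current_snapshot": 2
}
"""


def files_for_table(manifest, snapshot_id=None):
    """Resolve the table to the list of data files for one snapshot."""
    wanted = manifest["current_snapshot"] if snapshot_id is None else snapshot_id
    for snapshot in manifest["snapshots"]:
        if snapshot["id"] == wanted:
            return snapshot["files"]
    raise KeyError(f"unknown snapshot {wanted}")


manifest = json.loads(MANIFEST_JSON)
print(files_for_table(manifest))     # ['events/part-000.parquet', 'events/part-001.parquet']
print(files_for_table(manifest, 1))  # ['events/part-000.parquet']
```

<p>An engine that understands the manifest only has to read the listed files, which is why many different engines can treat the same set of files as one table.</p>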
<p>More broadly, we can say external tables decouple storage from compute. Open table formats decouple the table itself (schema, history, transactions, statistics) from any single engine.</p>
<h3 id="lakehouse-and-connecting-to-ducklake">Lakehouse and Connecting to DuckLake</h3>
<p>One step further is obviously a lakehouse architecture, with the shift from <em>format-agnostic file reading</em> to <em>governed, transactional, multi-engine open table formats</em>.</p>
<p>If you extend the external table idea to a <a href="https://motherduck.com/blog/from-data-lake-to-lakehouse-duckdb-portable-catalog/" target="_blank" rel="noopener noreffer">lakehouse architecture</a>, these external tables with open table formats provide essentially what databases provide with ACID guarantees, time travel, schema evolution, partition evolution, and fine-grained access control, but for files.</p>
<p>The difference is that the data stays in the open Parquet file format on customer-owned cloud storage. The external table, once a humble workaround for avoiding data loads, has become the architectural foundation of the data lakehouse, if you like this analogy.</p>
<p>With <a href="https://ducklake.select/" target="_blank" rel="noopener noreffer">DuckLake</a>, we have the next evolution just around the corner, bringing back exactly that missing database, especially to handle all the metadata of such a lakehouse and all its files. This means having durable and consistent database storage for our <a href="https://iceberg.apache.org/spec/#manifests" target="_blank" rel="noopener noreffer">manifest files</a>.</p>
<h4 id="open-data-catalog-to-complete-the-picture-the-odbc-glue">Open Data Catalog to Complete the Picture: The ODBC Glue</h4>
<p>With all these evolutions, we&rsquo;ve come far. When adding an <a href="https://www.ssp.sh/brain/open-table-format-catalogs" target="_blank" rel="noopener noreffer">Open Data Catalog</a>, we are exactly where we started: having an INFORMATION_SCHEMA, a dictionary with all our tables, in this case the open table format tables.</p>
<p>It&rsquo;s the <strong>glue that ODBC provided when connecting a BI tool to the underlying database</strong>. Now you&rsquo;d like to have an open data catalog that, in the best-case scenario, gives you all the tables and ways to connect.</p>
<p>But then again, the syntax of <code>EXTERNAL TABLES</code> still gets added, and <a href="https://arrow.apache.org/docs/format/ADBC.html" target="_blank" rel="noopener noreffer">ADBC</a> and DuckDB are doing a great job of using external data without needing a data lake and its technology stack altogether. For example, DuckDB has support for <a href="https://duckdb.org/docs/current/core_extensions/odbc/overview" target="_blank" rel="noopener noreffer">ODBC</a>, <a href="https://duckdb.org/docs/current/clients/adbc" target="_blank" rel="noopener noreffer">ADBC</a> and even <a href="https://duckdb.org/docs/current/clients/java" target="_blank" rel="noopener noreffer">JDBC</a>. That matters especially for 3rd-party tools: ADBC streams Apache Arrow end-to-end instead of serializing row-by-row, so BI tools and notebooks can pull millions of rows directly from external Parquet tables at speeds that previously required keeping data &ldquo;hot&rdquo; in a cloud data warehouse. 🙂</p>
<blockquote>
<p>[!note] ADBC, what is that?<br>
ODBC is 30+ years old, and we have a newer, faster version of it, called <a href="https://arrow.apache.org/docs/format/ADBC.html" target="_blank" rel="noopener noreffer">ADBC</a>. It&rsquo;s a faster way to connect to other databases with a columnar-oriented API instead of <strong>row-by-row serialization</strong>, heavily making use of Apache Arrow.</p>
<p>While ADBC is newer, it tries to support the same drivers as ODBC, but faster and easier to install. E.g., it has a handy <a href="https://github.com/columnar-tech/dbc" target="_blank" rel="noopener noreffer">dbc</a> CLI to install drivers for almost any programming language, so there is no more manual, error-prone downloading of ODBC drivers and definitions through a Windows GUI, just one CLI command.</p>
</blockquote>
<blockquote>
<p>[!tip] Using MotherDuck<br>
If you want a data warehouse that just works, integrates well with DuckDB, and has support for DuckLake, you can always use managed MotherDuck. You can build a classical data warehouse with plain SQL, you can read external data easily with DuckDB or dbt-duckdb, or <a href="https://motherduck.com/blog/announcing-ducklake-1-0-on-motherduck/" target="_blank" rel="noopener noreffer">integrate with DuckLake</a>.</p>
<p>It works great <a href="https://motherduck.com/blog/motherduck-agent-skills/" target="_blank" rel="noopener noreffer">with agents</a>. Check out MotherDuck&rsquo;s <a href="https://github.com/motherduckdb/agent-skills/" target="_blank" rel="noopener noreffer">agent-skills</a> for opinionated AI skills for building applications with MotherDuck. And <a href="https://motherduck.com/product/dives/" target="_blank" rel="noopener noreffer">visualize with Dives</a> with one prompt.</p>
</blockquote>
<h2 id="which-is-faster-a-quick-benchmark">Which Is Faster? A Quick Benchmark</h2>
<p>To put numbers behind the hot/cold decision, I ran a simple benchmark on the TPC-H SF=1 <code>lineitem</code> table (6M rows, ~150 MB), stored four ways: inside a DuckDB file (internal), as raw Parquet, as an Iceberg table, and as a DuckLake table. Full code: <a href="https://github.com/sspaeti/external-table-benchmark/blob/main/bench2.py" target="_blank" rel="noopener noreffer"><code>bench2.py</code></a> and <a href="https://github.com/sspaeti/external-table-benchmark/blob/main/metadata_bench.py" target="_blank" rel="noopener noreffer"><code>metadata_bench.py</code></a>.</p>
<p><strong>Dashboard workload (hot path)</strong>: 3 queries × 10 repeats:</p>
<table>
  <thead>
      <tr>
          <th>Backend</th>
          <th>Tier</th>
          <th>Median</th>
          <th>p95</th>
          <th>vs internal</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Internal (DuckDB)</td>
          <td>hot</td>
          <td>23.8 ms</td>
          <td>235 ms</td>
          <td><strong>1.0×</strong></td>
      </tr>
      <tr>
          <td>DuckLake</td>
          <td>cold</td>
          <td>45.1 ms</td>
          <td>269 ms</td>
          <td>1.3×</td>
      </tr>
      <tr>
          <td>External Parquet</td>
          <td>cold</td>
          <td>41.3 ms</td>
          <td>271 ms</td>
          <td>1.4×</td>
      </tr>
      <tr>
          <td>External Iceberg</td>
          <td>cold</td>
          <td>56.1 ms</td>
          <td>377 ms</td>
          <td>1.7×</td>
      </tr>
  </tbody>
</table>
<p>Internal is fastest; external pays a 1.3×–1.7× tax. But for <strong>cold/archival queries</strong> (one-off, no warmup), all four backends answered in under 150 ms. The speed difference effectively vanishes for data you query once a week.</p>
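<p>For readers who want to reproduce numbers like these, here is a minimal sketch of how median and p95 can be derived from repeated runs. It is not the code from <code>bench2.py</code>: <code>time_query</code>, <code>p95</code>, <code>summarize</code>, and the placeholder workload are assumptions for illustration, and the percentile uses the simple nearest-rank definition.</p>

```python
import statistics
import time


def time_query(run_query, repeats=10):
    """Run a zero-argument callable `repeats` times; return timings in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000)
    return timings


def p95(timings):
    """Nearest-rank 95th percentile: the smallest value covering 95% of the samples."""
    ordered = sorted(timings)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def summarize(timings):
    """Reduce a list of timings to the two statistics used in the table."""
    return {"median_ms": statistics.median(timings), "p95_ms": p95(timings)}


# Placeholder workload standing in for a real query against a backend.
timings = time_query(lambda: sum(range(100_000)), repeats=10)
print(summarize(timings))

# Deterministic check of the statistics on known values.
print(summarize(list(range(1, 21))))  # {'median_ms': 10.5, 'p95_ms': 19}
```

<p>With only 30 samples, a nearest-rank p95 is effectively the second-slowest run, so a couple of cold first executions can pull it far above the median.</p>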
<p><strong>Storage cost</strong> is where external tables shine. Columnar Parquet is ~40% smaller than native DuckDB format. Ten TB of archive data costs roughly $125/month on S3 Infrequent Access or $10/month on Glacier Deep Archive, versus $230/month inside Snowflake on capacity pricing. This is the economic case external tables were invented for, and it still holds.</p>
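<p>The arithmetic behind these figures is easy to check. The per-GB prices below are assumptions based on public list prices at the time of writing; they vary by region and change over time, and the sketch ignores request and retrieval fees, which matter for archive tiers.</p>

```python
# Assumed list prices in USD per GB per month (check current pricing pages).
PRICE_PER_GB_MONTH = {
    "S3 Standard-IA": 0.0125,
    "S3 Glacier Deep Archive": 0.00099,
    "Snowflake capacity storage": 23 / 1000,  # about $23 per TB per month
}

ARCHIVE_TB = 10
ARCHIVE_GB = ARCHIVE_TB * 1000  # decimal terabytes

monthly_cost = {
    tier: round(price * ARCHIVE_GB, 2) for tier, price in PRICE_PER_GB_MONTH.items()
}
for tier, cost in monthly_cost.items():
    print(f"{tier}: ${cost:,.2f}/month")

# How much cheaper the coldest tier is compared to warehouse-managed storage.
ratio = (
    monthly_cost["Snowflake capacity storage"]
    / monthly_cost["S3 Glacier Deep Archive"]
)
print(f"Warehouse vs. Deep Archive: {ratio:.0f}x")
```

<p>Swap in your own prices and volume; the ratio between tiers is what drives the hot/cold decision, not the absolute numbers.</p>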
<p><strong>Metadata workload</strong> is where DuckLake stands out. Fifty single-row inserts showed DuckLake creating <strong>zero data files</strong> (rows inlined in the catalog) versus Iceberg&rsquo;s <strong>352 files</strong> (201 data + 151 metadata). That&rsquo;s the &ldquo;small file problem&rdquo; made concrete: at one write per second, Iceberg creates ~86,400 files per day needing compaction. DuckLake creates zero until you checkpoint. DuckDB Labs&rsquo; own benchmarks report up to <a href="https://ducklake.select/2026/04/02/data-inlining-in-ducklake/" target="_blank" rel="noopener noreffer">926× faster queries</a> on streaming workloads.</p>
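<p>As a quick sanity check on the small file numbers, the following sketch redoes the arithmetic. The lower bound assumes exactly one new file per write; the second figure is a naive linear extrapolation of the measured ratio from the fifty inserts, which is an assumption and not something the benchmark measured over a full day.</p>

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Numbers reported in the benchmark above.
INSERTS = 50
ICEBERG_FILES = 352  # 201 data + 151 metadata

# Lower bound: exactly one new file per write, at one write per second.
files_per_day_lower_bound = SECONDS_PER_DAY * 1

# Naive linear extrapolation of the measured files-per-insert ratio.
files_per_insert = ICEBERG_FILES / INSERTS
files_per_day_extrapolated = ICEBERG_FILES * SECONDS_PER_DAY // INSERTS

print(files_per_day_lower_bound)   # 86400
print(round(files_per_insert, 2))  # 7.04
print(files_per_day_extrapolated)  # 608256
```

<p>Either way, the order of magnitude is what matters: without compaction or inlining, a steady trickle of small writes turns into hundreds of thousands of files within days.</p>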
<h2 id="so-should-you-use-external-tables">So Should You Use External Tables?</h2>
<p>So after all this, should you use external tables today? After seeing how sticky they&rsquo;ve been since Oracle 9i in 2001, how they keep getting re-added to newer tools (Snowflake in 2021, Databricks Unity Catalog, BigLake in 2022), and how their core benefit, accessing data where it lives without moving it via a simple DDL statement, has only grown more valuable as formats have evolved from CSV to Parquet, JSON, Avro, and now open table formats, I&rsquo;d say yes. But choose wisely based on your data&rsquo;s temperature: use internal storage for hot data, such as dashboards and frequently used queries.</p>
<p>Use external tables for cold data, archival workloads, and ad-hoc exploration, where the speed gap vanishes and storage costs plummet (up to 20× cheaper on Glacier Deep Archive vs. warehouse-managed storage). And if you already use dbt, DuckDB, or a lakehouse stack, the modern versions are right there. Where they&rsquo;re the <em>wrong</em> choice is the inverse: transactional workloads, queries that need sub-second latency on every run, or data so small that the operational overhead of an external stage outweighs the benefit of not loading it.</p>
<p>The evolution is worth naming explicitly: &ldquo;read CSVs on disk&rdquo; → &ldquo;read Parquet on HDFS&rdquo; → &ldquo;read Parquet on S3 via a metastore&rdquo; → &ldquo;read Iceberg/Delta tables with ACID on S3&rdquo; → &ldquo;the Iceberg table <em>is</em> the warehouse table&rdquo;. Each step kept the core idea (data stays where it lives, metadata describes it, SQL queries it) and added database semantics back in. With open data catalogs, the warehouse becomes a <strong>stateless rental over a bucket you own</strong>, and external tables are increasingly managed. DuckLake demonstrates this best: when the catalog has SQL-DB-like guarantees, the distinction between &ldquo;external&rdquo; and &ldquo;internal&rdquo; dissolves. The metadata benchmark made this concrete by reading a single indexed row rather than walking a manifest tree.</p>
<p>The <strong>database semantics are returning</strong> with DuckLake, managed Iceberg, and predictive optimization, all of which reintroduce RDBMS-style guarantees to the lake. The cycle from &ldquo;external table for cheap storage&rdquo; to &ldquo;external table as a full ACID database on S3&rdquo; took 25 years, completing the journey back to database principles while maintaining the separation of storage and compute. You can say <strong>the modern external table isn&rsquo;t external anymore</strong>. DuckDB reads them directly, and DuckLake handles the metadata that multifile lakehouse architectures would otherwise drown in. The lesson from history is that whenever someone tries to replace the pattern, reading data in place still beats moving it. And the Lindy Effect suggests that if external tables have lasted 25 years and keep getting re-added, they&rsquo;ll persist another 25. They&rsquo;re probably not going anywhere. 🙂</p>
<hr>
<pre class=""><em>Full article published at <a href="https://motherduck.com/blog/internal-vs-external-storage-whats-the-limit-of-external-tables/" target="_blank" rel="noopener noreferrer">MotherDuck.com</a> - written as part of <a href="/services">my services</a></em></pre>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>A so-called loader that lets you access the data via a driver: see the ORACLE_LOADER Access Driver example: <a href="https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sutil/oracle_loader-access-driver.html" target="_blank" rel="noopener noreffer">https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sutil/oracle_loader-access-driver.html</a>&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Also see the latest release notes of BigQuery from April 2026; lots of it has to do with &ldquo;external catalogs&rdquo; and also BigQuery Apache Iceberg external tables now support Iceberg version 3: <a href="https://docs.cloud.google.com/bigquery/docs/release-notes#April_21_2026" target="_blank" rel="noopener noreffer">https://docs.cloud.google.com/bigquery/docs/release-notes#April_21_2026</a>&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>Customers can also choose to create materialized views on external tables to speed up the query performance significantly.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
</item>
<item>
    <title>AI Reveals Why BI Still Matters</title>
    <link>https://www.ssp.sh/blog/bi-is-not-dead-2026/</link>
    <pubDate>Tue, 21 Apr 2026 08:41:06 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/bi-is-not-dead-2026/</guid><enclosure url="https://www.ssp.sh/blog/bi-is-not-dead-2026/featured-image.jpg" type="image/jpeg" length="0" /><description><![CDATA[<p>Ask a BI engineer what they actually spend their time on: it&rsquo;s not building dashboards. More often: fixing the join that broke in the overnight pipeline, untangling the metric definition that means three different things to three different teams, or getting last week&rsquo;s numbers into an Excel by Monday morning. The dashboard was always the easy part.</p>
<p>This article looks at how BI evolved, how dashboards are actually used today, and what survives when AI enters the picture — starting with the foundation that was never really about dashboards in the first place, and ending with the problem nobody in the AI hype cycle wants to talk about: who maintains it all.</p>
<h2 id="the-verdict-of-people-in-the-field-bi-is-dead-again">The Verdict of People in the Field: BI is Dead (Again)</h2>
<p>We&rsquo;ve heard it all: business intelligence (BI), and especially dashboards, are dead. Yet every time, we rediscover its power, and it is resurrected whenever we need grounded data analysis, in enterprises and startups alike. It&rsquo;s the same way Excel never dies, and Excel is arguably still the most used BI tool.</p>
<p>If we look at what others from the data world say, it does sound similar. Hex says that <a href="https://hex.tech/blog/dashboards-were-never-the-destination/" target="_blank" rel="noopener noreffer"><strong>dashboards were never the destination</strong></a> and specifies:</p>
<blockquote>
<p>Static reporting surfaces were always a workaround for messy data and limited tooling — agentic analytics finally closes the gap between visibility and genuine insight.</p>
</blockquote>
<p>In other words, dashboards were a workaround created because data was messy, tooling was restrictive, and asking open-ended questions of the warehouse wasn’t possible without coding. Hex continues that dashboards create more questions than they answer.</p>
<p>But they also say: respect dashboards, but stop treating them as the goal:</p>
<blockquote>
<p>Dashboards still matter. They’re <strong>excellent for reporting KPIs</strong>, surfacing <strong>operational signals</strong>, and <strong>aligning leaders</strong> around shared metrics. They allow data teams to be creative in how they display these metrics and will remain a primary surface for need-to-know numbers.</p>
</blockquote>
<p>Dashboards don&rsquo;t reason or explain why something happens, as the Hex article says. Hex argues that <strong>reliable, timely, context-aware answers</strong> are the destination.</p>
<p>Mike, CEO of Rill, <a href="https://www.linkedin.com/posts/medriscoll_agents-are-blind-they-cant-see-dashboards-activity-7442685222657277952-LiRw/" target="_blank" rel="noopener noreffer">says</a>:</p>
<blockquote>
<p><strong>Agents are blind</strong>—they can’t see dashboards. But they do need access to the primitives behind them. AI, and agentic working won’t kill BI, but it will make dimensional modeling, metrics, OLAP cubes, query performance, and governance more important than pretty charts. Agents will reveal what many of us have known for a long time: <strong>BI was never about dashboards</strong>.</p>
</blockquote>
<p>Benn Stancil was already saying in 2021 that <a href="https://benn.substack.com/p/is-bi-dead" target="_blank" rel="noopener noreffer">BI is dead</a>, drawing parallels to the Salesforce &ldquo;End of Software&rdquo; declaration in 2000. This is interesting because it&rsquo;s almost the same statement we hear today with AI: no more software developers needed. Benn argues that the original BI stack was one tool across the stack, but that by 2021, tools were unbundling into <strong><a href="https://benn.substack.com/p/is-bi-dead" target="_blank" rel="noopener noreffer">dedicated and specialized tools for each layer</a></strong>.</p>
<p>The proposed future of BI should focus <em>only</em> on data consumption for humans, integrating both self-serve applications and deep ad-hoc analysis. This new BI would be &ldquo;legless&rdquo; (the opposite of headless, he argued back then), relying on global governance layers rather than proprietary semantic layers, and fostering cross-functional collaboration.</p>
<p>The aim is a universal tool for all data consumption, moving beyond its current diminished state of &ldquo;visualization and reporting&rdquo;. Again, very similar to today&rsquo;s discussion, where everything is about semantics and context. Benn concluded that &ldquo;<strong>BI is dead, long live BI</strong>&rdquo; 🙂.</p>
<p>Take any modern data and analytics discipline, and you’ll probably find it has its roots in the work that has been historically carried out by Business Intelligence developers, the OG jack-of-all-trades of the data industry.</p>
<blockquote>
<p>[!note] Find more Opinions and Articles</p>
<p>Other opinions on the world wide web, whether <a href="https://www.strategy.com/software/blog/bi-is-dead-long-live-business-intelligence" target="_blank" rel="noopener noreffer">BI is dead. Long live business intelligence.</a>, or a discussion on Reddit about <a href="https://sh.reddit.com/r/analytics/comments/1nkc9gt/will_business_intelligence_skills_bi_be/" target="_blank" rel="noopener noreffer">Will Business Intelligence skills (BI) be irrelevant in like 3-4 years? : r/analytics</a> or <a href="https://sh.reddit.com/r/dataengineering/comments/1s9gd8f/ai_kill_bi/" target="_blank" rel="noopener noreffer">AI kill BI?</a>.</p>
</blockquote>
<h2 id="bi-was-never-just-about-dashboards">BI Was Never Just About Dashboards</h2>
<p>With respect to Benn&rsquo;s article in 2021, have we come full circle five years later, with the end of software engineering again, and everyone demanding semantics and metrics layers?</p>
<p>At least on the surface, it seems we are at the same point, but today we&rsquo;re going back to one &ldquo;Original BI&rdquo; stack as drawn in the image in the article, to a fully encompassed data platform. Maybe it does not need to be one single platform, but at least to the user it needs to be a <strong>single chat or AI interface</strong>, that goes end-to-end through all the layers of discovery, visualization, transformation, storage, and ingestion—or in other words, the full data engineering lifecycle.</p>
<p>So what about dashboards then? Many declare the death of many things, and dashboards are a popular target too. That&rsquo;s even more true when AI, with its generative capabilities, can just one-shot your whole dashboard and create a full-blown web app or custom HTML page. And dashboards might die, as many <strong>don&rsquo;t actually want the dashboard</strong> itself, but the extracts from your large SAP, linked to the right customers from the CRM, enhancing the decision-making process even more.</p>
<p>They want the insights from the combination of all source data and business insights that your company has over its competitor, for example. It&rsquo;s the <strong>primitives behind the dashboards</strong> that matter more.</p>
<h4 id="when-you-still-need-a-dashboard-ai-chat-is-not-enough">When You Still Need a Dashboard (AI Chat is Not Enough)</h4>
<p>Even though a chat interface or an agent can provide you with <strong>dashboard information tailored to your question</strong> and in an explanatory written form, there&rsquo;s still a need for dashboards in certain situations.</p>
<p>The obvious one is the well-crafted <strong>operational dashboard</strong>, where you can see your whole company performing in a split second by looking at a single, highly dense dashboard with individual charts and visualizations tailored to convey information about each sub-area in the best possible way. It&rsquo;s the same way a map is still needed in self-driving cars: for quick verification, to get an overview, or in case the car gets lost.</p>
<figure><a target="_blank" href="/blog/bi-is-not-dead-2026/car-cockpit.webp" title="">

</a><figcaption class="image-caption">A good example of an operational dashboard in a car, where you need the numbers at all times | <a href="https://x.com/kyleanthony/status/2042572700468511133" target="_blank" rel="noopener noreffer">Tweet</a></figcaption>
</figure>
<p>The other obvious benefit of the operational dashboard, or any general dashboard, is that people can agree on numbers, as they&rsquo;re looking at the same agreed-upon dashboard. Hence, the calculations are the same, unlike individual Excel sheets that each use different calculations.</p>
<p>The easy creation of multiple dashboards on the fly or chatting with AI to get insights resembles the <strong>old way of using local Excel files</strong>. Everyone is doing their own thing, with no alignment, governance, or broader verification.</p>
<p>Maps are a different type of dashboard that can&rsquo;t be replaced. Geospatial data shown on a map still beats text. Or a bar chart where you immediately scan the different proportions of stock in inventory across certain regions, compared to first having to analyze all numbers in a text-only chat reply.</p>
<h4 id="dashboards-as-a-sanity-check">Dashboards as a Sanity Check</h4>
<p>There are some less obvious reasons, too. One is the <strong>serendipitous discovery</strong> of anomalies and outliers: they pop up by accident in dashboards and are easy to spot visually, but hard to see in chats. The same goes for <strong>ad-hoc BI</strong>, when self-service users drill down into finer grains and more dimensions. A pivot table is a REPL for BI; that&rsquo;s not possible with chats.</p>
<p>Dashboards are also a lifeline to quickly check if the AI is hallucinating. What about <strong>determinism</strong>? Chat responses are not deterministic, and you might get different responses, hopefully with the same correct answer, but visualized differently because the model made different decisions the second time, or for a different user. AI agents are non-deterministic statistical models and probably always will be, so we need to bring more context and definitions to make outputs more consistent. One way is with a spec-driven development (SDD) first approach that helps define it more accurately each time.</p>
<p>The same applies to <strong><a href="https://www.rilldata.com/blog/data-modeling-for-the-agentic-era-semantics-speed-and-stewardship" target="_blank" rel="noopener noreffer">stewardship</a></strong>, reviewing and checking outcomes. As an example, consider self-driving cars such as Waymos: you don&rsquo;t need a map, but if the car is wrong or stuck, the first thing you&rsquo;d reach for is one.</p>
<p>But what does that mean for the future of BI, and what keeps BI afloat?</p>
<h2 id="the-primitives-of-a-dashboard-and-bi">The Primitives of a Dashboard (and BI)</h2>
<p>BI was and is never (only) about dashboards. BI is used for a combination of reasons. One of the most important is the <strong>metrics</strong> themselves, the business KPIs: defining your profit, defining what has to be deducted, how much taxes and salary payments are. All of it is ingrained in a single metric, or better, all of it including the hierarchy, the full <a href="https://www.youtube.com/watch?v=Dbr8jmtfZ7Q" target="_blank" rel="noopener noreffer">tree of metrics</a>, as it&rsquo;s never just one metric.</p>
<p>But the metrics tree with all its hierarchy is not all. We need <strong>speed to crunch</strong> the data extracted from the source SAP and CRM, cleansed and joined, and aggregated to the exact grain of the metric. No one has a single place to view their data end-to-end, which is why BI work is so needed: it integrates all sources into one API or database/warehouse for the business (or now an AI chat) to pull from. Queries over a CSV are easy. Fast queries over 1 TB of Parquet are hard.</p>
<p>This is where we need optimal data modeling and compute power to do it in seconds, best case sub-second. And when we have this, we need a good data architecture with the right tool to compress and compute that data. That&rsquo;s where we need some kind of <strong>OLAP</strong> or analytical database, optimized for analytical queries and doing aggregation on the fly based on different dimensions and grains.</p>
<p>All of these primitives are still needed, maybe even more so in times of generative AI, even if dashboards were to go away.</p>
<p>Lastly, all of this is under the umbrella of <strong>context</strong> and <strong>semantics</strong>. It&rsquo;s encoding the business-related processes into data-like artefacts to make reusable governed definitions of metrics like revenue, MAU (Monthly Active Users), ROAS (Return on Ad Spend), etc. And the right medium for that is metrics and a so-called metrics layer (or semantic layer).</p>
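<p>To make the idea of a metrics layer concrete, here is a minimal sketch. It assumes a hypothetical <code>campaign_facts</code> table and a tiny in-code registry (<code>METRICS</code> and <code>metric_sql</code> are names made up for illustration, not the API of any real semantic layer), and it uses SQLite only because it ships with Python. The point is that ROAS is defined once and then aggregated at whatever grain the consumer asks for.</p>

```python
import sqlite3

# Hypothetical governed metric definitions: one SQL expression per metric.
METRICS = {
    "revenue": "SUM(revenue)",
    "ad_spend": "SUM(ad_spend)",
    "roas": "SUM(revenue) / SUM(ad_spend)",
}


def metric_sql(metric, dimensions):
    """Compile a metric plus a list of dimensions into one aggregate query."""
    dims = ", ".join(dimensions)
    return (
        f"SELECT {dims}, {METRICS[metric]} AS {metric} "
        f"FROM campaign_facts GROUP BY {dims} ORDER BY {dims}"
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE campaign_facts "
    "(region TEXT, channel TEXT, revenue REAL, ad_spend REAL)"
)
conn.executemany(
    "INSERT INTO campaign_facts VALUES (?, ?, ?, ?)",
    [
        ("EU", "search", 400.0, 100.0),
        ("EU", "social", 200.0, 100.0),
        ("US", "search", 900.0, 300.0),
    ],
)

# Same definition, two grains: by region, then by region and channel.
by_region = conn.execute(metric_sql("roas", ["region"])).fetchall()
by_region_channel = conn.execute(
    metric_sql("roas", ["region", "channel"])
).fetchall()
print(by_region)
print(by_region_channel)
```

<p>Because the ratio is computed from the summed parts rather than averaged per row, it stays correct at every grain. In this sketch the metric and dimension names come from a trusted registry; a real layer would validate them before building SQL.</p>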
<h2 id="everyone-wants-to-build-dashboards-nobody-wants-to-maintain-them">Everyone Wants to Build Dashboards. Nobody Wants to Maintain Them.</h2>
<p>The elephant in the room that nobody talks about is that <strong>everyone wants to build. Nobody wants to maintain</strong>. This is the current AI phase we&rsquo;re in. Even more, nobody even wants to review tons of AI-generated code.</p>
<p>According to the story in the book <a href="https://www.amazon.com/Maintenance-Everything-Part-One/dp/1953953492" target="_blank" rel="noopener noreffer">Maintenance of Everything</a> by Stewart Brand, what makes people do maintenance is when they <strong>find joy</strong> in it and care for it.</p>
<p>This is greatly illustrated in the opening with the 1968 Golden Globe solo sailboat race, a dramatic contest of maintenance styles under life-or-death conditions, contrasting Robin Knox-Johnston&rsquo;s meticulous upkeep with Donald Crowhurst&rsquo;s neglect. Knox-Johnston, who truly <strong>enjoyed</strong> his boat <em>Suhaili</em>, reported mid-race that despite the brutal Southern Ocean ordeal, he was &ldquo;thoroughly enjoying himself&rdquo;. Decades later, he personally restored her, replacing every one of her 1,400 fastenings.</p>
<p>Stewart Brand also argues that you need to <strong>care</strong> about your product, and that&rsquo;s the key to good maintenance — a lesson drawn from <a href="https://en.wikipedia.org/wiki/Zen_and_the_Art_of_Motorcycle_Maintenance" target="_blank" rel="noopener noreffer">Zen and the Art of Motorcycle Maintenance</a> by Robert Pirsig (1974), a philosophical classic celebrated as bending the culture of the day toward honoring maintenance.</p>
<p>And one way to ease maintenance is to <strong>design maintenance-friendly</strong>. Like the <a href="https://en.wikipedia.org/wiki/Rolls-Royce_Silver_Ghost" target="_blank" rel="noopener noreffer">Rolls-Royce Silver Ghost</a>, which was engineered from the ground up for reliability and ease of upkeep.</p>
<p>So how do we apply these principles to BI? We need to create dashboards that are easy to maintain and have clear ownership, someone who cares. Otherwise, they lose long-term value, and we invest precious time in something short-term.</p>
<blockquote>
<p>[!quote] Quote by <a href="https://www.linkedin.com/posts/mehd-io_you-just-burned-1k-in-tokens-to-rebuild-share-7439331001383800832-ZVhN?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAABkA2pgBYM4xDO0z2ChYuxFhBfu4h7jp4Lo" target="_blank" rel="noopener noreffer">Mehdi</a><br>
You just burned $1k in tokens to rebuild the SaaS you refused to pay $29/month for.<br>
Great for the ego. Now maintain it for 3 years to break even.<br>
Yup, maintenance is a luxury these days.</p>
</blockquote>
<p>Generating 100s of cool but unused dashboards with AI is clearly working in the opposite direction. We can save a couple of bucks while no longer needing to maintain each custom-created dashboard. Indeed, <strong>maintenance is a luxury</strong> these days.</p>
<h4 id="self-maintaining-agents">Self-Maintaining Agents?</h4>
<p>The question is, do we need agents for the maintenance then? Agents would do the maintenance, so that we humans can create new, innovative solutions, do the creative things, and keep the hard parts of BI, such as getting requirements and verifying with the business. But what does maintenance even mean? What does <strong>changing the oil</strong> and checking the brakes mean for BI, or data pipelines?</p>
<p>It&rsquo;s not troubleshooting in case of an error; that&rsquo;s what agents already do and help a lot with, ideally fixing the errors themselves in a self-healing process. Maintenance is different. It means <strong>keeping up with the updates</strong> of your software and security, keeping the integration into your data platform current with upgraded, better-performing glue code, and preventing code from becoming legacy code.</p>
<p>All of this is maintenance, and it far outweighs the work of just creating the data pipeline or BI dashboard, probably at a ratio of around 8:2.</p>
<p>As Mike puts it well when we were discussing:</p>
<blockquote>
<p>Few can build a <strong>maintainable</strong>, scalable data infrastructure for surfacing trusted metrics. E.g. a digital platform like Coinbase isn&rsquo;t going to YOLO its internal reporting over billions of transactions. Even Claude has a usage-based billing portal, consumption metrics need to be precise, deterministic, and fast.</p>
</blockquote>
<p>The Basecamp owner similarly <a href="https://youtu.be/otvGsbeOdfc?si=2p9X7ILxJSsuJzpA&amp;t=950" target="_blank" rel="noopener noreffer">said</a>: AI is pushing back too little. Maybe it will be solved in the future with better models, but we need to live in the present, and in that present:</p>
<blockquote>
<p><strong>Agents don&rsquo;t finish beautiful, ergonomic, desirable software.</strong> They just don&rsquo;t. That human finishing at the end is not just necessary, it&rsquo;s essential.</p>
</blockquote>
<p>So the future is soon, but not yet. Back to dashboards and their maintenance: the hard part is not generating visualizations, but having metrics and a strong BI backend. Almost a unified data interface that has an agent with access to source, ETL, and dashboard.</p>
<h2 id="bi-as-code-one-solution-to-maintenance">BI-as-Code: One Solution to Maintenance?</h2>
<p>Is the solution to maintenance-friendly design maybe BI-as-Code?</p>
<p>BI-as-Code comes into play because declarative configs can be versioned and maintained, thereby avoiding the limitations of BI-as-clicks. Sure, it will not solve all problems, but having that descriptive state of A. your data infrastructure and B. your data pipelines and BI dashboards helps tremendously. In the event of an error or incorrect state, we can just roll back to the last versioned dashboard or infrastructure.</p>
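<p>As a minimal sketch of that rollback idea: the dashboard below is a plain declarative spec, and <code>DashboardRepo</code> is a hypothetical toy stand-in for Git, not a real BI-as-Code tool. The field names in the spec are invented for illustration.</p>

```python
import copy
import json


class DashboardRepo:
    """Toy stand-in for Git: keeps every committed dashboard spec."""

    def __init__(self):
        self.history = []

    def commit(self, spec):
        # Store a deep copy so later edits cannot mutate history.
        self.history.append(copy.deepcopy(spec))
        return len(self.history) - 1  # version number

    def current(self):
        return copy.deepcopy(self.history[-1])

    def rollback(self, version):
        # Rolling back is itself a new commit, so nothing is lost.
        return self.commit(self.history[version])


repo = DashboardRepo()
v0 = repo.commit(
    {"title": "Sales overview", "charts": [{"metric": "revenue", "by": "region"}]}
)

broken = repo.current()
broken["charts"][0]["metric"] = "revnue"  # a typo slips into the spec
v1 = repo.commit(broken)

v2 = repo.rollback(v0)  # restore the last known-good definition
print(json.dumps(repo.current(), indent=2))
```

<p>In practice the spec would live in a YAML or SQL file under version control, and the rollback would be a regular revert commit.</p>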
<p>The only thing that is hard to make reversible is the actual data, unless you use some kind of <a href="https://www.ssp.sh/blog/git-for-data-theory/" target="_blank" rel="noopener noreffer">Git for Data</a> workflow with LakeFS, Nessie, or others, or Open Table Formats with their time travel function.</p>
<p>BI-as-Code isn&rsquo;t the whole answer, but it&rsquo;s the right direction: making dashboards owned, <strong>versioned, and recoverable</strong>. Code can build the right level of <strong>abstractions</strong> for ETL, metrics queries (metrics SQL), and visualizations, where raw Python for ETL or D3 is too verbose and too brittle.</p>
<p>With agents, these abstractions come in handy once more. Agents work best with a clear interface like a CLI or API, and the abstraction provides just that, letting them tune things themselves through MCP or direct access to the declarative configurations. This is much of what <a href="https://www.rilldata.com/blog/bi-as-code-and-the-new-era-of-genbi" target="_blank" rel="noopener noreffer">GenBI</a> is all about. The question is what comes next: can agents take over the analyst role entirely, or how do we marry the two?</p>
<h3 id="building-bi-for-agents-not-humans">Building BI for Agents, Not Humans</h3>
<p>BI-as-Code allows agents to drive BI, or as Mike <a href="https://www.linkedin.com/posts/medriscoll_observability-cybersecurity-product-analytics-share-7444397754635968512-nZsM" target="_blank" rel="noopener noreffer">said</a>: &ldquo;AI drives <strong>compression of the data stack</strong>&rdquo;, meaning that observability, cybersecurity, product analytics, and BI are converging. A CEO recently asked him why he couldn&rsquo;t &ldquo;kill Tableau, Looker, DataDog, Grafana, and QuickSight&rdquo; in favor of a single system. In my opinion, it doesn&rsquo;t need to be a single tool, but it should feel like a single interface.</p>
<p>Most common today: a chat prompt that autonomously spawns ingestion, transforms data into marts, and surfaces a dashboard or web app, <strong>running end-to-end analytics</strong> without the user ever thinking about the layers underneath.</p>
<p>But building faster with AI alone won&rsquo;t get us there. <strong><a href="https://en.wikipedia.org/wiki/Amdahl%27s_law" target="_blank" rel="noopener noreffer">Amdahl&rsquo;s Law</a> still applies</strong>, as Jeff Dean rightly said in his <a href="https://www.youtube.com/watch?v=g8BuAtM3fp4" target="_blank" rel="noopener noreffer">talk</a> with NVIDIA&rsquo;s Bill Dally:</p>
<blockquote>
<p>An AI agent can run 50x faster, but if the tools it depends on were designed for human speed — slow query APIs, brittle CLIs, unversioned metrics — the overall gain collapses to 2-3x.</p>
</blockquote>
<p>That&rsquo;s why the primitives behind BI get <em>more</em> important as agents get faster, not less. OLAP needs to be faster, metrics need a reliable API, ETL needs to be composable. The bottleneck shifts from the model to the infrastructure it runs on.</p>
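<p>A quick way to see this is to plug numbers into Amdahl&rsquo;s Law: if only a share of the total time gets 50x faster, the overall speedup is 1 / ((1 - share) + share / 50). The shares below are hypothetical, chosen only to illustrate the shape of the curve.</p>

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when only part of the work gets faster (Amdahl's law)."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / factor)


AGENT_FACTOR = 50  # the agent itself runs 50x faster, as in the quote

# Hypothetical shares of total time spent in the agent rather than in tools.
for share in (0.5, 0.68, 0.9, 1.0):
    overall = amdahl_speedup(share, AGENT_FACTOR)
    print(f"agent share {share:.0%}: overall {overall:.2f}x")
```

<p>Under these assumptions, an overall gain of 2-3x corresponds to the agent accounting for only about half to two-thirds of the end-to-end time. The rest is spent waiting on tools that did not get any faster.</p>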
<p>And when agents do take over the analyst workflow, spawning parallel queries, discarding dead ends, surfacing the interesting slices, they&rsquo;ll expose something BI practitioners have known for years: <strong>the hard part was never the visualization</strong>.</p>
<p>It was always the semantics beneath — the governed metrics, the trusted definitions, and configs that were verified by an actual human being. Agents will just make that gap impossible to ignore.</p>
<h2 id="bi-primitives-are-infrastructure-for-ai">BI Primitives Are Infrastructure for AI</h2>
<p>Wrapping up, BI was never about dashboards. It was about making sense of a company&rsquo;s data, connecting the source data into something a human can understand, efficiently reusing existing metrics, and governing definitions.</p>
<p>The dashboard was just the visible surface. What survives the hype cycle, from the unbundling of the modern data stack to the rise of AI agents, are the primitives: <strong>metrics, semantics, ownership, trust</strong>.</p>
<p>The AI era doesn&rsquo;t kill that need. Agents hallucinate without a strong foundational semantic layer or verified human constraints. Non-deterministic chat interfaces collapse without business-wide, agreed-upon definitions. The maintenance problem doesn&rsquo;t disappear either when you generate faster. It compounds the problems and bottlenecks for senior engineers at a company.</p>
<p>BI-as-Code, versioned dashboards, and a governed interface aren&rsquo;t opposite to the AI future, but a necessary foundation that makes working with it easier, not only for AI systems but also for humans in the loop.</p>
<p>&ndash;</p>
<p>If you enjoyed this, here are some related reads that might interest you:</p>
<ul>
<li><a href="/blog/agentic-data-modeling/" rel="">Data Modeling for the Agentic Era: Semantics, Speed, and Stewardship</a></li>
<li><a href="/blog/self-service-bi-ai/" rel="">Has Self-Serve BI Finally Arrived Thanks to AI?</a></li>
<li><a href="/blog/bi-as-code-and-genbi/" rel="">BI-as-Code and the New Era of GenBI</a></li>
</ul>
<hr>
<pre class=""><em>Full article published at <a href="https://www.rilldata.com/blog/ai-reveals-why-bi-still-matters-hint-its-not-dashboards" target="_blank" rel="noopener noreferrer">Rilldata.com</a> - written as part of <a href="/services">my services</a></em></pre>
]]></description>
</item>
<item>
    <title>Specs Over Vibes: Consistent AI Results ft. Mark Freeman</title>
    <link>https://www.ssp.sh/blog/specs-over-vibes-interview-mark-freeman/</link>
    <pubDate>Wed, 08 Apr 2026 00:08:08 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/specs-over-vibes-interview-mark-freeman/</guid><enclosure url="https://www.ssp.sh/blog/specs-over-vibes-interview-mark-freeman/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>There&rsquo;s so much going on in the AI space, and how to work with AI agents is changing every day. Everyone is overwhelmed and almost numb from so many possibilities, yet you need to find a way to work with AI, not to get left behind, right?</p>
<p>You might use AI agents all day long, parallelizing them with AI orchestrators like Agent Teams, Gastown, tmux, git worktree, and AI-based IDEs, but in the end, you just coordinated an AI. You still have to learn what it created, understand it, check for hallucinations, and verify that it built the right thing. We&rsquo;ve all become senior reviewers, more exhausted than before, with less of the work that made this fun in the first place. Meanwhile, we are also more distracted than ever. No time to think, with Copilot, Grammarly, or something else constantly asking and suggesting.</p>
<p>This series interviews real practitioners to give you the best tips on how they use AI in their data work today, extracting as many patterns behind them as possible. The article is structured in four parts: <strong>(1)</strong> how Mark is using AI, <strong>(2)</strong> what he has learned working with it, <strong>(3)</strong> what he is specifically using it for, and <strong>(4)</strong> what he thinks about AI in general and the future.</p>
<h2 id="introducing-the-guest-1-mark-freeman">Introducing the Guest: #1 Mark Freeman</h2>
<p>The start of this series is none other than <a href="https://www.linkedin.com/in/mafreeman2/" target="_blank" rel="noopener noreffer">Mark Freeman</a>. He is currently the Head of DevRel, Employee 1 and GTM at Gable. Mark has gone through three career roles as clinical researcher, data scientist, and data engineer, which is helping him greatly in his current position to navigate the unknown of generative AI. We&rsquo;ll go more into it later.</p>
<p>Mark has also co-authored a book with O&rsquo;Reilly about <a href="https://www.amazon.com/Data-Contracts-Developing-Production-Grade-Pipelines/dp/109815763X" target="_blank" rel="noopener noreffer">Data Contracts</a> (with Chad Sanderson and B.E. Schmidt), and is helping build Gable with the best possible data flows and data quality for enterprises.</p>
<blockquote>
<p>[!abstract] TLDR</p>
<p>To set the stage, in this interview we talk about how to use a Spec-Driven Development workflow with Claude Code and agent teams to produce high-quality, reproducible outcomes. We cover Mark&rsquo;s use of ExcaliDraw diagrams and JSON schemas to define requirements upfront, how he parallelizes agents with tmux to compare outputs, why AI benefits senior engineers more than juniors, and where he sees data engineering heading.</p>
</blockquote>
<h2 id="how-marks-using-ai">How Mark&rsquo;s Using AI</h2>
<p>Let&rsquo;s start with the general setup Mark uses when working with AI, and how he uses generative AI.</p>
<h3 id="how-mark-changed-his-ai-workflows">How Mark Changed His AI Workflows</h3>
<p>I asked him: &ldquo;Since you&rsquo;re building a company in the data contract and quality space and have written a book, how has working with AI changed how you use AI at work?&rdquo;</p>
<p>Mark has been in the data space since 2018, first as a clinical research analyst and, from 2019, as a data scientist. In 2022 he shifted over to data engineering, and in 2023 joined Gable to solve the problem of applying data contracts. He was very early in NLP with the <a href="https://web.archive.org/web/20211024133146/https://humu.com/blog/gain-clarity-and-context-about-what-matters-most-for-your-teams" target="_blank" rel="noopener noreffer">major ML project</a> he worked on back in 2021.</p>
<p>He remembers the early days in 2023, when ChatGPT hallucinated and he used generative AI for the first time, very much as a chat-window <em>co-coding companion</em>, asking it architecture questions and general questions about the code at hand. Fast forward to <strong>2024 and 2025</strong>: he was generating more code, not full programs and projects but code <em>by function</em>, trying to narrow down the scope.</p>
<p>And then in late 2025, <strong>Claude Code</strong> came around, and <em>changed the game</em> with better models that could autonomously solve problems for a longer period. And at the same time, everyone provided more CLIs to empower the CLI-first workflow of Claude. Mark started building by giving it instructions, pointers to docs, schema, etc., and letting it independently build data-related work and go fully agentic.</p>
<h3 id="marks-spec-driven-workflow">Mark&rsquo;s Spec Driven Workflow</h3>
<p>Mark has settled on an approach that works very well and helps him create reproducible outcomes. Instead of jumping to solutions, he focuses on how the tool should work, relentlessly speccing and defining what he wants with the <a href="https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html" target="_blank" rel="noopener noreffer">Spec Driven Development (SDD)</a> approach, inspired by <a href="https://substack.com/home/post/p-187866704" target="_blank" rel="noopener noreffer">Esco Obong</a> and how he used it at Airbnb. He uses the GitHub-provided <a href="https://github.com/github/spec-kit" target="_blank" rel="noopener noreffer">spec-kit</a>, a toolkit that helps you get started with Spec-Driven Development.</p>
<p>I hadn&rsquo;t heard of it before. When I checked it out, I found it very well documented, and it integrates directly into Claude Code (and many other AI agents). That means you can use slash commands within Claude to define specs with the help of an existing git repo, including its docs and code, such as:</p>
<ul>
<li><code>/speckit.plan</code>: Execute the implementation planning workflow using the plan template to generate design artifacts.</li>
<li><code>/speckit.tasks</code>: Generate an actionable, dependency-ordered tasks.md for the feature based on available design artifacts.</li>
<li><code>/speckit.specify</code>: Create or update the feature specification from a natural language feature description.</li>
<li><code>/speckit.analyze</code>: Perform a non-destructive cross-artifact consistency and quality analysis across spec.md, plan.md, and tasks.md after task generation.</li>
<li><code>/speckit.clarify</code>: Identify underspecified areas in the current feature spec by asking up to 5 highly targeted clarification questions and encoding answers back into the spec.</li>
<li><code>/speckit.checklist</code>: Generate a custom checklist for the current feature based on user requirements.</li>
</ul>
<p>You can define these on a per-project basis, or have some of them defined as a general spec in your <code>~/.claude/</code> folder. The outcomes are Markdown files that hold dedicated specifications based on your goals, which can then be edited and updated as you iterate.</p>
<h3 id="working-product-focused">Working Product-Focused</h3>
<p>This helps Mark focus on product scenarios and <strong>predictable outcomes</strong> instead of vibe coding every piece and describing his principles from scratch each time, he continues.</p>
<p>He goes from ideation through specs to dedicated tasks. He likes to always start with an <a href="https://excalidraw.com/" target="_blank" rel="noopener noreffer">Excalidraw</a> diagram that captures the flow rather than the architecture or other overviews. For data schemas and interface definitions, he defines the data structure next to the relevant flow diagram. Because <a href="https://blog.mehdio.com/i/160121474/best-human-feedback-loop-with-excalidraw-and-cursor" target="_blank" rel="noopener noreffer">Excalidraw is JSON</a>, these can be easily integrated. The schema definitions describe accurately what&rsquo;s needed, based on stakeholder discussions and his own requirements.</p>
<p>He then passes that diagram back to Claude Code and iterates on the data model and his key assumptions. Mark takes a lot of time in this process. He will spend hours, days, or even weeks in this stage, updating and refining these specs, and giving clear and exact information about the data schema, the tools to use, and the architectural choices that, as a senior engineer, he knows he wants and needs.</p>
<p>This is also where years of experience make the difference.</p>
<h3 id="using-typescript-for-data-schema-enforcement">Using TypeScript for Data Schema Enforcement</h3>
<p>An interesting discovery Mark made came from using a programming language that was new to him: TypeScript. This is similar to Wes McKinney&rsquo;s <a href="https://wesmckinney.com/blog/agent-ergonomics/" target="_blank" rel="noopener noreffer">From Human Ergonomics to Agent Ergonomics</a>, where he states that &ldquo;Python Was Built for Humans, Not Agents&rdquo; and explains that he uses Go and Rust for agent work, as they are better languages for agents, with minimal dependencies and shorter/better types.</p>
<p>Mark ended up using lots of TypeScript, not because he was familiar with the language, but because it covers most of what his work, and that of a data engineer, requires: <strong>defining data types</strong>. Types can be enforced and quickly verified across the data pipeline, so errors surface before pipeline runtime. That saves a lot of time and raises quality.</p>
<h2 id="what-mark-has-learned-working-with-ai">What Mark Has Learned Working with AI</h2>
<p>Over the years, Mark has changed his workflow. In this part, he shows how he runs agents with tmux and how he reviews and checks the outcome.</p>
<h3 id="agent-parallelization-and-executing-them-teams-and-tmux">Agent Parallelization and Executing Them: Teams and Tmux</h3>
<p>Once the specs are in place, he has agents implement them using a Claude Code feature called <strong><a href="https://code.claude.com/docs/en/agent-teams#orchestrate-teams-of-claude-code-sessions" target="_blank" rel="noopener noreffer">Agent Teams</a></strong> (which can be activated in Claude&rsquo;s <code>settings.json</code> with <code>CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS</code>).</p>
<p>The cool thing about agent teams is that they let you coordinate multiple Claude Code instances working together. One session acts as the team lead, coordinating work, assigning tasks, and synthesizing results. Teammates work independently, each in its own context window, and communicate directly with each other.</p>
<p>Mark spawns multiple agents using iTerm2 and tmux, which I highly recommend for agent work (also check <a href="https://zellij.dev/" target="_blank" rel="noopener noreffer">Zellij</a> for an easier start), and the agent teams feature will automatically open the additional terminals in separate panes:</p>
<figure>
<a target="_blank" href="/blog/specs-over-vibes-interview-mark-freeman/claude-tmux-teams.png" title="Example from X">

</a><figcaption class="image-caption">Example from <a href="https://x.com/nummanali/status/2031477259689754734" target="_blank" rel="noopener noreffer">X</a></figcaption>
</figure>
<p>It shows Claude self-orchestrating its own team. Think of it as similar to <a href="https://github.com/steveyegge/gastown" target="_blank" rel="noopener noreffer">Gastown</a>, <a href="https://github.com/preset-io/agor" target="_blank" rel="noopener noreffer">Agor</a>, and other <a href="https://www.ssp.sh/brain/ai-orchestrators/" target="_blank" rel="noopener noreffer">AI orchestrators</a>, but integrated into Claude.</p>
<p>Mark&rsquo;s workflow with agent teams is deliberately outcome-focused rather than code-focused. Once the agents complete their run, he checks the result against the original specs and JSON schemas, not the code itself. The only thing that matters is whether the outcome does what was defined.</p>
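<p>As a rough illustration of checking an outcome against a spec rather than reading the code, here is a minimal Python sketch. The spec format (field name mapped to a Python type), the <code>ORDER_SPEC</code> fields, and the sample rows are all hypothetical stand-ins, not Mark&rsquo;s actual schemas or tooling; a real setup would more likely rely on a proper JSON Schema validator.</p>
<pre><code class="language-python"># Hypothetical spec: field name -> expected Python type (stand-in for a JSON schema).
ORDER_SPEC = {"order_id": int, "customer": str, "amount": float}


def check_record(record, spec):
    """Return a list of human-readable violations; empty means the record conforms."""
    problems = []
    for field, expected_type in spec.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    for field in record:
        if field not in spec:
            problems.append(f"unexpected field: {field}")
    return problems


def check_output(records, spec):
    """Map row index -> violations, only for rows that break the spec."""
    report = {}
    for index, record in enumerate(records):
        problems = check_record(record, spec)
        if problems:
            report[index] = problems
    return report


# Hypothetical pipeline output produced by an agent run.
agent_output = [
    {"order_id": 1, "customer": "acme", "amount": 19.99},
    {"order_id": "2", "customer": "globex"},
]
print(check_output(agent_output, ORDER_SPEC))
</code></pre>
<p>Only the second row is reported: its <code>order_id</code> is a string and <code>amount</code> is missing. The point of the sketch is that the verdict comes from the data the agents produced, not from reviewing how they produced it.</p>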
<h3 id="is-reviewing-code-still-needed">Is Reviewing Code Still Needed?</h3>
<p>The tough question was whether Mark still reviews code, especially when Claude can generate more of it in a minute than we can ever review. Mark said: &ldquo;<em>Not locally or on unimportant projects where I&rsquo;m exploring the limits and potential traps of these powerful tools.</em>&rdquo;</p>
<p>But for production pipelines, or when customers ask him specifically for his opinion, he said:</p>
<blockquote>
<p>Along with the wider industry, we are figuring out how to use AI safely at scale.</p>
</blockquote>
<p>At work, too, when you run mission-critical services such as those in a bank, you can&rsquo;t just vibe code something. It <strong>comes down to use cases</strong>, he said.</p>
<p>Besides use cases, he tried different ways of reviewing. First he tried a sophisticated process where the above agents would create PRs and he would then comment on these with improvements and changes. The agents would then read them and integrate the given feedback and continue the process. But even that workflow made him too much of a bottleneck. It wasn&rsquo;t scalable enough.</p>
<p>Mark searched for other ways to work with it.</p>
<h4 id="outcome-driven-reviews-and-starting-from-scratch-again">Outcome-Driven Reviews: And Starting from Scratch Again</h4>
<p>What he does now is assess outcomes instead. After all the rigorous time in speccing, he tests the result by running the pipeline, creating tests, or checking the code manually the old-fashioned way.</p>
<p>The key mindset shift here is that the first build is deliberately treated as throwaway. It&rsquo;s requirements exploration through building. You implement the spec once, learn what you got wrong, and expect to discard it.</p>
<p>That&rsquo;s why he tests the outcome. And once tested, he might have gotten new learnings that he could have only gotten through implementing or with actual tests. That&rsquo;s when he will feed these learnings back to the specs and update initial requirements, and <strong>start all over again</strong>, from scratch, letting the agent create a new outcome based on the updated specs. The cycle is: <code>spec → build → assess → improve spec/assumptions → repeat</code>.</p>
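<p>The cycle can be sketched in a few lines of Python. This is only a toy model under stated assumptions: <code>run_spec_loop</code>, <code>toy_build</code>, <code>toy_assess</code>, and the <code>REQUIRED</code> items are hypothetical names, the &ldquo;agent&rdquo; is a stand-in function, and a real spec is a set of Markdown files rather than a list of strings.</p>
<pre><code class="language-python">def run_spec_loop(spec, build, assess, max_rounds=5):
    """Repeat spec -> build -> assess -> improve spec until no findings remain.

    `build` turns a spec into an outcome (the agent run); `assess` returns a
    list of findings (empty when the outcome is acceptable). Both are
    placeholders supplied by the caller.
    """
    history = []
    for round_number in range(1, max_rounds + 1):
        outcome = build(spec)            # fresh build from scratch every round
        findings = assess(outcome)       # run the pipeline, tests, manual checks
        history.append((round_number, list(spec), findings))
        if not findings:
            return outcome, history
        spec = spec + findings           # feed the learnings back into the spec
    return None, history


# Toy stand-ins: the "agent" implements exactly what the spec lists, and the
# assessment reports every required item that is still missing.
REQUIRED = ["schema defined", "dedupe on order_id", "late-arriving data handled"]


def toy_build(spec):
    return set(spec)


def toy_assess(outcome):
    return [item for item in REQUIRED if item not in outcome]


outcome, history = run_spec_loop(["schema defined"], toy_build, toy_assess)
print(len(history), sorted(outcome))
</code></pre>
<p>In this toy run the first build is discarded, its two findings are folded back into the spec, and the second build passes. Note that each round rebuilds from the updated spec instead of patching the previous outcome.</p>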
<p>This gives him a deep, exact, almost deterministic iteration loop: he can re-run the agents with updated feedback and requirements and get the same or a similar outcome, plus the added updates. That works because of the structure that <em>spec-kit</em> delivers and the precise way he defines his requirements, which leaves less room for the agents to hallucinate different inputs end-to-end.</p>
<p>Hallucinations can still happen, but this approach has served him very well, with a high-quality output he can trust and a solid way to <strong>approach a complex problem</strong> with the help of agents.</p>
<p>If the outcome meets the quality he expected and it does what he wants, he goes to internal stakeholders to get feedback from them. And then the same process again, updating specs, fixing requirements errors or possible wrong assumptions, and off the agents go again.</p>
<h4 id="tests-and-quality-gates">Tests and Quality Gates</h4>
<p>Tests and QA he writes manually. This is another way to make sure the outcome meets his expectations. Most important is the value, he says:</p>
<blockquote>
<p>Value first, then outcome and then worry about other things</p>
</blockquote>
<p>If it&rsquo;s not turning out to be valuable to the stakeholders, he wants to avoid spending more time. That&rsquo;s why the agent iterations and building something &ldquo;quickly&rdquo;, with rigorous specs and definitions in place, worked well for him so far.</p>
<h3 id="senior-vs-junior-working-with-ai">Senior vs. Junior: Working with AI</h3>
<p>We move on to an interesting discussion of whether AI helps senior engineers or juniors more. Mark says (he also <a href="https://www.linkedin.com/posts/mafreeman2_the-main-reason-ai-agents-help-senior-developers-activity-7437907260837777408-dMk5?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAABkA2pgBYM4xDO0z2ChYuxFhBfu4h7jp4Lo" target="_blank" rel="noopener noreffer">wrote</a> about it) that <strong>AI helps senior engineers more</strong>, as seniors &ldquo;<em>understand the trade-offs of tech debt</em>&rdquo;.</p>
<p>He says further that in AI iterations, we move much faster, generating legacy code and architecture constructs in days and weeks, instead of years. If Mark iterates with the spec-driven design explained above, there are multiple different architectures generated, some of which might have been bad from the very beginning.</p>
<p>As a senior, he thinks that we can give the right guidance from the very beginning and exclude bad outcomes and early &ldquo;legacy code&rdquo;. No doubt, there will be code and architecture to be adapted, too, but if you <strong>lack experience</strong>, you basically have <strong>no chance of knowing</strong>.</p>
<h4 id="framework-and-architectures-are-for-the-experienced">Framework and Architectures Are for the Experienced</h4>
<p>Mark mentions that at Gable, he is building something from scratch. Let&rsquo;s say we are at iteration v4: deep technical architecture decisions come up, such as whether to choose an Apache Kafka infrastructure, whether to define your schema in JSON or Avro, or whether to use Parquet.</p>
<p>These decisions can only be made with experience. Sure, agents will give you a good middle ground, and with research they will potentially choose the right solution for the current problem. But how do you know what&rsquo;s the <strong>best solution for your given business problem</strong>? If you have built multiple data platforms and have seen many companies, you just know some of these things or have developed an intuition for what&rsquo;s needed.</p>
<p>In combination with the agents, that experience makes AI a much better tool for seniors than for juniors, who need to more or less blindly trust the assessments the agents made. The quality of the outcome depends on framework and architectural choices, and legacy code accumulates early if a big architectural component is chosen wrong.</p>
<p>Put differently, this knowledge works like a linter in an editor that knows things ahead of runtime: it can detect wrong choices directly.</p>
<h2 id="what-mark-is-using-ai-for">What Mark is Using AI for</h2>
<p>Besides the already discussed use cases of general workflow and reviewing outcomes, I asked him about how he uses AI at work, working with data contracts and the non-deterministic outcome of AI, for example.</p>
<h3 id="integrating-ai-into-data-contracts">Integrating AI into Data Contracts</h3>
<p>As an author of a book on data contracts, and working in the business, one of Mark&rsquo;s priorities is to somewhat safely use AI agents to either verify contracts or help define them, if in any way possible.</p>
<p>As data contracts are written definitions between two parties, mostly in YAML or JSON, they&rsquo;re a good medium to iterate on, where agents, humans, and all stakeholders can work on specs that can be versioned. Mark says his focus is on <strong><a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents" target="_blank" rel="noopener noreffer">evals</a></strong>, specifically for assessing how well an agent completes a specific task, built around Gable&rsquo;s products or internal workflows.</p>
<p>The main goal of evals is to know with more <strong>confidence</strong> whether what AI shipped is any good. Similar to stewardship in Master Data Management (MDM), where humans in the process verified whether data quality requirements were met, with AI generation we need a similar process at a faster pace.</p>
<p>That&rsquo;s also where he draws on his clinical background with an outcome-driven approach, comparing 200 observations from end-to-end coding agent simulations and assessing results against defined criteria. At Gable, they create a <em>Code Graph</em> that helps them get a skeleton view of the <strong>full data flow in code</strong>, without running any code. Connections, context, and business operations are expressed as code to be verified.</p>
<p>His hypothesis is that with agents at scale, we can gather datasets of behaviors such as logs of data pipelines, network logs, and other information such as <a href="https://objectways.com/blog/understanding-how-ai-agent-trajectories-guide-agent-evaluation/" target="_blank" rel="noopener noreffer">agent trajectories</a> and check based on them whether the data pipeline is compliant, like <a href="https://www.parloa.com/labs/research/ai-agent-testing/" target="_blank" rel="noopener noreffer">A/B testing AI Agents with a Bayesian Model</a>. This has yet to be proven, but the hypothesis is strong.</p>
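<p>To make the idea of comparing agent runs a little more concrete, here is a minimal Python sketch of a Bayesian A/B comparison. It is a generic Beta-Binomial model with uniform priors, not necessarily the model from the linked article or anything Gable runs, and the pass counts are made up for illustration.</p>
<pre><code class="language-python">import random


def probability_b_beats_a(pass_a, runs_a, pass_b, runs_b, draws=20000, seed=7):
    """Monte Carlo estimate of P(rate_b > rate_a) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior of a pass rate after k passes in n runs: Beta(1 + k, 1 + n - k).
        rate_a = rng.betavariate(1 + pass_a, 1 + runs_a - pass_a)
        rate_b = rng.betavariate(1 + pass_b, 1 + runs_b - pass_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


# Hypothetical counts: two agent setups, 200 simulated runs each.
print(probability_b_beats_a(pass_a=140, runs_a=200, pass_b=160, runs_b=200))
</code></pre>
<p>The result is the estimated probability that setup B has the higher true pass rate, which is easier to act on than two raw percentages, especially when the number of observations is small.</p>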
<h3 id="deterministic-and-non-deterministic-work-in-data-engineering">Deterministic and Non-deterministic Work in Data Engineering</h3>
<p>When asked about his thoughts on functional data engineering where usually jobs are reproducible and restartable with new logic/source data, and how he sees the <strong>determinism</strong> with AI work (which has a different outcome every time), he said something interesting.</p>
<p>He said <strong>non-determinism is a benefit</strong>. That&rsquo;s why the setup is specs written in markdown, combined with configs and JSON that are specific, providing precision and accuracy. If anything goes wrong or not according to plan in the generation phase, he can just change the specs and <strong>achieve this determinism</strong> by spec-driven development.</p>
<p>But running non-deterministically still has some disadvantages. That&rsquo;s why he still does tests and comparisons manually, and checks visually whether everything works when running the pipeline.</p>
<h2 id="what-mark-thinks-about-ai">What Mark Thinks about AI</h2>
<p>Here we discuss when not to use AI, how learning works with AI, and where things are heading.</p>
<h3 id="when-not-to-use-ai">When <em>not</em> to Use AI</h3>
<p>Starting with when he is <em><strong>not</strong></em> using AI, and when it&rsquo;s potentially cheaper or better to do it manually, his answer was:</p>
<blockquote>
<p>Requirements finding in an important project, again depends on use cases. For small non-personal projects, not a problem. But requirements need to be defined by stakeholders and come from a real problem</p>
</blockquote>
<p>Mark also mentioned key decisions for infrastructure code that needs to be <strong>stable and reliable</strong>. If he does use AI there, he spends much more time validating that the LLM suggestions are correct.</p>
<p>For content online, he noticed that the writing always comes off differently than he would have phrased it. He might give it his insights to check or get feedback, but not the actual writing part.</p>
<h3 id="how-do-you-see-learning-with-ai">How Do You See Learning with AI?</h3>
<p>There&rsquo;s also the danger of not learning new things, and getting overwhelmed with constant stimulation, potentially getting slightly addicted. I asked Mark if he sees a problem in using agents and LLMs that would prevent us from learning new things as we are just cruising on auto-pilot.</p>
<p>Yes, he agreed. He calls it: &ldquo;<em>Claude code slot machine</em>&rdquo;, or &ldquo;<em>Lab rat</em>&rdquo;. &ldquo;<em>Getting your dopamine hit beyond usefulness</em>&rdquo; is how he would phrase it. He also thinks that this addictive behavior doesn&rsquo;t exist by accident: in his view, it is intended to make us, the users, spend more tokens (ergo more money for the providers).</p>
<blockquote>
<p>[!note] Pseudo Work</p>
<p>Shipping lots of code with AI can feel like deep work, but if you&rsquo;re not learning in the process, it&rsquo;s pseudo work. <a href="https://www.ft.com/content/a8016c64-63b7-458b-a371-e0e1c54a13fc" target="_blank" rel="noopener noreffer">Problem-solving skills in adults are already declining</a>, and even studies showing short-term learning gains with AI find that <a href="https://www.nature.com/articles/s41599-025-04787-y" target="_blank" rel="noopener noreffer">beyond 8 weeks, the effect reverses as over-reliance sets in</a>.</p>
</blockquote>
<h3 id="the-future-of-cloud-vs-local-model">The Future of Cloud vs. Local Model</h3>
<p>My closing question was where things are heading, and whether self-healing data pipelines would be a thing. When some <a href="https://substack.com/home/post/p-189793289" target="_blank" rel="noopener noreffer">say</a> that &ldquo;Unironically, Rick Rubin is the future of work&rdquo; (where AI replaced a team of analysts, a strategist, a designer, a project manager, and a few weeks of work in minutes), the same goes for data analytics and data engineering.</p>
<p>Mark explains that when he was a data scientist, getting a nice histogram in Matplotlib or Seaborn took hours. Today he gets that for free, as an afterthought. He has built applications that pull leads from HubSpot, extend and aggregate data through RAG using APIs and pipeline logs, and for a board meeting just generate a static HTML page (with an export to CSV 😉). A <strong>custom-made visualization at your fingertips. That&rsquo;s the future</strong>, he says. Because below the visualization, there&rsquo;s a <strong>semantic model</strong> as the base. No one wants to open one more app, so based on well-defined semantics, AI can one-shot the visualization and integrate it into existing workflows.</p>
<p>On the local model side, another future he sees (and is exploring himself) involves models running on a dedicated machine while he&rsquo;s away. He said the future is not about how powerful the models are, but <strong>how many iterations</strong> your spec has gone through. You <strong>run them until they are correct</strong>. You can also use RAG techniques to augment the model with your own notes and <a href="https://code.claude.com/docs/en/skills" target="_blank" rel="noopener noreffer">skills</a>, one local model custom-made for you:</p>
<blockquote>
<p><strong>You can&rsquo;t compete on compute</strong>, but you can use the factor of time, iterating multiple versions for a specified problem, and choosing the best one. Exactly what clinical research is doing and what he learned in his early career comparing studies.</p>
</blockquote>
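<p>Iterating multiple versions and choosing the best one can be sketched as a simple scoring step. The following Python example is a sketch under assumptions: the run names, the metrics (<code>rows</code>, <code>null_rate</code>, <code>runtime_s</code>), and the criteria are hypothetical, and in practice the criteria would come from the spec.</p>
<pre><code class="language-python">def pick_best(candidates, criteria):
    """Score each candidate by how many criteria it meets and return the winner."""
    scored = []
    for name, outcome in candidates.items():
        score = sum(1 for check in criteria if check(outcome))
        scored.append((score, name))
    # Highest score wins; ties go to the alphabetically first name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[0][1], scored


# Hypothetical results of three independent agent runs on the same spec.
candidates = {
    "run-1": {"rows": 1000, "null_rate": 0.02, "runtime_s": 41},
    "run-2": {"rows": 1000, "null_rate": 0.0, "runtime_s": 55},
    "run-3": {"rows": 998, "null_rate": 0.0, "runtime_s": 90},
}

# Hypothetical acceptance criteria derived from the spec.
criteria = [
    lambda o: o["rows"] == 1000,       # no rows lost
    lambda o: o["null_rate"] == 0.0,   # no nulls in required fields
    lambda o: 60 >= o["runtime_s"],    # finishes within a minute
]

best, ranking = pick_best(candidates, criteria)
print(best, ranking)
</code></pre>
<p>Here <code>run-2</code> wins because it is the only run that meets all three criteria. The same pattern works whether the candidates come from a cloud model or from a local model that ran overnight.</p>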
<p>An interesting bleeding-edge area is running agents optimized for <strong>concurrency</strong>, chunking tasks and parallelizing them with smaller compute resources instead of one big model. <a href="https://www.linkedin.com/in/goabiaryan/" target="_blank" rel="noopener noreffer">Abi Aryan</a> is doing GPU research in exactly that field, and Mark recommends starting with this <a href="https://www.linkedin.com/posts/goabiaryan_%F0%9D%90%88%F0%9D%90%AD-%F0%9D%90%9A%F0%9D%90%A7%F0%9D%90%A7%F0%9D%90%A8%F0%9D%90%B2%F0%9D%90%AC-%F0%9D%90%A6%F0%9D%90%9E-%F0%9D%90%AD%F0%9D%90%A8-%F0%9D%90%A7%F0%9D%90%A8-%F0%9D%90%9E%F0%9D%90%A7%F0%9D%90%9D-%F0%9D%90%B0%F0%9D%90%A1%F0%9D%90%9E%F0%9D%90%A7-activity-7441123708452294656-AP00" target="_blank" rel="noopener noreffer">post</a>. While companies are paying 10x or more for cloud compute, local models with lots of iterations are increasingly feasible, and the economics are starting to make a strong case for them.</p>
<h2 id="next-interview">Next Interview</h2>
<p>I hope you enjoyed this interview with Mark. Huge thanks to Mark for taking the time to speak with me and for sharing his experience with all of us. Follow him on <a href="https://www.linkedin.com/in/mafreeman2/" target="_blank" rel="noopener noreffer">LinkedIn</a>, take his <a href="https://www.linkedin.com/learning/instructors/mark-freeman" target="_blank" rel="noopener noreffer">course on data quality</a>, and check out his <a href="https://www.amazon.com/Data-Contracts-Developing-Production-Grade-Pipelines/dp/109815763X" target="_blank" rel="noopener noreffer">book</a>, its <a href="https://github.com/data-contract-book/chapter-7-implementing-data-contracts" target="_blank" rel="noopener noreffer">repo</a>, and much <a href="https://shift-left.gable.ai/m/mark-landing" target="_blank" rel="noopener noreffer">more</a>.</p>
<p>There are three more interviews already lined up with great guests, so please share feedback, questions you might want to ask or just your experience on how to work with AI in the data space. We&rsquo;re all in this together, figuring it all out. The more we can learn from each other, what&rsquo;s important, and maybe also what&rsquo;s not, the better.</p>
<p>So stay tuned for the next interview.</p>
<hr>
<pre class=""><em>Full article published at <a href="https://motherduck.com/blog/specs-over-vibes-consistent-ai-results/" target="_blank" rel="noopener noreferrer">MotherDuck.com</a> - written as part of <a href="/services">my services</a></em></pre>
]]></description>
</item>
<item>
    <title>Building an Agent-Friendly, Local-First Analytics Stack with MotherDuck and Rill</title>
    <link>https://www.ssp.sh/blog/agentic-friendly-local-first-analytics-stack/</link>
    <pubDate>Tue, 07 Apr 2026 08:41:06 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/agentic-friendly-local-first-analytics-stack/</guid><enclosure url="https://www.ssp.sh/blog/agentic-friendly-local-first-analytics-stack/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>Imagine going from a 100-million-row dataset to an interactive analytics app with just a few prompts. What used to take hours or days can now be done in minutes by combining local-first databases and BI tools with an agentic coding workflow.</p>
<p>When Rill bet on YAML dashboards and CLI-first workflows in 2022, they weren&rsquo;t thinking about AI agents. Neither was MotherDuck when they built serverless DuckDB around the thesis that most data fits on a laptop. But it turns out that what is developer-friendly is also agent-friendly: both need readable code, fast engines, and deterministic semantics.</p>
<p>Times are shifting rapidly toward CLI-first development. You know that&rsquo;s true when even email and calendar get their own Google CLIs. So why not have CLIs for your business metrics too?</p>
<p>This is what Rill and MotherDuck provide, including excellent developer workflows with a local-and-CLI-first approach, focusing on a developer-friendly interface and empowering users. Both work great on local laptops but can easily scale to the cloud, backed by a serverless data warehouse.</p>
<p>The convergence of embedded analytics engines (DuckDB/MotherDuck), declarative BI-as-code (Rill), and AI agent protocols (MCP) is creating a new architecture for business intelligence, one where dashboards become code, code becomes agent-readable, and analysts shift from clicking to prompting. And with 75% of cloud data warehouse queries scanning less than 1 GB<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>, this is a great opportunity for agentic BI. In this article, we look at how to build agent-friendly, local-first analytics stacks with MotherDuck, Rill, and agents.</p>
<blockquote>
<p>[!note] End to end examples later in the article</p>
<p>Later in the article we go through three different examples of how this can work, including GitHub repos and code examples, if that is something that&rsquo;s of interest to you.</p>
</blockquote>
<h2 id="why-these-two-tools">Why These Two Tools?</h2>
<p>Let&rsquo;s start with why we use MotherDuck and Rill for <strong>agentic-first data tasks</strong>. As <a href="https://www.linkedin.com/in/cg1507/" target="_blank" rel="noopener noreffer">Ghanshyam Chodavadiya</a> from <a href="https://swym.ai/" target="_blank" rel="noopener noreffer">SWYM</a> says:</p>
<blockquote>
<p>[!quote] Quote on why SWYM uses Rill with MotherDuck for their AI-native media decision platform:</p>
<p>[..] Rill lets us <strong>encode business context</strong> directly into our BI layer. Combined with MotherDuck and the Rill MCP client, it gives us <strong>flexible data control</strong> while powering automatically generated client dashboards and <strong>AI-driven insights</strong>.</p>
</blockquote>
<p>Both MotherDuck and Rill use a sophisticated architecture that focuses on developer workflows and scales from local development with declarative configuration to cloud (e.g. with <code>rill deploy</code>, or <code>md:</code> instead of <code>md</code>) or even embeds into your data CI/CD or agents pipeline, very easily. All of these reasons make them suitable for modern data requirements, where we need to iterate quickly but still have a strong foundation.</p>
<h3 id="local-first-approach-with-duckdb">Local-first Approach with DuckDB</h3>
<p>Both tools start from a local-first approach with DuckDB as the foundation.</p>
<p>The <a href="https://www.inkandswitch.com/essay/local-first/" target="_blank" rel="noopener noreffer">Local-First principle</a>, explored by Ink &amp; Switch and its community, compares the strengths of local workflows meant for files, but it also applies to data workloads. Even more so with AI agents, which can read the context from these projects and easily build on it with the strong CLIs available on the command line.</p>
<p>Or take the <a href="https://motherduck.com/blog/small-data-manifesto/" target="_blank" rel="noopener noreffer">Small Data Manifesto</a> by MotherDuck, which says &ldquo;<strong>Think small, develop locally, ship joyfully</strong>&rdquo;. If you enjoy some of these principles, this data stack with DuckDB/MotherDuck as the warehouse or <a href="https://www.rilldata.com/blog/scaling-beyond-postgres-how-to-choose-a-real-time-analytical-database" target="_blank" rel="noopener noreffer">real-time analytics</a> storage and Rill as an interactive, fast, and beautiful BI tool will suit you well.</p>
<p>MotherDuck works seamlessly through the DuckDB CLI, whether you want to connect to their serverless database in the cloud (connect with <code>duckdb 'md:'</code>) or open a fully fledged notebook environment locally with <code>duckdb --ui</code> (<a href="https://duckdb.org/docs/stable/core_extensions/ui" target="_blank" rel="noopener noreffer">try it</a>).</p>
<p>With Rill&rsquo;s YAML-based dashboard and metrics layer, and a powerful CLI, you can transform any of your data into a blazingly fast dashboard locally (run <code>rill start</code>), or from anywhere with the data on MotherDuck. Let&rsquo;s explore both in more detail, and show how users use it, and provide an example for you to get your hands on.</p>
<h2 id="what-is-motherduck">What Is MotherDuck?</h2>
<p>Before we go into the hands-on examples, let&rsquo;s answer the question of what MotherDuck and Rill are. And what&rsquo;s the difference from DuckDB and what do they bring to the table?</p>
<p>In essence, MotherDuck is a DuckDB-powered cloud data warehouse that scales to terabytes with ease. Just as Turso hosts SQLite, MotherDuck hosts DuckDB in the cloud, serverless for you. MotherDuck has a <a href="https://youtu.be/xxCn7uhdDzw?si=RcBpqRAzZq0jiVHD&amp;t=215" target="_blank" rel="noopener noreffer">great relationship</a> with DuckDB Labs, the company behind DuckDB, and the <a href="https://duckdb.org/foundation/" target="_blank" rel="noopener noreffer">DuckDB Foundation</a>.</p>
<p>MotherDuck integrates well with DuckDB, but you can also run DuckDB without it: manage your own server, open some ports, and make it scale automatically when more queries come in. You&rsquo;d then need to build the orchestration that scales out, handles OOM, manages servers, and so on. MotherDuck provides all of that by simply pointing your local database at the cloud via <code>ATTACH 'md:'</code>, as opposed to reading directly from a local DuckDB database (<code>duckdb path/to/file/db.duckdb</code>) or Parquet files (<code>FROM 'nyc.parquet'</code> or <code>FROM read_parquet('test.not-ending-with-parquet')</code>) that only you have access to.</p>
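<p>From Python, the switch between local and cloud is mostly a matter of the connection string. The helper below is a hypothetical convenience function (the name <code>duckdb_target</code>, the default file name, and the <code>analytics</code> database are assumptions for illustration); only the <code>md:</code> prefix and local file paths come from the text above.</p>
<pre><code class="language-python">def duckdb_target(use_motherduck, local_path="local.duckdb", database=""):
    """Return the string to pass to duckdb.connect() or to the DuckDB CLI."""
    if use_motherduck:
        return f"md:{database}"   # "md:" with no database name is also accepted
    return local_path


print(duckdb_target(False))                        # local.duckdb
print(duckdb_target(True, database="analytics"))   # md:analytics

# With the duckdb package installed and a MotherDuck token configured:
# import duckdb
# con = duckdb.connect(duckdb_target(True))
</code></pre>
<p>Keeping the target behind one small function means the same pipeline code can run against a local file during development and against MotherDuck once the data should be shared.</p>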
<p>The simplest way is to initially upload data to MotherDuck <strong>once, and then have access to the data from anywhere</strong> (see example later in the article).</p>
<blockquote>
<p>[!example] How to Upload data to MotherDuck from Local Storage</p>
<p>There are several ways to sync a local DuckDB database to MotherDuck: <code>COPY FROM DATABASE</code>, <code>CREATE OR REPLACE DATABASE ... FROM '&lt;path&gt;'</code>, loading from Parquet files, or using Python scripts. You can find more details at <a href="https://github.com/sspaeti/sync-duckdb-to-motherduck" target="_blank" rel="noopener noreffer">sync-duckdb-to-motherduck</a>.</p>
</blockquote>
<p>Besides simply replacing local DuckDB with a data warehouse, MotherDuck has implemented a specific architecture called <a href="https://motherduck.com/docs/key-tasks/running-hybrid-queries/" target="_blank" rel="noopener noreffer">dual execution</a>. It&rsquo;s built on top of their novel 1.5-tier architecture, powered by <a href="https://duckdb.org/docs/stable/clients/wasm/overview" target="_blank" rel="noopener noreffer">WebAssembly (Wasm)</a>. Unlike the traditional 3-tier architecture, where every request travels between client and server, the 1.5-tier architecture can answer a request directly in the client (browser), reducing latency and network round trips.</p>
<figure>
<a target="_blank" href="/blog/agentic-friendly-local-first-analytics-stack/motherduck-architecture.png" title="Image from MotherDuck: For App Developers">

</a><figcaption class="image-caption">Image from <a href="https://motherduck.com/product/app-developers" target="_blank" rel="noopener noreffer">MotherDuck: For App Developers</a></figcaption>
</figure>
<p>Traditional applications are built on a 3-Tier Architecture, which requires several intermediary operations to run between the end user interface, server, and underlying database. MotherDuck’s <a href="https://motherduck.com/product/app-developers/#architecture" target="_blank" rel="noopener noreffer">1.5-tier architecture</a> has the same DuckDB engine running inside the user’s web browser and in the cloud.</p>
<p>Developers can move data closer to the user to create analytics experiences that run <a href="https://motherduck.com/blog/introducing-instant-sql/" target="_blank" rel="noopener noreffer">instantly</a>, while still scaling with MotherDuck as the backend. For details on how this works, check their CIDR paper on <a href="https://www.cidrdb.org/cidr2024/papers/p46-atwal.pdf" target="_blank" rel="noopener noreffer">DuckDB in the cloud and in the client</a>.</p>
<h3 id="what-does-the-dual-execution-do">What Does the Dual Execution Do?</h3>
<p>Since the initial paper, dual execution has evolved and now makes MotherDuck more than just &ldquo;DuckDB in the cloud&rdquo;. When you <code>ATTACH 'md:'</code> locally or on the web, you get a two-node distributed system that automatically routes query stages to wherever they run best.</p>
<p>An example using dbt: in your <code>profiles.yml</code>, you can define a <code>dev</code> target with DuckDB and a <code>prod</code> target with MotherDuck like this:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">your-project</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">target</span><span class="p">:</span><span class="w"> </span><span class="l">prod</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">outputs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">dev</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">duckdb</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">schema</span><span class="p">:</span><span class="w"> </span><span class="l">project_dev</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;path/locally.duckdb&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">threads</span><span class="p">:</span><span class="w"> </span><span class="m">1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1"># ...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">prod</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">duckdb</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">schema</span><span class="p">:</span><span class="w"> </span><span class="l">project</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;md:prod_project&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">threads</span><span class="p">:</span><span class="w"> </span><span class="m">1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1"># ...</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Smaller data gets processed locally with millisecond response times and, if needed, queries can extend into the cloud, using cross-environment joins that transfer only the necessary intermediate results. As a user, you won&rsquo;t notice a difference from using plain DuckDB.</p>
<p>This uses your laptop&rsquo;s power with DuckDB as a first-class compute node. As MotherDuck CEO Jordan Tigani put it: &ldquo;Laptops these days are extremely powerful and you can get answers in a handful of milliseconds, whereas if you had to ask a cloud service, the initial request wouldn&rsquo;t have even gotten there.&rdquo;</p>
<p>In a way, MotherDuck is a lightweight alternative to Spark for single-node or moderately-sized analytical workloads (it does not support distributed, multi-node processing on massive datasets), but it&rsquo;s far easier to set up, has no cluster management, and scales to terabytes. Without the setup cost or the operational burden for tasks that don&rsquo;t need the massive scale Spark provides, you get an out-of-the-box data warehouse that handles scale very conveniently for us users.</p>
<blockquote>
<p>[!example] One more advantage is multi-user collaboration</p>
<p>DuckDB is single-writer, and MotherDuck is what unlocks the &ldquo;multiplayer&rdquo; angle with its integrated notebooks (which are also available locally with <code>duckdb --ui</code>, but not shareable on the web).</p>
</blockquote>
<h2 id="why-rill">Why Rill?</h2>
<p>So what does Rill bring to the table, and why do they work so well with MotherDuck?</p>
<p>If MotherDuck gives you a data warehouse that feels like a local database, Rill <strong>gives you a BI tool that feels like a code editor</strong>, up and running from a single binary. The name Rill comes from an old English word meaning &ldquo;stream&rdquo;, and the tool ships with strong built-in templates that scaffold BI requirements in seconds. Both MotherDuck and Rill are built on the conviction that focusing on developer experience will help data teams implement great data solutions that not only work, but are fun.</p>
<p>Rill&rsquo;s core idea is simple: define your entire BI stack (data sources, SQL models, metrics (<code>total_revenue: sum(amount)</code>), dashboards) as YAML and SQL files in a Git repository. Start simply with one command (<code>curl https://rill.sh | sh</code>), run <code>rill start</code>, and you have a local development environment backed by an embedded DuckDB instance delivering sub-second queries. Push to Git, deploy to Rill Cloud, and your dashboards are live. The same declarative files, the same SQL, the same metrics, just a different runtime.</p>
<p><div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/t7Igf3JTflc?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>
<br>
<em>A quick video showcasing Rill&rsquo;s BI-as-Code power and how easily Claude Code can collaborate with your BI tools.</em></p>
<p>Moreover, this &ldquo;BI-as-Code&rdquo; approach turns out to be exactly what makes Rill a natural companion for <a href="https://www.rilldata.com/blog/data-modeling-for-the-agentic-era-semantics-speed-and-stewardship" target="_blank" rel="noopener noreffer">agentic workflows</a>. Because all artifacts for data and BI are defined declaratively and locally, any agent can use them for context and build autonomously on top. The user can still verify that everything is correct and works by quickly running the Rill CLI locally or in a CI/CD pipeline.</p>
<p>Rill embeds DuckDB under the hood. Connecting it to MotherDuck requires nothing more than a YAML connector config pointing at <code>md:my_database</code> with a token property such as <code>token: &quot;{{ .env.CONNECTOR_MOTHERDUCK_TOKEN }}&quot;</code>.</p>
<p>The SQL models are identical, meaning no syntax changes, no migration, no new query dialect to learn. The only thing that changes is where the data lives, but that has no effect on the user experience.</p>
<p>Rill was built around one idea:</p>
<blockquote>
<p>Instead of hiding business logic in a proprietary GUI that only humans can click through, you make it readable code that anyone, or anything, can openly read and reason about.</p>
</blockquote>
<p>Rill, with its YAML dashboards and CLI-first workflow, has positioned itself perfectly for working with AI agents. That wasn&rsquo;t planned or foreseen, but it turns out that context is king and that tools designed for developer simplicity offer exactly what agents need: <strong>readable definitions and fast deterministic engines</strong>.</p>
<p>The result is a stack where your metrics are a source of truth you can version, audit, and feed directly into an agent&rsquo;s context window, and where switching from a local DuckDB file to a serverless cloud warehouse is a one-line change.</p>
<h3 id="metrics-and-sql">Metrics and SQL</h3>
<p>We probably all agree that metrics are the key to codifying business knowledge. Defining them as code gives us the benefits of a software engineering approach (versioned, automatable, testable, etc.) and lets an agent automate metric creation, while we can still define complex metrics ourselves when needed. The result is an <strong>agentic-friendly environment</strong> where agents get concise context and collaborate with a domain expert as the human in the loop.</p>
<p>Rill provides an integrated <a href="https://www.rilldata.com/blog/why-you-need-a-sql-based-metrics-layer" target="_blank" rel="noopener noreffer">Metrics Layer</a>, and it is just a YAML file too. That means you can integrate it into notebooks or data apps easily, or use it within Rill to build multiple dashboards, conversational analytics, and canvas dashboards on top of a <strong>unified metrics layer</strong>.</p>
<p>Instead of integrating complex metrics multiple times in different dashboards, we can just reference the metrics layer. Besides the metrics, we need fast, instant responses, even more so when we let agents work autonomously. We can&rsquo;t wait 5-10 minutes for the answer to a <strong>simple question</strong> while the agents finish their research.</p>
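<p>As a rough illustration of &ldquo;define once, reference everywhere&rdquo;, the following Python sketch keeps each measure in a single dictionary and reuses it for two different breakdowns. It is not Rill&rsquo;s implementation: the <code>METRICS</code> dictionary, the <code>orders</code> table, and its rows are made up, and SQLite from the standard library stands in for DuckDB so the sketch runs without extra packages.</p>

```python
import sqlite3

# Hypothetical metrics layer: each measure is defined exactly once.
METRICS = {
    "total_revenue": "SUM(amount)",
    "order_count": "COUNT(*)",
}


def metric_query(metric: str, dimension: str, table: str = "orders") -> str:
    """Build the SQL for one metric grouped by one dimension."""
    expression = METRICS[metric]  # raises KeyError for an undefined metric
    return (
        f"SELECT {dimension}, {expression} AS {metric} "
        f"FROM {table} GROUP BY {dimension} ORDER BY {dimension}"
    )


def run(metric: str, dimension: str) -> list:
    """Run a metric against a tiny in-memory table of made-up orders."""
    con = sqlite3.connect(":memory:")  # SQLite stands in for DuckDB here
    con.execute("CREATE TABLE orders (country TEXT, channel TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("CH", "web", 10.0), ("CH", "shop", 5.0), ("US", "web", 20.0)],
    )
    rows = con.execute(metric_query(metric, dimension)).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    # Two "dashboards" reuse the same definition with different dimensions.
    print(run("total_revenue", "country"))
    print(run("total_revenue", "channel"))
```

<p>If the definition of <code>total_revenue</code> changes, both breakdowns pick it up, which is the property a shared metrics layer is meant to give you.</p>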
<blockquote>
<p>[!info] Metrics SQL: A language built on top of SQL</p>
<p>Additionally, Rill offers a dedicated query language that builds on the strengths of SQL, called <a href="https://www.rilldata.com/blog/data-modeling-for-the-agentic-era-semantics-speed-and-stewardship#metrics-sql-a-sql-based-semantic-layer" target="_blank" rel="noopener noreffer">Metrics SQL</a>. It&rsquo;s a SQL dialect designed for querying data from <a href="https://docs.rilldata.com/developers/build/metrics-view/what-are-metrics-views" target="_blank" rel="noopener noreffer">Metrics Views</a>.</p>
<p>I wrote more about it in <a href="https://www.rilldata.com/blog/data-modeling-for-the-agentic-era-semantics-speed-and-stewardship#metrics-sql-a-sql-based-semantic-layer" target="_blank" rel="noopener noreffer">SQL-Based Semantic Layer</a>; in short, it simplifies your SQL by letting you query metrics directly. To learn more about the philosophy behind it, listen to Mike, the CEO of Rill, in an <a href="https://www.youtube.com/watch?v=tEIQGgS4Zus" target="_blank" rel="noopener noreffer">insightful podcast</a> with Joe Reis, where they talk about the future of dashboards, agents, and navigating the new era of BI and analytics.</p>
</blockquote>
<h3 id="conversational-bi-rill-turns-dashboards-into-code-and-code-into-agent-readable-context">Conversational BI: Rill Turns Dashboards into Code (and Code into Agent-readable context)</h3>
<p>When looking at <a href="https://www.rilldata.com/blog/has-self-serve-bi-finally-arrived-thanks-to-ai" target="_blank" rel="noopener noreffer">Conversational BI</a> and its benefits, we can say that &ldquo;Conversations can generate code, and code generates insights&rdquo;. Code is the best abstraction, and with agents, we can easily make it available to non-programmers.</p>
<p><strong>Code is the right abstraction layer</strong> in most cases. But why? Because a hard-coded interface language or an API only lets you do what it was built for. With code (usually Python, or SQL in this case too), we can do much more: we can use all the functions of the language rather than only what the API implements. It&rsquo;s also easier to maintain and to automate.</p>
<p>RudderStack <a href="https://www.rudderstack.com/blog/ai-data-infrastructure-as-code/" target="_blank" rel="noopener noreffer">reinforced</a> this narrative from the infrastructure side:</p>
<blockquote>
<p>Most of today’s commercial data tools are designed for humans, not for automation. Their primary interfaces are web dashboards, which are convenient for analysts, but opaque to code.</p>
</blockquote>
<p>This means if we want agent tools to analyze our code base, we need to let them access our code, or in the case of Rill, declaratively defined dashboards, metrics in the metrics layer, and data sources.</p>
<p>We can do even more when we use a chat interface to interact, making it usable for humans again with <strong>natural language as the primary interface</strong>. That&rsquo;s where Rill offers an extensive integration with agent workflows, generating a dashboard based on existing sources, models, and defined metrics in the metrics layer:</p>
<figure><a target="_blank" href="/blog/agentic-friendly-local-first-analytics-stack/nyc-trips-ai.webp" title="">

</a><figcaption class="image-caption">Image of how easy it is to generate a dashboard based on an existing model. See also more at <a href="https://www.rilldata.com/blog/bi-as-code-and-the-new-era-of-genbi" target="_blank" rel="noopener noreffer">BI-as-Code and the New Era of GenBI</a></figcaption>
</figure>
<p>These features are also integrated as <strong>Conversational BI</strong>, letting you explore your business numbers through a natural-language chat interface. Think of Cursor-like agentic suggestions, but grounded in pre-defined metrics when you ask specific questions:</p>
<figure><a target="_blank" href="/blog/agentic-friendly-local-first-analytics-stack/conversational-bi.webp" title="">

</a><figcaption class="image-caption">Showcase of Conversational BI in Rill</figcaption>
</figure>
<p>Here&rsquo;s a full video of what you can do with it.</p>
<p>The chat interface provides charts that can be explored further or integrated into your dashboards. What I like most is that you can click on a result in the response (e.g. 1.) and it opens as a pivot table (2.), where you can dig deeper by adding more dimensions and metrics:</p>
<figure><a target="_blank" href="/blog/agentic-friendly-local-first-analytics-stack/rill-conversational-charts.webp" title="">

</a><figcaption class="image-caption">Showcase of Conversational BI in Rill</figcaption>
</figure>
<p>Interactive dashboards are created on the fly, ready to explore and open as a pivot table directly inside Rill. The links are clickable; here is a short video of how that looks in action:<br>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/3RXwd-1o66Q?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>
</p>
<blockquote>
<p>[!note] You can also use Claude integration with MCP</p>
<p>Rill&rsquo;s MCP can also be used, see <a href="https://youtu.be/ZmgVkKImxs8?si=ECh72nAtg1LcF6uy" target="_blank" rel="noopener noreffer">Showcase of Rill MCP with Claude Desktop - YouTube</a>. If Cursor is your preferred AI-based IDE, see <a href="https://youtu.be/Th5Krj14DCI?si=fE04Q7F_1pbCd_AM" target="_blank" rel="noopener noreffer">BI-as-Code and the New Era of GenBI Demo - YouTube</a>.</p>
</blockquote>
<p>Let&rsquo;s now look at different analytics use cases and implementations with these handy features combined with MotherDuck as the backend.</p>
<h2 id="motherduck--rill-in-action-three-examples">MotherDuck + Rill in Action: Three Examples</h2>
<p>We&rsquo;ll start with how to connect Rill to MotherDuck, then walk through two open-source examples you can try yourself, and finish with a real-world customer showcase.</p>
<h3 id="connecting-rill-to-motherduck">Connecting Rill to MotherDuck</h3>
<p>Since Rill <a href="https://docs.rilldata.com/developers/build/connectors/olap/motherduck" target="_blank" rel="noopener noreffer">already embeds</a> DuckDB, connecting to MotherDuck requires only a five-line YAML connector config with a token and an <code>md:</code> path. Add a <code>motherduck.yaml</code> to your <code>connectors/</code> folder:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">connector</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">driver</span><span class="p">:</span><span class="w"> </span><span class="l">duckdb</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">mode</span><span class="p">:</span><span class="w"> </span><span class="l">readwrite</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">token</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;{{ .env.CONNECTOR_MOTHERDUCK_TOKEN }}&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;md:my_database&#34;</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Compare that to a local DuckDB connector:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">connector</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">driver</span><span class="p">:</span><span class="w"> </span><span class="l">duckdb</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">dsn</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;my_database.duckdb&#34;</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>The main difference is <code>token</code> + <code>path: &quot;md:...&quot;</code> instead of <code>dsn</code> (the MotherDuck example also sets <code>mode: readwrite</code>). Set the token as an environment variable (see <a href="https://docs.rilldata.com/developers/build/connectors/olap/motherduck" target="_blank" rel="noopener noreffer">Rill docs</a> for details), and your SQL models, metrics, and dashboards work identically, whether the data lives on your laptop or in MotherDuck&rsquo;s cloud.</p>
<p>In 2025, Rill <a href="https://www.rilldata.com/blog/rill-in-review-top-features-that-shaped-2025" target="_blank" rel="noopener noreffer">significantly strengthened its native connectivity</a> to enable zero-copy, blazingly fast analytics without moving data, making this connection even more seamless.</p>
<h3 id="stack-overflow-developer-survey-zero-pipeline-analytics">Stack Overflow Developer Survey: Zero-Pipeline Analytics</h3>
<p>The simplest way to experience MotherDuck + Rill is with data you already have. Every free MotherDuck account ships with <code>sample_data.stackoverflow_survey.survey_results</code><sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>  — 600k+ professional developer responses from 2019-2024. No ETL needed.</p>
<p>The <a href="https://github.com/sspaeti/motherduck-rill" target="_blank" rel="noopener noreffer">motherduck-rill</a> project builds a complete analytics stack on this data: 4 SQL models (staging, technology usage, developer profiles, database analysis), 3 metrics views with 17+ measures, and 3 canvas dashboards — all as pure SQL + YAML in a Git repo. No Python, no orchestrator, no data pipeline.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">git clone https://github.com/sspaeti/motherduck-rill.git <span class="o">&amp;&amp;</span> <span class="nb">cd</span> motherduck-rill
</span></span><span class="line"><span class="cl">cp .env.example .env
</span></span><span class="line"><span class="cl"><span class="c1"># Add your MotherDuck token to .env</span>
</span></span><span class="line"><span class="cl">rill start
</span></span><span class="line"><span class="cl"><span class="c1"># Open http://localhost:9009</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>That&rsquo;s it. A few commands and you&rsquo;re exploring which databases are most desired in the US, which languages pay the highest salaries, or how AI adoption shifted across years, all backed by MotherDuck&rsquo;s serverless cloud.</p>
<p>Answering the question of &ldquo;Which databases are most desired in the US according to Stack Overflow&rdquo;:<br>
<a target="_blank" href="/blog/agentic-friendly-local-first-analytics-stack/stackoverflow-dashboard.webp" title="">

</a></p>
<p>The SQL models use standard DuckDB syntax throughout. For example, the <code>database_analysis</code> model unnests semicolon-separated survey responses into one row per database per relationship type (used, admired, desired), then the metrics view aggregates them with <code>COUNT(DISTINCT ResponseId)</code>. The same SQL runs locally via embedded DuckDB or via MotherDuck — the connector config is the only difference. This is what BI-as-Code looks like in practice.</p>
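<p>The unnest-then-count logic described above can be sketched in plain Python. This is an illustration of the technique rather than the project&rsquo;s actual model: the rows are made up, and the column names <code>used</code> and <code>desired</code> are assumptions, not the survey schema.</p>

```python
from collections import defaultdict

# Made-up rows; the column names are assumptions, not the survey schema.
RESPONSES = [
    {"ResponseId": 1, "used": "PostgreSQL;SQLite", "desired": "DuckDB"},
    {"ResponseId": 2, "used": "PostgreSQL", "desired": "DuckDB;PostgreSQL"},
    {"ResponseId": 3, "used": "", "desired": "DuckDB"},
]


def unnest(responses, relationships=("used", "desired")):
    """Yield one (ResponseId, relationship, database) row per listed database."""
    for row in responses:
        for relationship in relationships:
            for database in row.get(relationship, "").split(";"):
                if database:  # skip empty answers
                    yield row["ResponseId"], relationship, database


def count_respondents(rows):
    """Equivalent of COUNT(DISTINCT ResponseId) per (relationship, database)."""
    seen = defaultdict(set)
    for response_id, relationship, database in rows:
        seen[(relationship, database)].add(response_id)
    return {key: len(ids) for key, ids in seen.items()}


if __name__ == "__main__":
    for key, count in sorted(count_respondents(unnest(RESPONSES)).items()):
        print(key, count)
```

<p>Counting distinct respondents instead of rows matters because unnesting multiplies rows: one respondent who lists several databases would otherwise be counted several times in any total.</p>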
<p>The above example can be created from a couple of simple prompts, since Rill&rsquo;s definitions are all local files and the data is accessible in MotherDuck through the DuckDB CLI.</p>
<h3 id="multi-cloud-cost-analyzer-production-grade-connector-switching">Multi-Cloud Cost Analyzer: Production-Grade Connector Switching</h3>
<p>For a more production-like setup, I use the <a href="https://github.com/ssp-data/cloud-cost-analyzer" target="_blank" rel="noopener noreffer">cloud-cost-analyzer</a>, which I built in a previous edition to visualize costs from different hyperscalers with ClickHouse, and to which I have now added MotherDuck.</p>
<p>This project shows how MotherDuck fits into a real data pipeline alongside local DuckDB and ClickHouse. It uses the same <a href="https://dlthub.com/" target="_blank" rel="noopener noreffer">dlt</a> pipelines and the same Rill dashboards; I just added a different destination to dlt. Zero pipeline code changes were needed.</p>
<p>The key insight: MotherDuck uses DuckDB SQL syntax, so the SQL models share all functions with local DuckDB. Only the <code>FROM</code> clause differs: <code>read_parquet('...')</code> locally vs <code>schema.table</code> on MotherDuck. The Rill SQL models use a 3-way conditional to handle these small differences between connectors:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="err">{{</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">eq</span><span class="w"> </span><span class="p">.</span><span class="n">env</span><span class="p">.</span><span class="n">RILL_CONNECTOR</span><span class="w"> </span><span class="s2">&#34;motherduck&#34;</span><span class="w"> </span><span class="err">}}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">CAST</span><span class="p">(</span><span class="n">SPLIT_PART</span><span class="p">(</span><span class="n">identity_time_interval</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;T&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="nb">DATE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="nb">date</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">aws_costs</span><span class="p">.</span><span class="n">cur_export_test_00001</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="err">{{</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">eq</span><span class="w"> </span><span class="p">.</span><span class="n">env</span><span class="p">.</span><span class="n">RILL_CONNECTOR</span><span class="w"> </span><span class="s2">&#34;clickhouse&#34;</span><span class="w"> </span><span class="err">}}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">toDate</span><span class="p">(</span><span class="n">splitByChar</span><span class="p">(</span><span class="s1">&#39;T&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">identity_time_interval</span><span class="p">)[</span><span class="mi">1</span><span class="p">])</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="nb">date</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">aws_costs___cur_export_test_00001</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="err">{{</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="err">}}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">CAST</span><span class="p">(</span><span class="n">SPLIT_PART</span><span class="p">(</span><span class="n">identity_time_interval</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;T&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="nb">DATE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="nb">date</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">read_parquet</span><span class="p">(</span><span class="s1">&#39;data/aws_costs/cur_export_test_00001/*.parquet&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="err">{{</span><span class="w"> </span><span class="k">end</span><span class="w"> </span><span class="err">}}</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Notice how MotherDuck and local DuckDB share the exact same SQL functions (<code>CAST</code>, <code>SPLIT_PART</code>) — only the <code>FROM</code> source changes. ClickHouse needs its own dialect (<code>toDate</code>, <code>splitByChar</code>). This is the practical advantage of MotherDuck being DuckDB in the cloud: your SQL models stay the same.</p>
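<p>To make the dialect difference concrete, here is a minimal Python sketch that builds the same date-extraction query per engine. The helper names (<code>date_expr</code>, <code>build_query</code>) and the lookup table are hypothetical and not part of the project; only the SQL functions themselves come from the model above.</p>

```python
# Hypothetical lookup: SQL expression that extracts the date part of an
# interval string (everything before the 'T'), per engine dialect.
DATE_EXPR = {
    "duckdb": "CAST(SPLIT_PART({col}, 'T', 1) AS DATE)",
    "motherduck": "CAST(SPLIT_PART({col}, 'T', 1) AS DATE)",  # same dialect as DuckDB
    "clickhouse": "toDate(splitByChar('T', {col})[1])",
}


def date_expr(engine: str, column: str) -> str:
    """Return the engine-specific date expression for the given column."""
    if engine not in DATE_EXPR:
        raise ValueError(f"unsupported engine: {engine}")
    return DATE_EXPR[engine].format(col=column)


def build_query(engine: str, source: str, column: str = "identity_time_interval") -> str:
    """Assemble the SELECT; only the expression and the FROM source vary."""
    return f"SELECT {date_expr(engine, column)} AS date, *\nFROM {source}"


if __name__ == "__main__":
    print(build_query("clickhouse", "aws_costs___cur_export_test_00001"))
    print(build_query("duckdb", "read_parquet('data/aws_costs/cur_export_test_00001/*.parquet')"))
```

<p>The two DuckDB-flavored entries are identical on purpose: that is the point of the paragraph above, and the only thing a DuckDB-to-MotherDuck switch changes is the <code>source</code> argument.</p>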
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">git clone git@github.com:ssp-data/cloud-cost-analyzer.git
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> cloud-cost-analyzer
</span></span><span class="line"><span class="cl">uv sync
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 1. Add your MotherDuck token to .dlt/secrets.toml</span>
</span></span><span class="line"><span class="cl"><span class="c1">#    [destination.motherduck.credentials]</span>
</span></span><span class="line"><span class="cl"><span class="c1">#    password = &#34;eyJ...&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 2. Add token to viz_rill/.env</span>
</span></span><span class="line"><span class="cl"><span class="c1">#    CONNECTOR_MOTHERDUCK_TOKEN=&#34;eyJ...&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 3. Load data into MotherDuck (same pipelines, different destination)</span>
</span></span><span class="line"><span class="cl">make run-etl-motherduck
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 4. Start Rill dashboards backed by MotherDuck</span>
</span></span><span class="line"><span class="cl">make serve-motherduck
</span></span></code></pre></td></tr></table>
</div>
</div><p>Under the hood, <code>DLT_DESTINATION=motherduck</code> tells dlt to write to MotherDuck instead of local parquet files. The same data is then also visible in the <a href="https://app.motherduck.com/" target="_blank" rel="noopener noreffer">MotherDuck web UI</a> for ad-hoc querying alongside the Rill dashboards.</p>
<figure><a target="_blank" href="/blog/agentic-friendly-local-first-analytics-stack/motherduck-ui.webp" title="">

</a><figcaption class="image-caption">Showcase of querying same data with MotherDuck&rsquo;s Notebook UI.</figcaption>
</figure>
<p>This pattern — <code>make serve-motherduck</code> vs <code>make serve</code> vs <code>make serve-clickhouse</code> — shows what switching from local to cloud looks like in a CLI-first stack.</p>
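<p>The destination switch described above can be pictured with a small Python sketch. It only illustrates the idea of one environment variable selecting the backend; the function name, the default value <code>filesystem</code> (standing in for local parquet files), and the list of accepted values are assumptions, not the project&rsquo;s actual code.</p>

```python
import os

# Assumption: "filesystem" stands in for the local-parquet default.
DEFAULT_DESTINATION = "filesystem"
# Assumption: the set of destinations this sketch accepts.
KNOWN_DESTINATIONS = {"filesystem", "motherduck", "clickhouse"}


def resolve_destination(env=None) -> str:
    """Pick the pipeline destination from DLT_DESTINATION, falling back to local files."""
    env = os.environ if env is None else env
    name = env.get("DLT_DESTINATION", DEFAULT_DESTINATION).strip().lower()
    if name not in KNOWN_DESTINATIONS:
        raise ValueError(f"unknown destination {name!r}, expected one of {sorted(KNOWN_DESTINATIONS)}")
    return name


if __name__ == "__main__":
    # Same pipeline code, different destination depending on the environment.
    print(resolve_destination())
    print(resolve_destination({"DLT_DESTINATION": "motherduck"}))
```

<p>Keeping the choice in a single variable is what lets the pipeline code stay identical between a local run and a cloud run.</p>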
<h3 id="driotech-agentic-analytics-in-production">Driotech: Agentic Analytics in Production</h3>
<p>Beyond my own demos, <a href="https://www.youtube.com/watch?v=i7dHS0XxW8U" target="_blank" rel="noopener noreffer">Salomon from Driotech</a> showcased how they use Rill for agentic analytics with client data using MotherDuck in combination with Airbyte, dlt, BigQuery and dbt.</p>
<p>In his webinar on empowering businesses with <strong>agentic analytics</strong>, he walked through a B2B sales use case that highlights exactly the principles we discussed above.</p>
<p><div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/i7dHS0XxW8U?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>
<br>
<em>Video walkthrough with the example of Airbyte/dlt + BigQuery/MotherDuck/dbt + Rill.</em></p>
<p>His key takeaways align perfectly with the MotherDuck + Rill thesis:</p>
<ol>
<li><strong>Metrics as the foundation</strong>: Before any AI agent can work reliably, you need clearly defined KPIs and a single source of truth. &ldquo;If you feed any AI agent with a mess, you&rsquo;re going to end up with an even bigger mess.&rdquo; → This is exactly what Rill&rsquo;s metrics layer provides: versioned, agreed-upon definitions in YAML.</li>
<li><strong>Governance as guardrails</strong>: The agent doesn&rsquo;t hallucinate on business concepts because the semantic model constrains what it can query. When asked &ldquo;what are our shipping costs?&rdquo; → The agent looks up the exact metric definition instead of guessing from raw data.</li>
<li><strong>From reactive to proactive</strong>: Beyond just answering questions, the demo showed creating alerts (&ldquo;if customer orders drop to zero in 14 days, email the account manager&rdquo;) and scheduled reports → Pushing insights to stakeholders automatically.</li>
<li><strong>Code-defined dashboards</strong>: Salomon explicitly called out that Rill&rsquo;s code-based approach means &ldquo;AI agents are able to build the dashboards for us&rdquo; → Because the language of AI is code, and Rill&rsquo;s dashboards are just that.</li>
</ol>
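<p>The guardrail idea from the list above can be sketched in a few lines of Python: the agent may only resolve a question to a metric that is already defined, and anything else is rejected instead of guessed. The metric names, expressions, and the <code>resolve_metric</code> helper are made up for illustration and do not reflect Rill&rsquo;s API.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    expression: str
    description: str


# Hypothetical, agreed-upon metric definitions (the "single source of truth").
METRICS = {
    "shipping_costs": Metric("shipping_costs", "SUM(shipping_cost)", "Total shipping costs"),
    "order_count": Metric("order_count", "COUNT(*)", "Number of orders"),
}


def resolve_metric(requested: str) -> Metric:
    """Map a requested business term to a defined metric, or refuse."""
    key = requested.strip().lower().replace(" ", "_")
    if key not in METRICS:
        # Refusing is the guardrail: no improvising from raw columns.
        raise LookupError(f"unknown metric {requested!r}; defined metrics: {sorted(METRICS)}")
    return METRICS[key]


if __name__ == "__main__":
    print(resolve_metric("Shipping Costs").expression)
```

<p>The important design choice is the failure mode: an undefined term raises an error that a human can act on, instead of producing a plausible-looking number.</p>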
<p>The webinar reinforces a central point of this article: conversational BI can be used in Rill today. When you have a solid semantic foundation (metrics in Rill), a scalable backend (MotherDuck), and a code-first workflow, the agentic layer becomes practical, trustworthy, and something you can deploy to real users.</p>
<div class="details admonition note open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-note"></i>More Interesting Resources<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content"><ul>
<li>Upload a local DuckDB to MotherDuck in two lines: <code>ATTACH 'md:'; CREATE DATABASE my_db FROM 'local.duckdb';</code></li>
<li><a href="https://www.youtube.com/watch?v=tEIQGgS4Zus" target="_blank" rel="noopener noreffer">Dashboards vs. Agents podcast with Mike Driscoll and Joe Reis</a></li>
<li><a href="https://motherduck.com/blog/small-data-manifesto/" target="_blank" rel="noopener noreffer">The Small Data Manifesto</a> by MotherDuck</li>
<li><a href="https://youtu.be/10d8HxS4y_g?si=wYZKVTs5IMxXrxci" target="_blank" rel="noopener noreffer">Local-First Software mini-documentary</a> by CultRepo (previously Honeypot)</li>
</ul>
</div>
        </div>
    </div>
<h2 id="whats-next-dashboards-no-more">What&rsquo;s Next, Dashboards no More?</h2>
<p>Seeing these examples, and where we are heading with agentic BI, one might ask: <strong>do we even still need dashboards?</strong> Can&rsquo;t we just ask the chatbot?</p>
<p>I agree with Mike&rsquo;s <a href="https://www.youtube.com/watch?v=tEIQGgS4Zus" target="_blank" rel="noopener noreffer">argument</a>: &ldquo;No, well-crafted, and especially operational dashboards, they will never go away&rdquo;. A visualization conveys far more <strong>condensed information</strong> in a couple of seconds than a chat ever will; a chat would need many back-and-forth messages to get there.</p>
<p>There&rsquo;s also a difference between &ldquo;a&rdquo; dashboard and &ldquo;the&rdquo; dashboards. More than half of dashboards are exploratory in nature, built to quickly explore the business or the data. Then there are the sales dashboards that someone spent weeks or months, sometimes years at large companies, perfecting to ensure the numbers are correct. Key business decisions are derived from them, and even a sales rep&rsquo;s salary, based on how much they sold, is usually tracked in a dashboard.</p>
<p>One more thing that chatbots can&rsquo;t replace is <strong>&ldquo;drilling down&rdquo;</strong> and <strong>using data as a REPL</strong> with <a href="https://docs.rilldata.com/guide/dashboards/explore/pivot" target="_blank" rel="noopener noreffer">pivot tables</a>, which is why <a href="https://www.rilldata.com/blog/why-pivot-tables-never-die" target="_blank" rel="noopener noreffer">they never die</a>. You can drill down to the lowest details and back up to the aggregated data within seconds, simply by dragging and dropping dimensions and measures around.</p>
<blockquote>
<p>[!tip] People don&rsquo;t know what they are looking for<br>
Benn Stancil <a href="https://benn.substack.com/p/which-way-from-here" target="_blank" rel="noopener noreffer">argued</a>, &ldquo;the challenge with data exploration is not that people don&rsquo;t have the ability to manipulate data; it&rsquo;s that they don&rsquo;t know what they&rsquo;re looking for.&rdquo;</p>
</blockquote>
<h3 id="natural-language-interface-convenient-but-inaccurate">Natural Language Interface: Convenient, but Inaccurate?</h3>
<p>With agents, we can use natural language as an interface to input domain knowledge. The agent does the technical translation and implements it in a code-first, declarative format where the context is defined clearly and distinctly, far more so than in natural language, which carries lots of nuance and ambiguity.</p>
<p>With that approach, the agent writes deterministic YAML that can then be reviewed, tested, and automated against. We move from human to agents to context to iteration and, finally, visualization.</p>
<p>
As discussed in <a href="https://www.rilldata.com/blog/bi-as-code-and-the-new-era-of-genbi" target="_blank" rel="noopener noreffer">GenBI</a>, this lets us iterate much faster than the traditional, non-generative way:<br>
<figure><a target="_blank" href="/blog/agentic-friendly-local-first-analytics-stack/genbi-workflow-prompt-generate-ship.webp" title="">

</a><figcaption class="image-caption">BI-as-Code with agents that: 1. Prompt 2. Generate 3. Ship</figcaption>
</figure></p>
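<p>What &ldquo;reviewed, tested and automated against&rdquo; can mean for an agent-written definition is sketched below in Python. To stay dependency-free, the definition is a plain dict rather than parsed YAML, and the required keys are an assumed schema, not Rill&rsquo;s actual one.</p>

```python
import json

# Assumption: a minimal schema for a measure definition.
REQUIRED_KEYS = {"name", "expression"}


def validate_measure(measure: dict) -> list[str]:
    """Return a list of problems; an empty list means the definition passes."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - measure.keys())]
    if "expression" in measure and not str(measure["expression"]).strip():
        problems.append("expression is empty")
    return problems


def canonical(measure: dict) -> str:
    """Serialize with sorted keys so the same definition always yields the same text."""
    return json.dumps(measure, sort_keys=True, indent=2)


if __name__ == "__main__":
    generated = {"expression": "SUM(shipping_cost)", "name": "shipping_costs"}
    print(validate_measure(generated))
    print(canonical(generated))
```

<p>A check like this can run in CI on every agent-generated change, and the canonical form keeps diffs small enough for a human to review.</p>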
<h3 id="self-serve-with-bi-as-code-providing-the-context">Self-Serve with BI-as-Code providing the <em>Context</em></h3>
<p>So what&rsquo;s next? Are we arriving at self-service BI finally? (the never-ending promise🙂).</p>
<p>Agents with natural language solve a big problem that self-serve has always strived to solve: giving less technical domain users the edge they need to serve themselves. They can potentially go even further and fix the data, by prompting the data pipeline to correct a timestamp or to update a dbt model with the right table source.</p>
<p>With domain experts doing more developer work with agents, we can combine domain knowledge with coding abilities of agents, bridged by natural language with <strong>BI-as-Code providing the semantic context</strong> that includes models, metrics and even dashboards in plain YAML.</p>
<p>Context is king for the near future. Everything that can be locally defined, such as Rill&rsquo;s metrics layer and dashboards, will be so much faster and better built with agents.</p>
<p>But BI is not the only domain noticing this power. <a href="https://www.linkedin.com/in/kurtbuhler/" target="_blank" rel="noopener noreffer">Kurt Buhler</a> of Tabular Editor <a href="https://tabulareditor.com/blog/ai-agents-with-command-line-tools-to-manage-semantic-models" target="_blank" rel="noopener noreffer">wrote</a>: &ldquo;CLI tools provide an alternative way to interact with software in a terminal by writing and executing commands. This command-line interface is very suitable for agents.&rdquo;</p>
<h3 id="sql-yaml-and-why-language-choices-matter">SQL, YAML: And why Language Choices Matter</h3>
<p>With Rill and MotherDuck, we choose <strong>SQL as our primary language</strong> and interface, and <strong>YAML as the structured format</strong> for storing definitions. The language choice matters because the training data for widely adopted languages like SQL is larger, so LLMs are better at generating BI-as-code in SQL than in DAX, LookML, or some obscure language.</p>
<p>Wes McKinney even <a href="https://wesmckinney.com/blog/agent-ergonomics/" target="_blank" rel="noopener noreffer">argues</a> that AI agents are enabling him to build software in languages like Go and Swift, despite his lack of prior experience. He says that &ldquo;human ergonomics in programming languages matters much less now,&rdquo; as agents prioritize fast compile-test cycles and frictionless distribution, favoring languages like Go and Rust over Python for new systems. If you&rsquo;re interested in more, check out the <a href="https://wesmckinney.com/transcripts/2026-02-10-rill-data-podcast" target="_blank" rel="noopener noreffer">conversation with Wes and Mike</a>.</p>
<h3 id="limitations">Limitations</h3>
<p>The one limitation for the future of data might be the <strong>imprecise way of natural language</strong> and how we communicate. For example: &ldquo;give me the analytics for this week?&rdquo; Did you mean &ldquo;from today until last week&rdquo;? Or &ldquo;full weeks Monday to Sunday&rdquo;? Or &ldquo;starting from midnight, or during the day&rdquo;? So many unknowns and misinterpretations possible.</p>
<p>The other is that the future of data needs to be <strong>deterministic and reproducible</strong>, to backfill faulty data, but AI agents are the opposite. And that can be challenging.</p>
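<p>The &ldquo;this week&rdquo; ambiguity mentioned above is easy to show in code. The Python sketch below computes three plausible readings of the same request; the function names are mine, and which reading is correct is exactly what a metric definition has to pin down.</p>

```python
from datetime import date, timedelta


def last_seven_days(today: date) -> tuple[date, date]:
    """Rolling window: the seven days ending today."""
    return today - timedelta(days=6), today


def calendar_week(today: date) -> tuple[date, date]:
    """Full week, Monday to Sunday, that contains today."""
    monday = today - timedelta(days=today.weekday())
    return monday, monday + timedelta(days=6)


def week_to_date(today: date) -> tuple[date, date]:
    """From this week's Monday up to and including today."""
    monday = today - timedelta(days=today.weekday())
    return monday, today


if __name__ == "__main__":
    today = date(2026, 5, 28)  # a Thursday
    for reading in (last_seven_days, calendar_week, week_to_date):
        start, end = reading(today)
        print(f"{reading.__name__}: {start} to {end}")
```

<p>Once one of these readings is written down as code, it is also deterministic: rerunning it for a past date gives the same range, which is what a backfill needs.</p>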
<h2 id="text-based-local-first-the-architecture-agents-need">Text-Based, Local-First: The Architecture Agents Need</h2>
<p>The connectors between source and destination are getting more flexible, more fluid, self-healing. Knowledge workers might soon be able to not only figure out the problem, but also act on it directly: &ldquo;send an email to&hellip;&rdquo; to fix the problem.</p>
<p>As the examples in this article show, we&rsquo;ve come a long way with <a href="https://www.rilldata.com/blog/has-self-serve-bi-finally-arrived-thanks-to-ai" target="_blank" rel="noopener noreffer">Self-Serve BI</a>, and we might already be there. Keep in mind the <strong><a href="https://www.rilldata.com/blog/data-modeling-for-the-agentic-era-semantics-speed-and-stewardship" target="_blank" rel="noopener noreffer">three pillars: semantics, stewardship, speed</a></strong> as you work in the <a href="https://www.rilldata.com/blog/data-modeling-for-the-agentic-era-semantics-speed-and-stewardship" target="_blank" rel="noopener noreffer">Agentic Era</a>.</p>
<p>The MotherDuck + Rill story is ultimately about the data industry discovering that the tools best suited for AI agents are the same tools that respect simplicity, transparency, and developer ergonomics.</p>
<p>The &ldquo;small data&rdquo; thesis didn&rsquo;t anticipate the AI agent revolution, but it created the conditions for it: when your data fits on a laptop and your dashboards are YAML files, an AI agent can read, reason about, and act on your entire analytics stack.</p>
<p>The irony is that going back to local-first, text-based, SQL-defined analytics turns out to be the most forward-looking architecture. And dashboards become agents when they&rsquo;re written as code.</p>
<p>&ndash;</p>
<p>If any of this interests you, also check out the related articles on conversational BI, BI-as-code, and how AI can help us self-serve in the world of BI:</p>
<ul>
<li><a href="/blog/bi-as-code-and-genbi/" rel="">BI-as-Code and the New Era of GenBI</a></li>
<li><a href="/blog/agentic-data-modeling/" rel="">Data Modeling for the Agentic Era: Semantics, Speed, and Stewardship</a></li>
<li><a href="/blog/self-service-bi-ai/" rel="">Has Self-Serve BI Finally Arrived Thanks to AI?</a></li>
</ul>
<hr>
<pre class=""><em>Full article published at <a href="https://www.rilldata.com/blog/building-an-agent-friendly-local-first-analytics-stack-with-motherduck-and-rill" target="_blank" rel="noopener noreferrer">Rilldata.com</a> - written as part of <a href="/services">my services</a></em></pre>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>See <a href="https://assets.amazon.science/7d/d6/b0e0ff5749ceb42ca6a8437038bc/why-tpc-is-not-enough-an-analysis-of-the-amazon-redshift-fleet.pdf" target="_blank" rel="noopener noreffer">Redshift Files</a>&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>If you don&rsquo;t see the sample_data database, your account may predate the sample data being added&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
</item>
<item>
    <title>Why I Still Blog — and Why the Future of Blogging Is Connected</title>
    <link>https://www.ssp.sh/blog/why-i-still-blog/</link>
    <pubDate>Fri, 06 Mar 2026 20:00:17 &#43;0100</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/why-i-still-blog/</guid><enclosure url="https://www.ssp.sh/blog/why-i-still-blog/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>I&rsquo;ve been online for twenty years, and blogging for ten of them. This is the story of, and the lessons learned from, blogging online for a decade. It goes beyond blogging and covers <a href="https://www.ssp.sh/blog/obsidian-note-taking-workflow/" target="_blank" rel="noopener noreffer">note-taking (workflow)</a>, how to write well, the medium in which writing works best, the formats in which writing lasts long-term (such as open formats), and methods such as vim motions to navigate and edit like a surgeon.</p>
<p>My prediction, and hope, is that the [[Future of Blogging]] is more connected. Not one-dimensional, like a single sheet of paper, but more like a maze that you can walk into and explore, finding new things to learn.</p>
<p>This is how I built up <a href="/brain" rel="">my Second Brain</a>, and you can see the interactive graph at the end of this blog, connecting all notes and blogs that are related.</p>
<p>This article is based on a recent interview for &ldquo;<a href="https://open.substack.com/pub/writethatblog/p/simon-spati-on-technical-blogging" target="_blank" rel="noopener noreffer">Write that Blog</a>&rdquo;. The interview prompted me to finally write this piece, after collecting hundreds of notes on writing online and blogging in my second brain.</p>
<h2 id="why-i-started-blogging-learning-oss-tools">Why I Started Blogging: Learning OSS Tools</h2>
<p>A quick note on how I got started: mainly, it was out of curiosity. As a business intelligence specialist with a Microsoft licence, I was curious about open-source tools with abilities similar to [[SSAS]] and [[SSRS]], which we used at work. Even more so, I was drawn to the programmatic-first approach of automating things instead of clicking my way through the UI of older, GUI-first tools.</p>
<p>I had some time when I lived in Copenhagen, Denmark, which I used to explore and document what I learned. I already had a domain (sspaeti.com) and web development experience from a site with weekly party pictures that I ran for many years, but which was no longer active once Facebook and other portals came along. So I decided to pivot it to a <strong>personal blog</strong>, a format that was very popular back then.</p>
<p>So I started my WordPress blog and uploaded <a href="/blog/ssas-cubes-dynamic-generation-of-partition" rel="">some scripts</a> and learnings around Microsoft and related automation that I was re-using often. Then I did a deep dive and a <a href="/blog/data-warehouse-automation-dwa/" rel="">series on data warehouse automation tools</a>, which got very good feedback after the initial blogs didn&rsquo;t go anywhere.</p>
<p>I found myself enjoying the process of distilling knowledge into a compact format so that others, and mainly myself, could learn new topics. The [[Feedback Loop]] was another amazing feeling that I hadn&rsquo;t known beforehand, following the principle [[The more you share the more you get]]: people were giving me suggestions, new ideas, and sometimes criticism, all of which led me to even more open-source tools and interesting approaches.</p>
<h3 id="what-started-as-a-hobby-turned-into-a-full-time-job-and-business">What Started as a Hobby Turned into a Full-time Job and Business</h3>
<p>Writing became one of my favorite hobbies, and it gave me lots of fulfillment: not the short-term dopamine hit, but the long-term [[Deep Happiness]] of learning, being appreciated by readers, and turning notes I had taken over a long time into something more usable to share with people. [[Learn in Public|Learning in Public]], as some called it later on.</p>
<p>I reserved Friday nights in my favorite library in Copenhagen, bought my favorite coffee at <a href="https://espressohouse.com/en" target="_blank" rel="noopener noreffer">Espresso House</a>, usually with a nice cookie or something sweet, and then I was off for 2-4 hours. Sometimes nothing really good came out; it was hard. Other times I was just trying new tools like [[Dagster]], [[Delta Lake]] etc., and at others I was in a [[Deep Work|deep flow]] of writing, almost like a trance.</p>


<div class="bluesky-embed-wrapper" style="display: flex; justify-content: center; margin: 1.5em 0;">
    <blockquote class="bluesky-embed" data-bluesky-uri="at://did:plc:edglm4muiyzty2snc55ysuqx/app.bsky.feed.post/3la3zwbcabo2e" data-bluesky-cid="bafyreihrz6eb6bnrgruie6ul6xixggfju646jx2afgsvztaqhpl4zbagnm"><p>Back where it all started 📚 #writing #blogging</p>&mdash; <a href="https://bsky.app/profile/did:plc:edglm4muiyzty2snc55ysuqx?ref_src=embed">Simon Späti 🏔️ (@ssp.sh)</a> <a href="https://bsky.app/profile/did:plc:edglm4muiyzty2snc55ysuqx/post/3la3zwbcabo2e?ref_src=embed">2021-10-25T09:43:10Z</a></blockquote><script async src="https://embed.bsky.app/static/embed.js" charset="utf-8"></script>
  </div>
  <script>
  (function() {
    function updateBlueskyTheme() {
      var isDark = document.body.getAttribute('theme') === 'dark';
      var mode = isDark ? 'dark' : 'light';
      
      document.querySelectorAll('.bluesky-embed').forEach(function(el) {
        el.setAttribute('data-bluesky-embed-color-mode', mode);
      });
      
      document.querySelectorAll('.bluesky-embed-wrapper iframe').forEach(function(iframe) {
        var src = iframe.src;
        if (src) {
          var url = new URL(src);
          if (url.searchParams.get('colorMode') !== mode) {
            url.searchParams.set('colorMode', mode);
            iframe.src = url.toString();
          }
        }
      });
    }
    
    updateBlueskyTheme();
    
    new MutationObserver(function(mutations) {
      mutations.forEach(function(m) {
        if (m.attributeName === 'theme') updateBlueskyTheme();
      });
    }).observe(document.body, { attributes: true });
  })();
  </script>
<p>The breakthrough came much later, three years in, when I wrote about a new, upcoming topic and the transition away from the data warehouse as I saw it, in a post called <a href="https://www.ssp.sh/blog/data-engineering-the-future-of-data-warehousing/" target="_blank" rel="noopener noreffer">Data Engineering, the future of Data Warehousing?</a>. This was my first viral post, and popular figures like Dan Linstedt commented on it. It was surreal back then: why would these people read <em>my</em> article?</p>
<p>But it gave me the motivation to continue. Sure, I love writing and sharing in public, but I&rsquo;m not sure I would have continued until today if nobody had been reading it.</p>
<blockquote>
<p>[!note] A short journey of how my domain and website evolved</p>
<p>I started my first blog in <strong>2015</strong>, but I was already online and registered a domain in <strong>2004</strong>. I bought the domain sspaeti.com, where my first endeavor was web development with HTML, CSS and PHP. The classic Apache years (fun fact, I still deploy to an Apache server to this day, but it&rsquo;s only static HTML today :)</p>
<p>From <strong>2005-2014</strong> I ran a local forum and party guide (this was before FB :) and then in 2015 I wrote my first data-related post. <strong>2016-2018</strong>: Regular blogging on Business Intelligence and data topics. In <strong>2019</strong> I started to focus more on open-source data engineering.<br>
In <strong>2021</strong> I moved from <a href="/blog/why-i-moved-away-from-wordpress/" rel="">WordPress to GoHugo</a>, and in <strong>2022</strong> I added the second brain to my website. From then on, all my notes and blogs were powered by Markdown, which led me to share much more, as publishing no longer required any conversion or extra work. What I wrote, I could just publish as is on <a href="https://www.ssp.sh/brain" target="_blank" rel="noopener noreffer">my second brain</a>. To this day, I have ~9000 private notes and ~1000 public notes, plus 81 blog posts and some chapters of an early book I&rsquo;m writing, in Markdown too :).</p>
<p>In <strong>2023</strong> I changed the domain to ssp.sh, as it&rsquo;s shorter :)</p>
</blockquote>
<h3 id="how-did-i-manage-to-continue-to-this-day">How Did I Manage to Continue to This Day?</h3>
<p>[[Writing is hard]], as anyone who does it will tell you. So why do I torture myself with it to this day? I even made it my full-time work, as I&rsquo;m currently self-employed and work as a <a href="/services" rel="">full-time author</a>.</p>
<p>The answer is not straightforward, but to tell the truth, I still love it to this day. Writing words is my canvas as an artist, where I can let out my thoughts, be creative, and bring something complex into simple terms. Into something that anyone might want to read.</p>
<h2 id="how-has-blogging-changed-over-the-last-10-years">How Has Blogging Changed over the Last 10 Years?</h2>
<p>From my start, when I created a personal WordPress blog, to today, there have been different evolutions, but overall, not much has changed in terms of personal blogs.</p>
<p>They are still the same, except that we changed the technology a couple of times: from Flash websites to hand-written PHP/HTML/MySQL, to WordPress, to Medium and [[Static Site Generators (SSG)]], to Substack today. The main change is social media. Before, personal blogs had more authority, and everyone was linking to other blogs. Today, a couple of social media tech giants hold a monopoly, and you almost need to share there to be discovered.</p>
<p>When I started, I was already using Twitter and LinkedIn too, but &ldquo;the game&rdquo; of distribution has changed. Still, the sole purpose of personal blogs is the same.</p>
<p>Today with AI we are even in a new era, with all the &ldquo;AI Slop&rdquo; generated and shared all over the place. I believe, and see it as my work as a professional writer, that the <a href="https://craft.ssp.sh/" target="_blank" rel="noopener noreffer">craft</a> of writing, and [[Writing Manually]], gets even more important.</p>
<p>Writing is communication, and we can&rsquo;t communicate through a filter. Yet that is what many are doing at the moment: the writer has AI convert bullets into prose, and the reader has it summarize the prose back into bullet points, watering down the actual points and wording of the original author. To the point where most people, me included, [[I&rsquo;d rather Read the Prompt|would rather read the prompt]].</p>


<div class="bluesky-embed-wrapper" style="display: flex; justify-content: center; margin: 1.5em 0;">
    <blockquote class="bluesky-embed" data-bluesky-uri="at://did:plc:edglm4muiyzty2snc55ysuqx/app.bsky.feed.post/3mfqu2gurwk25" data-bluesky-cid="bafyreibjkjjo5c3b74t7jnwthblfkeculdjqfgb54ne2yycjvre7op3jai"><p lang="en">Related. marketoonist.com/2023/03/ai-w...</p>&mdash; <a href="https://bsky.app/profile/did:plc:edglm4muiyzty2snc55ysuqx?ref_src=embed">Simon Späti 🏔️ (@ssp.sh)</a> <a href="https://bsky.app/profile/did:plc:edglm4muiyzty2snc55ysuqx/post/3mfqu2gurwk25?ref_src=embed">2026-02-26T09:11:17.336Z</a></blockquote><script async src="https://embed.bsky.app/static/embed.js" charset="utf-8"></script>
  </div>
<h2 id="blogs-vs-second-brain-notes">Blogs vs. Second Brain Notes</h2>
<p>One approach that I like to push, and that many practice locally with [[Obsidian]], is connected note-taking. I share the Obsidian notes that are worth sharing on my <a href="/brain" rel="">public second brain</a>. My process is described at [[Public Second Brain with Quartz]]: I add <code>#publish</code> to a note and it appears on my site, no conversion needed. The code and utilities are shared on <a href="https://github.com/sspaeti/second-brain-public" target="_blank" rel="noopener noreffer">GitHub</a>, too.</p>
<p>This brings back connected personal notes, also within your own website, using the synergies between your blog and second brain. The way I currently think about [[Sharing as Second Brain Note vs a Blog Post]]:</p>
<blockquote>
<p>The second brain helps me to share whatever is in my mind, and the blog helps me to refine. <strong>Notes compound</strong> and always evolving. Blog posts <strong>capture a moment in time</strong>.</p>
</blockquote>
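<p>The <code>#publish</code> step mentioned above boils down to a filter over notes. Here is a minimal Python sketch of that idea, assuming a note counts as public when its text contains the tag; it is not the code from the linked repository, and the function names are hypothetical.</p>

```python
import re

# Assumption: "#publish" must stand alone, so "#publisher" or "x#publish" do not count.
PUBLISH_TAG = re.compile(r"(?<![\w#])#publish\b")


def is_public(note_text: str) -> bool:
    """True if the note carries the publish tag."""
    return bool(PUBLISH_TAG.search(note_text))


def select_public(notes: dict[str, str]) -> list[str]:
    """Return the names of all notes marked for publishing, sorted for stable output."""
    return sorted(name for name, text in notes.items() if is_public(text))


if __name__ == "__main__":
    vault = {
        "Writing is hard.md": "Some thoughts on writing. #publish",
        "Private journal.md": "Not for the internet.",
    }
    print(select_public(vault))
```

<p>Because the tag lives inside the note itself, publishing stays a one-word edit and the private notes never leave the vault.</p>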
<p>There&rsquo;s also the difference between long-term, always updated, [[compounding]] notes and the one-time distilled blog article. They work so well together. As you might notice, most of the links in this article lead to long-term notes in my second brain, with much more information, that I&rsquo;ve been collecting and refining over the years.</p>
<p>This way, I can bring all my notes into one storyline that reflects how I&rsquo;m currently thinking and share it as a blog post like this one, while continually updating the related notes linked here as part of a long-term strategy for blogging, with its different [[Type of Notes]].</p>
<h3 id="connects-knowledge-helping-learning-the-same-way-as-our-brain-does">Connects Knowledge: Helping Learning the Same way as Our Brain Does</h3>
<p>I&rsquo;m thinking of Designing Data-Intensive Applications by Martin Kleppmann, where he added maps to the book that place related terms close together:<br>


<div class="bluesky-embed-wrapper" style="display: flex; justify-content: center; margin: 1.5em 0;">
    <blockquote class="bluesky-embed" data-bluesky-uri="at://did:plc:edglm4muiyzty2snc55ysuqx/app.bsky.feed.post/3mfruvqq2rc2o" data-bluesky-cid="bafyreibvvckb3lxqwtsbvaxow2h2cmmlpbodv3nyuxxcmfetn2siaw7wuq"><p lang="en">Here are some of the maps. See how Kafka is close to Kinesis. Really like them.</p>&mdash; <a href="https://bsky.app/profile/did:plc:edglm4muiyzty2snc55ysuqx?ref_src=embed">Simon Späti 🏔️ (@ssp.sh)</a> <a href="https://bsky.app/profile/did:plc:edglm4muiyzty2snc55ysuqx/post/3mfruvqq2rc2o?ref_src=embed">2026-02-26T18:59:13.373Z</a></blockquote><script async src="https://embed.bsky.app/static/embed.js" charset="utf-8"></script>
  </div>
  <script>
  (function() {
    function updateBlueskyTheme() {
      var isDark = document.body.getAttribute('theme') === 'dark';
      var mode = isDark ? 'dark' : 'light';
      
      document.querySelectorAll('.bluesky-embed').forEach(function(el) {
        el.setAttribute('data-bluesky-embed-color-mode', mode);
      });
      
      document.querySelectorAll('.bluesky-embed-wrapper iframe').forEach(function(iframe) {
        var src = iframe.src;
        if (src) {
          var url = new URL(src);
          if (url.searchParams.get('colorMode') !== mode) {
            url.searchParams.set('colorMode', mode);
            iframe.src = url.toString();
          }
        }
      });
    }
    
    updateBlueskyTheme();
    
    new MutationObserver(function(mutations) {
      mutations.forEach(function(m) {
        if (m.attributeName === 'theme') updateBlueskyTheme();
      });
    }).observe(document.body, { attributes: true });
  })();
  </script></p>
<p>I see the connected, interactive graph on my second brain, and on my book (just <a href="https://bsky.app/profile/ssp.sh/post/3mfrlc74i7s2k" target="_blank" rel="noopener noreffer">recently added</a>) the same way. It helps learning.</p>
<p>We learn much better if we can associate something new with an existing term or something we already know. A new idea that sits orphaned in our brain, without a connection or synapse to another thought (or note, in our case), is hard to remember and learn from.</p>
<p>In a [[Second Brain]] and [[Digital Garden]] approach, you connect every note at least to one existing term. I also like to add its <code>origin</code> so I always know where it came from. More on <a href="/blog/obsidian-note-taking-workflow/" rel="">My Obsidian Note-Taking Workflow</a> if that interests you more.</p>
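<p>To make the idea of connected notes and backlinks a bit more concrete, here is a minimal sketch in Python. It is not how Obsidian or Quartz implement it. The function names and the tiny <code>vault</code> dictionary are hypothetical and only exist for illustration. It assumes notes are plain Markdown strings that use the double-bracket link syntax seen throughout this article.</p>

```python
import re
from collections import defaultdict

# Matches [[Target]] and [[Target|Alias]]; only the target is captured.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def extract_links(text):
    """Return the note titles that a piece of Markdown links to."""
    return [target.strip() for target in WIKILINK.findall(text)]


def build_backlinks(notes):
    """Map each linked title to the sorted list of notes pointing at it."""
    backlinks = defaultdict(set)
    for title, text in notes.items():
        for target in extract_links(text):
            backlinks[target].add(title)
    return {target: sorted(sources) for target, sources in backlinks.items()}


def orphans(notes):
    """Notes that link to nothing and are linked from nowhere."""
    linked = build_backlinks(notes)
    return sorted(
        title
        for title, text in notes.items()
        if not extract_links(text) and title not in linked
    )


# Hypothetical mini vault, invented for illustration only.
vault = {
    "Functional Data Engineering": (
        "Relates to [[Clarity]] and "
        "[[No Less Code vs Code|Code Is Still the Best Abstraction]]."
    ),
    "Clarity": "See [[Functional Data Engineering]].",
    "Loose Thought": "No links yet.",
}

print(build_backlinks(vault))
print(orphans(vault))
```

<p>The orphan check is the rule from above in code form: a note without any connection is the one that is hardest to find again later.</p>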
<p>For example, the note below about [[Functional Data Engineering]] (← click here to see the graph and backlinks in action) shows how, besides the written text, you can glance at connected notes through the interactive graph or through backlinks.<br>
![[img_index.en_1772811747298.webp]]</p>
<p>We can visually see things that are otherwise almost impossible to grasp. A map of a city conveys an information density that no explanation over the phone, in a chat, or in written text can match. It&rsquo;s the same with the graph.</p>
<p>And the best part: it&rsquo;s additional, so you don&rsquo;t need to look at it at all. It is most helpful when you want to learn or don&rsquo;t know the space that well yet. You see a term or connection you know, your brain immediately registers that these belong together, and you probably remember it forever, or at least much longer.</p>
<p>E.g. in the above example, we might see that functional data engineering is linked to clarity, and to [[No Less Code vs Code|Code Is Still the Best Abstraction]], which might be non-obvious, but really helpful to know.</p>
<p>Again, linked notes are the best way to organize knowledge, especially optimized for <strong>learning</strong>. Knowledge doesn&rsquo;t grow linearly. It expands as a network over different seasons. I write more about that phenomenon and continue to update at [[Future of Blogging]].</p>
<h3 id="the-process---and-the-difference-between-long-term-and-short-term">The Process - And the Difference between Long-Term and Short-Term</h3>
<p>My process is essentially:</p>
<ol>
<li>An idea occurs while reading a book, listening to a podcast, talking to someone, or elsewhere</li>
<li>I write it down in my private second brain</li>
<li>I connect it to existing notes, refine the note, and add new thoughts and notes</li>
<li>Eventually I add <code>#publish</code> and publish it online on my public second brain</li>
<li>Eventually I distill that note with many others into a blog post about a specific topic related to that note</li>
<li>I continue updating and refining the note</li>
<li>Eventually I write another blog post that relates to it, using the improved note</li>
<li>I use the note for my book topics</li>
<li>I refine the note again</li>
</ol>
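<p>The publish step in this process can be pictured as a simple filter. The following Python sketch is an assumption about how such a filter could look, not the actual setup behind my public second brain. The names <code>is_public</code> and <code>select_public</code> and the example notes are hypothetical.</p>

```python
import re

# "#publish" counts only as a whole tag, so "#publishing" does not match.
PUBLISH_TAG = re.compile(r"(?:^|\s)#publish(?![\w/-])")


def is_public(text):
    """True if the note text carries the #publish tag."""
    return bool(PUBLISH_TAG.search(text))


def select_public(notes):
    """Return the titles of all notes marked for publishing, sorted."""
    return sorted(title for title, text in notes.items() if is_public(text))


# Hypothetical notes, invented for illustration only.
notes = {
    "Future of Blogging": "Knowledge expands as a network.\n#publish",
    "Draft Idea": "A one-liner that is still private.",
    "Tooling": "Thoughts about #publishing pipelines.",
}

print(select_public(notes))
```

<p>Everything without the tag stays private by default, which keeps the first steps of the process low-pressure.</p>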
<p>I think you get the gist. The initial one-liner, the note that exists because of a real insight, is actually the most important piece of information in the whole process, in my opinion.</p>
<p>Not to say that the blog articles are not important; the two need each other. By writing the blog, I distill and connect multiple notes at a given point in time into a frozen article. During that process, the notes get updated too. And after sharing the blog post, I get a lot of feedback (see [[Feedback Loop]]). I can&rsquo;t add it to the blog post, as that is a snapshot in time, so I add it to the existing note instead.</p>
<p>So <strong>feedback is actively processed</strong> and massaged into my private or public second brain, improving my overall approach. I think this process of connected notes, going from private note to public note to blog post and back to the continually updated note, is even more powerful than Niklas Luhmann&rsquo;s [[Zettelkasten]], which revolutionized the [[Smart Note Taking]] approach based on zettels: small, unique ideas that he connected to others.</p>
<blockquote>
<p>[!info] No Research needed with this process</p>
<p>Related to this, the process means [[Why I don&rsquo;t Research|I don&rsquo;t need to research]] topics in the classical sense. My life and insights come in steadily and get massaged and integrated like a slow-flowing river, all organically.</p>
</blockquote>
<h3 id="the-form-of-writing-long-form-or-short">The Form of Writing: Long Form or Short</h3>
<p>My writing usually tends to be very long-form. Because I take lots of notes that I try to connect and write in an interesting way, I tend to get very long. Same as this writing already is, but I still have 3000+ notes collected to go.</p>
<p>Also, long-form writing <strong>evokes a deeper relationship and trust that is hard to build</strong> with a couple of words. That&rsquo;s also why a [[Reading Books for a Happy Life|book]] can connect you to the author like no other medium can.</p>
<p>It&rsquo;s also a question of playing the <strong>long-term game</strong> and writing something that stays relevant for many years to come, or just capturing a quick trend and hastily putting out a blog post. These are totally different forms of content and strategies. The latter is usually also used on social media, to grab a lot of attention and go viral for a short time.</p>
<p>But what I always ask myself, what&rsquo;s the gain from it? Ultimately, likes and followers are just a [[Vanity Metric]], and to me at least, don&rsquo;t count as much as a real human reading these words. Not leaving a like or comment, but just having made a connection or an impact on someone in another part of the world I don&rsquo;t know (yet? I&rsquo;m always happy to get introduction emails from my readers! :)). Or just inspiring or making you think about something related, or just learning something new.</p>
<p>That&rsquo;s at least my main goal. There&rsquo;s no hidden goal or message behind my writing. Obviously, if I write for my clients, it&rsquo;s a bit different, as I want them to succeed, whereas when I write for myself, I just want to let out my thoughts. But what I learned over the years is that [[Writing from The heart]], being genuine, also helps with work-related topics. At the end of the day, it&rsquo;s still a human being reading it, and therefore the same principle applies as if it were just a random blog post.</p>
<h2 id="my-writing-process">My Writing Process</h2>
<p>I&rsquo;d like now to switch gears a bit as we went through the differentiation of compounding, refined notes and in-time blog posts, and talk about my writing process that I have mastered or improved over the years.</p>
<p>This is my unique writing approach and even more so, note-taking approach. This is not how you have to do it, and probably won&rsquo;t work for you. As I learned on the <a href="https://www.youtube.com/watch?v=KU5FUqbqMK0&amp;list=PLFxhXLgGkVzKCn23_g8qM19DMDgco8eNJ" target="_blank" rel="noopener noreffer">How I Write (Podcast)</a> by David Perell, each author has a totally different approach. And this here is mine.</p>
<p>But here I want to share a little bit more about my strategies, my approach to writing, and my tips and tricks I have learned and noted down over the years.</p>
<h3 id="i-spend-many-hours-weeks-and-months-on-single-blog-posts">I Spend Many Hours, Weeks and Months on Single Blog Posts</h3>
<p>But even before that, a quick preface: I spend many hours, weeks, sometimes months on a single blog post. <a href="/blog/why-are-we-here-on-earth/" rel="">Why Are We Here on Earth?</a>, for example, I wrote over the course of two years. If you include my note-taking, it is sometimes even longer, because some of my notes are ten years old or more.</p>
<p>Also, the longer I work on a piece, the more learning, and sometimes struggle, I can put into it, which helps the piece not be outdated within a few months. Something that I&rsquo;ve been grappling with for months and years most probably won&rsquo;t be gone tomorrow.</p>
<p>This is one reason why I don&rsquo;t like to write too much about AI at the current pace: everything I write, and would spend lots of hours on, might be outdated the moment I publish.</p>
<p>That&rsquo;s why my approach is just collecting and refining my thoughts in a second brain note, for example in this case [[Will AI replace Humans]] and related notes, which already made it to the front page of Hacker News. I&rsquo;m sure at some point I will take that note and all its relevant related notes and distill them into a single blog post. But the time hasn&rsquo;t come yet, as so much is changing.</p>
<p>It&rsquo;s not that I don&rsquo;t do it at all. Sometimes I will write something quick, something maybe less long-term, but usually it&rsquo;s just less fun for me to write, and maybe less challenging? That said, some topics and articles that I have slept on for a long time just poured out of me in one go: no sophisticated linking in my second brain or other approach, just a blank page and writing it down. Usually these are topics that I have read extensively about, that I&rsquo;m discussing with people, or that are just dear to my heart. My subconscious keeps working on them until it tells me it&rsquo;s ready, and then I must not miss the opportunity and just write it down.</p>
<p>This piece and topic are a little like that. It&rsquo;s so dear to my heart, and something I had wanted to write for so long, yet I had never done it. Now, the &ldquo;write that blog&rdquo; interview triggered so many questions that I just went on and wrote up to this point in one flow. No breaks, just free flow and combining different notes I have in my Obsidian vault.</p>













  
<figure><a target="_blank" href="/blog/why-i-still-blog/img_index.en_1772814021534.webp" title="">

</a><figcaption class="image-caption">What my Second Brain looks like while writing this very article</figcaption>
</figure>
<p>This is how my vault and process looks right now, with:</p>
<ol>
<li>The current note</li>
<li>The very long outline (you can&rsquo;t even see half of it)</li>
<li>Related notes through smart connections</li>
<li>The initial &ldquo;write that blog&rdquo; interview I answered</li>
<li>Another connected note that I wrote about just above</li>
<li>More related notes</li>
<li>The word counts on the lower right, and the Vim Motions I write the article in</li>
</ol>
<blockquote>
<p>[!example] Don&rsquo;t focus too much on the numbers, but on writing<br>
With social media, you could focus too much on a [[Vanity Metric]], such as how many likes you get. I try not to care too much about it, though it&rsquo;s still needed. I wrote more about this in <a href="/blog/well-being-algorithms/" rel="">Well Being in Times of Algorithms</a>, my personal essay towards a better World Wide Web, and how well-being is connected to social media.</p>
</blockquote>
<h3 id="ultimate-goal-good-storytelling">Ultimate Goal: Good Storytelling</h3>
<p>If I had to summarize my writing process, or the goal of it, then it crystallized for me lately that the goal is the ultimate storytelling. I want to write about a topic with an intro that catches attention, a great body, and an ending that finishes with a hook and ties everything together.</p>
<p>Storytelling as in the movies, with the main act, the second act, the villain, etc., applies to writing too. The difference is that you can&rsquo;t use fancy show effects; you are left with simple words.</p>
<blockquote>
<p>[!abstract] What is Good Storytelling?</p>
<p>This obviously is very personal, and differs from person to person. To me, most of it boils down to the art of leaving things out, which I am getting much better at over time. And I think that is really what storytelling is all about.</p>
</blockquote>
<p>That&rsquo;s why it&rsquo;s very important to use what you have as a writer. In writing a blog like this, one of the most important tools, and the one I like to use most, is the length of a paragraph: making paragraphs look good and breaking them at the right moment when re-reading. End on a high, then start the next one so that it connects but brings a new insight. The paragraphs should be of different lengths, they should be interesting, and they should change over time.</p>
<p>Mix it up with images, with some quotes or, what I like a lot, [[Admonition (Call-outs)|callouts]]. The reason why I like callouts is that I can add an additional story, a side note, in a way that does not distract from the main storyline but serves readers who like some behind-the-scenes or more information. Plus they look beautiful in my eyes, adding different colors to the blog post, as each of my callout types has a different color. It makes the post more interesting aesthetically, and in my opinion that also makes people want to read more. The aesthetic can help big time.</p>
<h3 id="leave-with-a-spark-making-it-interesting">Leave with a Spark✨. Making it Interesting</h3>
<p>Besides the ultimate goal of having a good storyline, having a common thread, a nice reading flow and outline is key to keep you, the reader, engaged. I like to jump a bit around.</p>
<p>That means cutting some topics short, moving on to something else, maybe coming back, maybe not, leaving the reader hanging a little, which makes it more interesting. The <strong>key of good writing</strong> is leaving out what needs to be left out. It&rsquo;s an art form, because I could ramble forever on this topic, as I&rsquo;m super passionate about it - as you might have noticed - but I always need to keep in mind not to bore you, and to give you new insights.</p>
<p>That&rsquo;s why I&rsquo;m switching now to making it interesting. Besides switching from topic to topic, I also like to go very deep in a topic, and then zooming out very high-level, only to go very deep again in the next sentence.</p>
<p>Zooming in and out helps the reader to not lose the overview while also learning something new. I usually don&rsquo;t spend too much time at the middle &ldquo;zooming level&rdquo;; this level is boring to me as it&rsquo;s too vague (not detailed and concrete enough, and without enough overview to guide).</p>
<p>Switching all the time might feel a little <strong>like a rollercoaster</strong>, but rollercoasters are fun, and that&rsquo;s how I envision my articles. To me, writing and its process boil down to:</p>
<blockquote>
<p>[[Writing from The heart]] is the best. True, honest, and genuine human-to-human communication.</p>
</blockquote>
<h3 id="writing-from-curiosity">Writing from Curiosity</h3>
<p>My best writing comes from my own curiosity. I want to answer a question for myself, even better if I don&rsquo;t know the answer beforehand. Not knowing where I&rsquo;m heading to.</p>
<p>I usually set a title, and then go with the flow and see where it leads me. These are the best writings of mine. If I have to write about a prescribed topic or outline, it couldn&rsquo;t be more boring, and that&rsquo;s usually reflected in my writing too.</p>
<p>The exception is if I know the space very well. Then I can write a thought leadership piece, bringing together 20 years into one blog post. The challenge is again in nailing the storytelling part to make twenty years coherent from start to end.</p>
<h3 id="my-writing-style">My Writing Style</h3>
<p>Finding your writing style and writing voice is something that was very hard for me. But I think it is key to becoming a writer, especially a professional writer.</p>
<p>It takes time. What helped me for sure was reading many books, finding my favorite authors, and identifying their writing style. This helped me discover that I liked <a href="https://sive.rs/" target="_blank" rel="noopener noreffer">Derek Sivers&rsquo;</a> books. Initially, I didn&rsquo;t know why, until I found out more about his writing style and his personality in interviews, read more from him, and analyzed his work.</p>
<p>I found that I like his minimalistic style: scrapping each unneeded word, getting straight to the point, and providing value while inspiring with different takes that you haven&rsquo;t already heard a hundred times. He writes <strong>genuinely</strong>. He also [[Journaling|journals]] 3-4 hours almost daily, thinking and brainstorming a lot in his head and second brain.</p>
<p>That&rsquo;s also what led me to journal and write in my second brain and, like a physicist, experiment with different formulas and ideas to see what comes out. That&rsquo;s my <strong>second brain idea creation</strong>. I have two phases: the idea creation and the finishing part. Again, the second brain is where I start with a one-liner, a note from a friend, listing interesting things, linking to existing notes.</p>
<p>Later when distilling into a blog post, or sharing a public second brain note, I will tackle the deeper meaning, the connection with other ideas and areas of my life or things I&rsquo;m currently learning or have learned a long time ago.</p>
<p>I try not to force it. How many times have I tried to force it, only to go to bed early, and the next day wake up and just have it flow out of my fingers.</p>
<h3 id="writing-is-rewriting">Writing is Rewriting</h3>
<p>Most of it is also just <strong>rewriting</strong>. When you write something 3-4 times, when you sleep on it and your subconscious works on it while you walk, it always gets better.</p>
<p>It&rsquo;s a <strong>personal editing process</strong>. It&rsquo;s also part of a writing style. Jason Fried and Haruki Murakami, the latter in <a href="https://www.goodreads.com/book/show/143361343-novelist-as-a-vocation" target="_blank" rel="noopener noreffer">Novelist as a Vocation</a> (an amazing book for writers), are constantly rewriting. Sometimes based on feedback from readers, sometimes based on a [[Gut Feeling]].</p>
<blockquote>
<p>[!example] A short story from the book Novelist as a Vocation<br>
Haruki Murakami writes in his book that once he lost a manuscript of a chapter. He was devastated, but had no other choice but to rewrite it.</p>
<p>Years later he found the manuscript again and was afraid it would be better than what he had handed in for his book that was already published. But the fear was all wrong, it was so much worse, he writes.</p>
</blockquote>
<h4 id="finding-your-voice-but-how">Finding Your Voice, but How?</h4>
<p>So, how do you find your voice?</p>
<p>After all the different ways I have written and all I have read from other authors, I know there is no one way. I can&rsquo;t tell you what yours will be, other than to start writing and trust the process.</p>
<p>For me, I defined my writing voice as a personal, first-person voice. Something I have experienced, I can easily explain or write about. But making up a fake story, something they make you do in school, is something I was never good at.</p>
<p>I try to be <strong>authentic, friendly and succinct,</strong> with the goal of adding value for the readers.<br>
I try to give clues and tools that are concrete and extremely specific, but I also leave things out, because I&rsquo;m not an academic. I can only make suggestions based on what I learned; I cannot solve all the problems I write about. I also try not to overly copy others&rsquo; ideas, but to make them my own by connecting them with my own unique life experience. Including hardship and daily struggle and just life.</p>
<p>Key is also that you share in public. Keep writing until you find your voice. From there on, it will be much easier.</p>
<h3 id="how-i-found-my-writing-voice-through-the-english-language">How I Found My Writing Voice: Through the English Language</h3>
<p>I lived abroad for almost three years, learning Danish, but even more so English. And what happened there is something I would never have predicted.</p>
<p>The more English I learned (I wasn&rsquo;t fluent before), the more books I started to read. I noticed that there were so many more books available to me than before, when I only read in German.</p>
<p>Also, I found out that I really liked the English language, with its simpler grammar compared to High German, as we like to call it in Switzerland. I found that I can express myself much better and more precisely, as English has so many words for almost the same meaning, so you can pick the one that describes exactly what you want to say. Whereas in German I felt I always needed to write a full novel to explain a simple thing very specifically. This might be good for fiction, but not for my technical writing, or for what I write here.</p>
<p>All of a sudden I was reading books in all my free time, listening to <a href="https://tim.blog/podcast/" target="_blank" rel="noopener noreffer">Tim Ferriss</a> and all the guests on his podcast, and learning every day. That was also the time when I went to the library in Copenhagen and started writing in English, a second language I was only just learning properly (I had English in school before and could converse and make small talk). I wrote more about my journey in <a href="/blog/finding-my-pathless-path/" rel="">Finding My Pathless Path</a> if you are curious to know more about that.</p>
<h4 id="simple-english">Simple English</h4>
<p>As you know now, and might guess from my grammatical errors here and there, English isn&rsquo;t my mother tongue.</p>
<p>For a long time I saw that as a disadvantage, but lately I figured that it might even be a strength of my writing. My somewhat limited English vocabulary and language skills lead to my articles being written in much <strong>simpler English</strong>.</p>
<p>And one thing I learned over the years: the more simply you can explain complex topics and make them approachable, the easier it is for the reader to follow along. It also makes the writer more approachable, less &ldquo;snobbish&rdquo; maybe?</p>
<p>And this is an advantage. My writing is much more approachable this way. It adds a natural constraint that potentially makes my writing process easier. I am not even aware of it while writing, but it shapes the way I write.</p>
<h2 id="the-writers-toolkit-and-the-tools-i-use">The Writer&rsquo;s Toolkit: And the Tools I Use</h2>
<p>Let&rsquo;s come to the last bigger part of this already long article, the tools and methods I use.</p>
<p>My main tool is writing in an open format, that is just [[Markdown]] and then using [[Obsidian]] as the editor to connect these simple notes together in a meaningful way.</p>













  
<figure><a target="_blank" href="/blog/why-i-still-blog/img_Todays%20Daily%20Graphs%20-%20Obsidian%20Graph_1771602123200.webp" title="">

</a><figcaption class="image-caption">My latest Obsidian graph with <code>9057</code> notes. Follow along with more on this <a href="https://x.com/sspaeti/status/2024872752100913586" target="_blank" rel="noopener noreffer">Tweet</a></figcaption>
</figure>
<h3 id="the-different-modes-of-my-writings-with-vim-motions">The Different Modes of My Writings with &lsquo;Vim Motions&rsquo;</h3>
<p>Apart from that, the way I write is with something called [[Vim Language (and Motions)|Vim Motions]]. I have written extensively about it, and you might think they don&rsquo;t matter much.</p>
<p>I hope, if you write online, or program for a living, that you learned touch typing at some point in your life. If you have, you&rsquo;d agree that it tremendously helped you with everything working on the computer, right? Not needing to see each key before you press.</p>
<p>Vim motions go a step further, essentially making each key on your keyboard a tool. In the default mode, when opening vim, each key press performs a function. E.g. <code>g</code> is for jumping around (<code>gg</code> jumps to the top of the document), <code>ctrl + o</code> jumps back to where you were before, <code>G</code> jumps to the end of the document, and <code>$</code> jumps to the end of a line. I could go on forever, but if you want to actually write something, you&rsquo;d need to switch to &ldquo;insert-mode&rdquo; with <code>i</code> (there&rsquo;s also <code>a</code> for append, or <code>o</code> to open a new line). So you have different modes. As I wrote in [[Four Modes of Writing]], I have four modes with vim motions at all times:</p>
<ol>
<li>NORMAL mode: jumping around, reading, learning</li>
<li>INSERT mode: writing, thinking, making connections</li>
<li>VISUAL mode: copying, highlighting, format, designing</li>
<li>COMMAND mode: automate, fix, macros.</li>
</ol>
<p>These vim motions, which are different from [[vim]] or [[Neovim]], the editor, are also available in Obsidian and almost any editor you know. Even Gmail has shortcuts like <code>j</code> to go down or <code>k</code> to go up, two common ways of navigating with vim motions. Even more, vim has a language, the <strong>vim language</strong>. This is super helpful as you don&rsquo;t need to memorize thousands of commands by heart, but can combine them. Almost like Street Fighter, where you can do a combo.</p>
<p>With vim motions, which I write much more about in <a href="/blog/why-using-neovim-data-engineer-and-writer-2023/" rel="">Why Vim Is More than Just an Editor</a>, you can edit with the precision of a surgeon. If interested, also check out my video on [[Vim Motions for Writers]], where I made a timelapse of how that looks:</p>
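<p>To illustrate how the vim language composes, here is a small Python sketch that reads a command as an optional operator, an optional count, and a motion. It only covers a handful of real vim keys (<code>d</code>, <code>c</code>, <code>y</code>, <code>w</code>, <code>b</code>, <code>$</code>, <code>gg</code>, <code>G</code>), and the <code>describe</code> function with its wording is hypothetical. Real vim is far richer than this.</p>

```python
import re

# A few real vim operators and motions; the descriptions are informal.
OPERATORS = {"d": "delete", "c": "change", "y": "yank (copy)"}
MOTIONS = {
    "w": "word(s) forward",
    "b": "word(s) backward",
    "$": "to the end of the line",
    "gg": "to the top of the document",
    "G": "to the end of the document",
}

# Optional operator, then either a counted word motion or a plain jump.
COMMAND = re.compile(r"^([dcy]?)(?:([1-9]\d*)?([wb])|(gg|[$G]))$")


def describe(command):
    """Spell out what a small vim command means in plain English."""
    match = COMMAND.match(command)
    if match is None:
        raise ValueError(f"unsupported command: {command!r}")
    operator, count, counted_motion, plain_motion = match.groups()
    verb = OPERATORS.get(operator, "move")  # no operator means just moving
    if counted_motion:
        target = f"{count or 1} {MOTIONS[counted_motion]}"
    else:
        target = MOTIONS[plain_motion]
    return f"{verb} {target}"


for command in ["gg", "d2w", "y$", "cb"]:
    print(command, "=", describe(command))
```

<p>Three operators and five motions already give twenty meaningful commands here, before counts. That is the combo effect: you learn a few building blocks instead of memorizing every command.</p>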
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/6kaOcYg0io8?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<blockquote>
<p>[!tip] A Trick you can use:  <a href="/brain/writing-within-the-app-vs.-a-note-app" rel="">Writing within the App vs. a Note app</a></p>
<p>Sometimes it&rsquo;s hard to write in your notes app or offline, but if you send an email to a friend, or write in a LinkedIn post or in your blog editor, the pressure is on. You know it&rsquo;s going to go live, or it&rsquo;s for a certain friend. This can help you unblock [[Writing is Hard|writer&rsquo;s block]] or produce better quality.</p>
</blockquote>
<h2 id="the-medium">The Medium</h2>
<p>There are different mediums out there to publish on, to write on, and to take notes with. When should you use which?</p>
<h3 id="the-medium-to-publish">The Medium to Publish</h3>
<p>Which medium, which platform do you use? Nowadays you have many options, Substack, Medium, Ghost or other [[Open Subscription Platforms]].</p>
<p>I would always recommend having your own domain. If you like web design and tinkering a bit, even more now with [[Vibe Code Agents|AI Agents Tools]], you should start with a [[Static Site Generators (SSG)]].</p>
<p>These allow you to use Markdown as the format and to own your content. You don&rsquo;t lose backlinks when switching platforms, and you build the domain ranking over time, leading to more authority when searching for a job or for your own business or side projects. Or also just for a hobby where you share learnings and topics of interest to you.</p>
<blockquote>
<p>[!example] Medium in Taking Notes<br>
I wrote much more on [[Digital vs Paper]], where I explain that I use both - and also share examples of what that process from physical paper notes to digital notes can look like.</p>
</blockquote>
<h3 id="tools-laptops-distraction-free-typewriter-unitaskers">Tools: Laptops, (distraction-free) Typewriter, Unitaskers!</h3>
<p>An important part is also to make it fun! I do that with different devices. Recently I even used the old typewriter of my grandfather. It showed me the power of [[Uni-taskers]].</p>
<p>The typewriter can only write, not like a laptop where you can surf the internet, or play games or get distracted by social media. Just typing ahead. So refreshing.</p>
<p>That&rsquo;s where I went down the rabbit hole of the [[Distract-Free Typewriter|Distraction-Free Typewriter]] and bought myself a small <a href="https://github.com/unkyulee/micro-journal" target="_blank" rel="noopener noreffer">Micro Journal</a>. A digital device solely for typing. Obviously it can connect to the internet and could do much more - as I installed Linux on it - but the resources are so limited that even [[Neovim]], which I remodeled into a <a href="https://wp.ssp.sh" target="_blank" rel="noopener noreffer">Wordprocessor</a>, struggles to open. So there&rsquo;s no danger of doing anything else.</p>
<p>Also because of the limited screen real-estate, it lets you really focus on the writing, and less so editing an article. So it&rsquo;s really a joy to use to exercise my [[Creative Writing]] vein. Just for the joy of writing.</p>













  
<figure><a target="_blank" href="/blog/why-i-still-blog/img_My%20Typewriter%20%28Hermes%202000%29_1763202607476.webp" title="">

</a><figcaption class="image-caption">Left is the Hermes 2000 from my Grandfather, and my distraction-free typewriter (Micro Journal Rev. 2), and one of my three unique keyboards I love (<a href="brain/kinesis-advantages-2-lubing-and-dampering" rel="">Kinesis Advantage 2</a> in this case)</figcaption>
</figure>
<h3 id="the-future-proof-format-markdown">The Future-Proof Format: Markdown</h3>
<p>As mentioned already, [[Markdown]] is the format of choice for me, especially after I was trapped in Microsoft OneNote and its proprietary format and couldn&rsquo;t get my own notes out of it. It was key to have something that will stand the test of time. And nothing does that better than [[Plain Text Files]] with Markdown.</p>
<p>I gave a full talk about this topic at <a href="https://www.youtube.com/watch?v=BOJFHMtyqNs" target="_blank" rel="noopener noreffer">Knowledge Management in the Digital Age: From Zettelkasten to Startup Owner</a>, check that out if you want to know more why Markdown, its advantages over [[Rich Text]], and how you can build a note taking setup that works with Obsidian - and even setting the foundation for a solo business like mine.</p>
<p>Markdown has many more advantages. The format has proven to work very well for agents. I can easily share my public second brain in Markdown with [[Quartz - Publish Obsidian Vault|Quartz]], with no conversion or manual copying of notes between rich-text editors and the website.</p>
<p>Markdown has all the advantages of simply writing with minimal formatting sugar. The formatting lives as part of the text, so copy-pasting doesn&rsquo;t lose the links or bold/italic etc., which we put there for a reason.</p>
<p>Markdown is declarative, meaning you can automate things: the same text can be presented by different engines. E.g. I use [[HackMD]] for collaboration (Google Docs for Markdown), and I use Markdown to publish on my website. It&rsquo;s the same file and the same format, so no conversion is needed.</p>
<p>Compare this to your typical Google Docs, WordPress, Webflow, or other [[Open Subscription Platforms]] such as Substack and Medium. These tend to enforce constraints: you always need to copy your text back and forth, creating duplicates and potentially losing important formatting.</p>
<p>The other big advantage: Markdown files are just [[Plain Text Files|plain text files]], meaning we own the files. No big tech company can deny us access or take them away, they <strong>work offline</strong> when we don&rsquo;t have internet, and they are <strong>super fast</strong> because they are just tiny files stored locally, with no round trips to a server.</p>
<blockquote>
<p>[!info] My Note-Taking Path from forgetting everything to Obsidian with Vim and Quartz<br>
My path so far with note-taking:</p>
<ol>
<li>Forgetting everything</li>
<li>Taking scattered and very detailed notes on multiple devices, apps, and paper</li>
<li>Improving during my studies with OneNote, where notes related to work or study go into separate notebooks (no personal notes yet).</li>
<li>Starting to create a personal notebook for travels, personal research, etc.</li>
<li>There is still a lot of confusion about:<br>
1. where to store my notes<br>
2. changing of the folder structure<br>
3. finding older notes was complex and rarely happened</li>
<li>Switching to <strong>[[Obsidian]]</strong> with a new open format and a different spirit and capabilities.</li>
<li>Starting my <strong>[[Second Brain]]</strong><br>
1. Constantly updating my long-time wealth of personal knowledge by adding notes about my health, journals, cooking, books I read, and everything related to my life.<br>
2. I started to connect notes and refine my system so that I can confidently find them later in life, the moment I need them.</li>
<li>Starting to use [[Vim]] and, more importantly, its <strong>[[Vim Language (and Motions)|motions]]</strong> for fast and effortless note-taking.</li>
<li>Sharing them publicly with [[Quartz - Publish Obsidian Vault|Quartz]].</li>
</ol>
<p>Find the full breakdown in <a href="/blog/obsidian-note-taking-workflow/" rel="">My Obsidian Note-Taking Workflow</a>.</p>
</blockquote>
<h2 id="wrapping-up">Wrapping up</h2>
<p>I wanted to write much more about the <strong>art of writing</strong>: how to [[Writing|write]], [[How to Write Well]], and the technique behind it. But as this article is already very long, I will save that for another distilled blog post in the future. You can already follow the links above, where I have written a lot, just not in this distilled blog format.</p>
<p>Maybe one day I will write a book about it; I have so much more to say 🙂. What do you think, would you read it? 🙃</p>
<p>Now I want to leave you with some <strong>unexpected impacts</strong> that writing had on me, and how to get started with blogging:</p>
<ul>
<li>My articles and website focus on data engineering, but my most successful posts (in views and virality) are topics about [[Obsidian]], [[Vim Motions for Writers|vim]], and philosophy</li>
<li>That was a surprise, but now I get it: these were topics close to my heart, something I poured many years into and distilled into a single piece of writing</li>
<li>Career impact: I got higher salaries because I was considered known &ldquo;world-wide&rdquo; through my blog</li>
<li>People knew my writing, and people tend to like you when you give your writing away for free</li>
<li>When I started my own company, I basically didn&rsquo;t have to sell: after being online for so long, people know my writing, my principles, and even my life through my [[Second Brain]]</li>
<li>The total set of articles is what matters: when people come back to you, that&rsquo;s the [[compounding]] effect</li>
</ul>
<blockquote>
<p>[!question]  The Elephant in the Room: AI<br>
You might ask: what about AI content, isn&rsquo;t that a reason not to start writing? I&rsquo;d say no, for the same reason writing was good before AI, before books, and before time.<br>
It&rsquo;s good for you: to calm down, to let out all your thoughts, and to distill, learn, and remember the things that are very important to you.</p>
<p>[[Writing Manually]], as I like to call it, is also my joy of writing: if you enjoy it, I enjoy it. It&rsquo;s hard, it&rsquo;s a challenge, and that is where I get great satisfaction. It&rsquo;s like chess: computers are much better, but we still play. It&rsquo;s also for the love of words and communication, because <strong>writing is communication</strong>, and to prevent more &ldquo;AI Slop&rdquo; from being created.</p>
<p>Read much more on [[Will AI replace Humans]], where I share my latest on that topic.</p>
</blockquote>
<p>I have so much more to say, and a lot of it is in my second brain, so feel free to browse and explore my <a href="/brain" rel="">second brain</a>. Use the backlinks and the graph to explore more. I even added a <a href="https://explore.ssp.sh" target="_blank" rel="noopener noreffer">semantic search</a>, so you can find hidden connections in my public brain, on topics that might interest you, or on the very topic you just read about.</p>
<p>A related book that helped me a ton is <a href="https://www.goodreads.com/book/show/60135094-the-pathless-path" target="_blank" rel="noopener noreffer">The Pathless Path</a> by Paul Millerd (I share more in <a href="/books" rel="">Book Recommendations and Notes</a>, and I also wrote about <a href="https://pathless.ssp.sh/" target="_blank" rel="noopener noreffer">finding mine</a>). This is the Tim Ferriss 4-Hour Workweek book, updated for today, and it inspired me to take the step of going full-time as a <a href="https://www.ssp.sh/services" target="_blank" rel="noopener noreffer">freelance technical writer</a> and making writing my job.</p>
<p>What if you want to get started yourself? Read my interview on &ldquo;Write that Blog&rdquo; where I share more suggestions about starting your own blog. But generally, [[writing is hard]], just get started.</p>
<p>Have a note-taking app (only one, based on an open format, and available offline), and write things down. Save important moments in your life, blessings people share with you, insights from books you read, etc.</p>
<p>Follow the mantra of [[Learn in Public|Learning in Public]]. Use the feedback loop, let&rsquo;s share and learn together. And know that over time, all your notes and personal knowledge will compound, like money does if you invest it cleverly.</p>
<blockquote>
<p>[!note] Maybe the easiest way to get started: just an email converted to a blog</p>
<p><a href="https://www.hey.com/world/" target="_blank" rel="noopener noreffer">Hey World</a> has this feature integrated into their email: you write an email as you normally would, just to a different recipient, and once sent, it is online as a normal blog post. A very easy way to get started.</p>
</blockquote>
<p>I&rsquo;m leaving some links to <strong>tips and tricks for unblocking</strong> and finding a good rhythm.</p>
<ul>
<li>[[Coffee Break Rhythm]]: Using location pressure as a productivity forcing function. For me, in summer, it&rsquo;s moving every 1.5-2 hours from coffee shop to coffee shop, creating pressure to finish up before leaving. It tricks the brain out of thinking &ldquo;Oh, I still have all day&rdquo; and then procrastinating. It is the opposite of [[Productive Procrastination]], where we let go on purpose to get insights we wouldn&rsquo;t have gotten otherwise.</li>
<li>Use the doubts, the signs that you can&rsquo;t write today. Take a walk instead; best would be to walk every day. I call it [[Productive Procrastination]]. Many see it as a bad thing, but it&rsquo;s unavoidable, and I, like others, believe it&rsquo;s our body, our gut, telling us something. E.g. Tim Urban from <a href="https://waitbutwhy.com/" target="_blank" rel="noopener noreffer">Wait But Why</a> is also big on procrastination and says the same: he sometimes hates procrastinating, but that&rsquo;s how his brain works, and where he gets insights he wouldn&rsquo;t have gotten otherwise.</li>
<li>[[Ultradian Rhythm]]: Know that the first 10 minutes of a 90-minute deep work block are always going to be hard. That&rsquo;s okay. Remind yourself of this when starting is hard.</li>
<li>Embrace the <a href="https://www.ssp.sh/blog/owning-things-attention/" target="_blank" rel="noopener noreffer">New Luxury of Boredom</a>: I feel we are at a turning point. We, the people, want to own things, want distraction-free experiences, and above all, want tools that benefit us, not the pockets of large companies. There are more and more stories of people <a href="https://www.youtube.com/watch?v=c3oXoF9XW_Q&amp;ref=ssp.sh" target="_blank" rel="noopener noreffer">using old iPods</a> for music and buying their music, typing on a typewriter solely for writing (like I do with my <a href="https://ssp.sh/brain/distract-free-typewriter/" target="_blank" rel="noopener noreffer">distraction-free typewriter</a>), <a href="https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0?ref=ssp.sh" target="_blank" rel="noopener noreffer">leaving the cloud</a>, or just using <a href="https://ssp.sh/brain/local-first/" target="_blank" rel="noopener noreffer">local first</a> products like Obsidian or DuckDB. Devices that are <a href="https://ssp.sh/brain/uni-taskers/" target="_blank" rel="noopener noreffer">uni-taskers</a>, doing one thing well.</li>
</ul>
]]></description>
</item>
<item>
    <title>Git for Data Applied: Comparing Git-like Tools That Separate Metadata from Data</title>
    <link>https://www.ssp.sh/blog/git-for-data-tools/</link>
    <pubDate>Wed, 04 Mar 2026 00:08:08 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/git-for-data-tools/</guid><enclosure url="https://www.ssp.sh/blog/git-for-data-tools/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>This post continues from <a href="/blog/git-for-data-theory" rel="">Part 1</a>, where we learned what git for data is, how the architecture and use cases work, and how you can achieve git-like functionality with different approaches. The key is to avoid moving data as much as possible: keep state that can be referenced and rolled back to, while saving cost by not duplicating all data every time you create a new branch.</p>
<p>Now it&rsquo;s time to see what Git-like tools for data are out there, and how they actually work in practice. Part 2 dives into the tools and implementations. We&rsquo;ll examine LakeFS, Dolt, Nessie, MotherDuck, Bauplan, and more, exploring how they work under the hood. Each tool takes a different approach to the same fundamental challenge: enabling Git-like workflows without copying petabytes of data.</p>
<p>The key insight from Part 1 was that all these tools separate metadata from data, using techniques like copy-on-write and pointer manipulation. But the devil is in the details. Some tools version entire data lakes, others focus on databases. Some support full merge workflows, others prioritize instant forking. Understanding these trade-offs will help you choose the right solution for your stack.</p>
<p>There will be gaps, and implementations are changing fast, so take it with a grain of salt. But this should give you a good overview of what&rsquo;s out there, and help you invest more time in the ones that fit your use case best.</p>
<p>Let&rsquo;s get into it.</p>
<h2 id="git-like-tools-overview">Git-like Tools: Overview</h2>
<p>There are many tools out there. Some have been around for years, while others are rather new. We compare them and see what each of them has to offer.</p>
<h3 id="comparison-overview">Comparison Overview</h3>
<p>The overview below serves as a summary. We will go into more detail, with each tool getting one short chapter with a showcase of features and application use cases.</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Storage Type</th>
          <th>Primary Use Case</th>
          <th>Branching</th>
          <th>Cloning</th>
          <th>Merging</th>
          <th>Snapshot/Time Travel</th>
          <th>Rollback</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/treeverse/lakeFS" target="_blank" rel="noopener noreffer"><strong>LakeFS</strong></a></td>
          <td>Data Lake</td>
          <td>Version control for data lakes</td>
          <td>Full</td>
          <td>Via branching (zero-copy)</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><a href="https://github.com/dolthub/dolt" target="_blank" rel="noopener noreffer"><strong>Dolt</strong></a></td>
          <td>Database (SQL)</td>
          <td>Versioned SQL database</td>
          <td>Full</td>
          <td>Yes (copy-on-write)</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><a href="https://github.com/projectnessie/nessie" target="_blank" rel="noopener noreffer"><strong>Nessie</strong></a></td>
          <td>Data Lake</td>
          <td>Catalog-level versioning</td>
          <td>Full</td>
          <td>Yes (zero-copy)</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><a href="https://www.bauplanlabs.com" target="_blank" rel="noopener noreffer"><strong>Bauplan</strong></a></td>
          <td>Data Lake</td>
          <td>Versioned pipelines</td>
          <td>Data-level</td>
          <td>Yes (zero-copy)</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><a href="https://motherduck.com" target="_blank" rel="noopener noreffer"><strong>MotherDuck</strong></a></td>
          <td>Data Warehouse</td>
          <td>Serverless data warehouse</td>
          <td>No branching</td>
          <td>Zero-copy clones (differential storage)</td>
          <td>No</td>
          <td>Configurable (named snapshots indefinitely)</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><a href="https://github.com/duckdb/ducklake" target="_blank" rel="noopener noreffer"><strong>DuckLake</strong></a></td>
          <td>Data Lake</td>
          <td>SQL-native lakehouse</td>
          <td>No</td>
          <td>Via snapshots (zero-copy)</td>
          <td>No</td>
          <td>Yes (unlimited snapshots)</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><a href="https://github.com/neondatabase/neon" target="_blank" rel="noopener noreffer"><strong>Neon</strong></a></td>
          <td>Database (SQL)</td>
          <td>Branching SQL database</td>
          <td>Full</td>
          <td>Yes (copy-on-write)</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<p><em>It&rsquo;s by no means complete, but it shows the most dominant players.</em></p>
<p>Further analysis of the OSS ecosystem of git for data tools and their GitHub activity tells us how healthy the repos are, as of February 2026:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th style="text-align: right">Stars</th>
          <th style="text-align: right">Forks</th>
          <th style="text-align: right">Open Issues</th>
          <th style="text-align: right">Contributors</th>
          <th>Language</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/neondatabase/neon" target="_blank" rel="noopener noreffer">Neon</a></td>
          <td style="text-align: right">21,006</td>
          <td style="text-align: right">890</td>
          <td style="text-align: right">3,040</td>
          <td style="text-align: right">159</td>
          <td>Rust</td>
      </tr>
      <tr>
          <td><a href="https://github.com/dolthub/dolt" target="_blank" rel="noopener noreffer">Dolt</a></td>
          <td style="text-align: right">19,692</td>
          <td style="text-align: right">615</td>
          <td style="text-align: right">490</td>
          <td style="text-align: right">125</td>
          <td>Go</td>
      </tr>
      <tr>
          <td><a href="https://github.com/treeverse/lakeFS" target="_blank" rel="noopener noreffer">lakeFS</a></td>
          <td style="text-align: right">5,130</td>
          <td style="text-align: right">427</td>
          <td style="text-align: right">438</td>
          <td style="text-align: right">114</td>
          <td>Go</td>
      </tr>
      <tr>
          <td><a href="https://github.com/duckdb/ducklake" target="_blank" rel="noopener noreffer">DuckLake</a></td>
          <td style="text-align: right">2,438</td>
          <td style="text-align: right">140</td>
          <td style="text-align: right">79</td>
          <td style="text-align: right">35</td>
          <td>C++</td>
      </tr>
      <tr>
          <td><a href="https://github.com/projectnessie/nessie" target="_blank" rel="noopener noreffer">Nessie</a></td>
          <td style="text-align: right">1,406</td>
          <td style="text-align: right">171</td>
          <td style="text-align: right">156</td>
          <td style="text-align: right">159</td>
          <td>Java</td>
      </tr>
  </tbody>
</table>
<p>And community responsiveness, based on <a href="https://ossinsight.io" target="_blank" rel="noopener noreffer">ossinsight.io</a> for the latest available month (click the links in the table below for deeper insight into each repository):</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th style="text-align: right">PR Merge Time (p50)</th>
          <th style="text-align: right">Issue First Response (p50)</th>
          <th style="text-align: right">Total Commits</th>
          <th style="text-align: right">Total PR Creators</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://ossinsight.io/analyze/neondatabase/neon" target="_blank" rel="noopener noreffer">Neon</a></td>
          <td style="text-align: right">-</td>
          <td style="text-align: right">-</td>
          <td style="text-align: right">71,756</td>
          <td style="text-align: right">100</td>
      </tr>
      <tr>
          <td><a href="https://ossinsight.io/analyze/dolthub/dolt" target="_blank" rel="noopener noreffer">Dolt</a></td>
          <td style="text-align: right">~0.5 hours</td>
          <td style="text-align: right">~40 hours</td>
          <td style="text-align: right">31,807</td>
          <td style="text-align: right">99</td>
      </tr>
      <tr>
          <td><a href="https://ossinsight.io/analyze/treeverse/lakeFS" target="_blank" rel="noopener noreffer">lakeFS</a></td>
          <td style="text-align: right">~6 hours</td>
          <td style="text-align: right">~23 hours</td>
          <td style="text-align: right">24,956</td>
          <td style="text-align: right">178</td>
      </tr>
      <tr>
          <td><a href="https://ossinsight.io/analyze/duckdb/ducklake" target="_blank" rel="noopener noreffer">DuckLake</a></td>
          <td style="text-align: right">~45 hours</td>
          <td style="text-align: right">~55 hours</td>
          <td style="text-align: right">351</td>
          <td style="text-align: right">27</td>
      </tr>
      <tr>
          <td><a href="https://ossinsight.io/analyze/projectnessie/nessie#overview" target="_blank" rel="noopener noreffer">Nessie</a></td>
          <td style="text-align: right">~750 hours</td>
          <td style="text-align: right">&lt;1 hour (bot-triaged)</td>
          <td style="text-align: right">13,464</td>
          <td style="text-align: right">77</td>
      </tr>
  </tbody>
</table>
<p><em>Note: All data from GitHub API, Feb 2026. Github Activity Chart. See also <a href="https://www.star-history.com/#treeverse/lakeFS&amp;dolthub/dolt&amp;projectnessie/nessie&amp;duckdb/ducklake&amp;tigrisdata/tigris&amp;neondatabase/neon&amp;type=date&amp;legend=top-left" target="_blank" rel="noopener noreffer">GitHub Star History</a></em></p>
<p>Dolt stands out with the fastest PR merge times (~30 min median). lakeFS leads in total PR creators (178), reflecting a broad contributor base. Nessie&rsquo;s near-instant issue response reflects automated triage.</p>
<blockquote>
<p>[!note] How Do They Work?</p>
<p>While Git versions code through file snapshots and diffs, data tools must handle the actual data, ideally without copying entire datasets. Each tool solves this challenge differently, but they share a common approach: <strong>separating metadata from data</strong>.</p>
<p>Instead of duplicating data, they track pointers and references, enabling instant branching/cloning and zero-copy operations.</p>
<p>What usually happens without tools like this: <a href="https://www.youtube.com/watch?v=z-ATZTUgaAo" target="_blank" rel="noopener noreffer">Testing in Production</a></p>
<p>Find more insight about the architecture and behind the scenes in Part 1, <a href="/blog/git-for-data-theory" rel="">Branch, Test, Deploy: A Git-Inspired Approach for Data</a>.</p>
</blockquote>
<h2 id="git-like-tools-break-down">Git-like Tools: Break down</h2>
<p>Let&rsquo;s get started with the tools and see their features and how they work, grouped into three categories: data lake based, transactional and relational databases, and analytical databases.</p>
<h3 id="data-lake-versioning-object-storage">Data Lake Versioning (Object Storage)</h3>
<p>Data lake versioning tools sit between the compute engine and the object storage (S3, GCS, Azure Blob), leaving you free to query with whatever engine you prefer: Trino, Spark, DuckDB, etc.</p>
<h4 id="lakefs">LakeFS</h4>
<p>LakeFS is one of the first tools to bring git-like versioning to object-storage-based data lakes. Its core approach is a metadata layer on top of an object store such as a data lake (hence &ldquo;lake&rdquo; in the name), with immutable data and a logical-to-physical address mapping.</p>
<p>It separates the data (<code>data/</code>, stored at random physical addresses) from its metadata (<code>_lakefs/</code>), which includes range files, meta-range files, and commit information.</p>
<p>When you upload <code>allstar_games_stats.csv</code> to branch <code>main</code>, lakeFS generates a random physical address like <code>s3://bucket/data/gp0n1l7d77pn0cke6jjg/cg6p50nd77pn0cke6jk0</code>. This ensures immutability: files are never overwritten.</p>
<p>LakeFS operates as an S3-compatible gateway, intercepting read/write operations and managing versioning transparently. Applications interact with it like normal object storage while getting full Git semantics underneath.</p>
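<p>To make the logical-to-physical mapping more tangible, here is a minimal toy sketch in Python. It is not lakeFS code: the class <code>TinyVersionedStore</code> and its methods are hypothetical, and a plain dictionary stands in for the range and meta-range files. It only illustrates the idea that writes go to fresh random addresses while branches are cheap pointer maps.</p>

```python
import uuid


class TinyVersionedStore:
    """Toy model (hypothetical, not lakeFS): immutable objects plus per-branch pointer maps."""

    def __init__(self):
        self.objects = {}             # physical address to bytes, never overwritten
        self.branches = {"main": {}}  # branch name to {logical path: physical address}

    def upload(self, branch, path, content):
        # Every write lands at a fresh random address, so older versions stay intact.
        address = "data/" + uuid.uuid4().hex
        self.objects[address] = content
        self.branches[branch][path] = address
        return address

    def create_branch(self, name, source="main"):
        # Zero-copy: only the pointer map is duplicated, not the stored objects.
        self.branches[name] = dict(self.branches[source])

    def read(self, branch, path):
        return self.objects[self.branches[branch][path]]


store = TinyVersionedStore()
store.upload("main", "allstar_games_stats.csv", b"v1")
store.create_branch("denmark-lakes")
store.upload("denmark-lakes", "allstar_games_stats.csv", b"v2")
```

<p>After the second upload, <code>main</code> still resolves to the first object and <code>denmark-lakes</code> to the new one. Nothing was copied when the branch was created, and nothing was overwritten when the file changed.</p>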
<p>The system implements a layered architecture:</p>
<ol>
<li><strong>Graveler</strong>: Core versioning engine managing branches, commits, and merges</li>
<li><strong>Storage Adapter</strong>: Interfaces with S3/GCS/Azure</li>
<li><strong>Hooks</strong>: Pre-merge and post-commit validation</li>
</ol>













  
<figure><a target="_blank" href="/blog/git-for-data-tools/lakefs-architecture.webp" title="">

</a><figcaption class="image-caption">LakeFS <a href="https://docs.lakefs.io/latest/understand/architecture/" target="_blank" rel="noopener noreffer">Architecture</a> overview</figcaption>
</figure>
<p>Creating a branch from the CLI is as simple as this:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">lakectl branch create lakefs://quickstart/denmark-lakes --source lakefs://quickstart/main
</span></span></code></pre></td></tr></table>
</div>
</div><p>The UI supports creating pull requests, or branches, literally like GitHub but for data.<br>













  
<figure><a target="_blank" href="/blog/git-for-data-tools/lakefs-pr.webp" title="">

</a><figcaption class="image-caption">LakeFS interface, here an example of a <a href="https://docs.lakefs.io/latest/howto/pull-requests/" target="_blank" rel="noopener noreffer">Pull Request</a></figcaption>
</figure></p>
<p>Check out their <a href="https://github.com/treeverse/lakeFS" target="_blank" rel="noopener noreffer">GitHub repo</a>, <a href="https://docs.lakefs.io/" target="_blank" rel="noopener noreffer">documentation</a>, or a practical example of <a href="https://lakefs.io/blog/write-audit-publish-with-lakefs/" target="_blank" rel="noopener noreffer">Implementing a Write-Audit-Publish (WAP) Pattern</a> for much more information.</p>
<h4 id="nessie">Nessie</h4>
<p><a href="https://github.com/projectnessie/nessie" target="_blank" rel="noopener noreffer">Nessie</a> came out of Dremio and is another early player that has been doing this for a long time. Its core approach is a transactional catalog with Git-like versioning for Apache Iceberg and Delta Lake tables.</p>
<p>Rather than versioning data files, Nessie versions the <strong>catalog metadata</strong>, the registry of tables and their locations.</p>
<p>This separation enables <strong>zero-copy branching</strong> where branches share table metadata pointers, <strong>multi-table transactions</strong> with atomic commits across multiple tables, and <strong>Git semantics</strong> such as branch, tag, merge, and cherry-pick operations.</p>
<p>Nessie leverages the immutability of modern table formats such as Iceberg:</p>
<ol>
<li><strong>Iceberg snapshots are immutable</strong>: Each table change creates new metadata.</li>
<li><strong>Nessie tracks which snapshot</strong> each branch points to.</li>
<li><strong>Branching copies pointers</strong>, not data or metadata files.</li>
<li><strong>Merging updates pointers</strong> to replay changes from source to target.</li>
</ol>
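<p>The pointer mechanics in the list above can be sketched with a few lines of plain Python. This is a toy model under my own assumptions, not Nessie&rsquo;s implementation or API: <code>TinyCatalog</code>, its methods, and the snapshot IDs are hypothetical, and the merge simply adopts the source pointers without any conflict detection.</p>

```python
class TinyCatalog:
    """Toy catalog (hypothetical): each branch maps table names to immutable snapshot IDs."""

    def __init__(self):
        self.branches = {"main": {}}

    def commit(self, branch, table, snapshot_id):
        # A table change produces a new snapshot; only the branch pointer moves.
        self.branches[branch][table] = snapshot_id

    def create_branch(self, name, source="main"):
        # Copies pointers only, no data files and no metadata files.
        self.branches[name] = dict(self.branches[source])

    def merge(self, source, target):
        # Simplified: the target adopts the source pointers, with no conflict detection.
        self.branches[target].update(self.branches[source])


catalog = TinyCatalog()
catalog.commit("main", "orders", "snap-1")
catalog.create_branch("experiment")
catalog.commit("experiment", "orders", "snap-2")

main_before_merge = catalog.branches["main"]["orders"]  # main is untouched by the experiment
catalog.merge("experiment", "main")
main_after_merge = catalog.branches["main"]["orders"]   # main now points at the new snapshot
```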
<p>Example workflow:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Create branch</span>
</span></span><span class="line"><span class="cl"><span class="n">catalog</span><span class="o">.</span><span class="n">create_branch</span><span class="p">(</span><span class="s1">&#39;experiment&#39;</span><span class="p">,</span> <span class="s1">&#39;main&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Modify table on experiment branch</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&#34;INSERT INTO catalog.experiment.orders VALUES (...)&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># This creates new Iceberg snapshot, Nessie updates experiment pointer</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Main branch unchanged - still points to original snapshot</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&#34;SELECT * FROM catalog.main.orders&#34;</span><span class="p">)</span>  <span class="c1"># Original data</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Nessie runs as a REST service with three main parts: pluggable metadata storage (such as PostgreSQL, DynamoDB, or RocksDB), data lake integration that works with any Iceberg-compatible engine (Spark, Trino, Dremio), and version control through a Git-like commit graph with branches and tags.</p>
<p>Nessie doesn&rsquo;t touch your data files. It&rsquo;s a lightweight coordination layer that brings Git semantics to your lakehouse by versioning the catalog. This makes it complementary to tools like lakeFS (which versions data) and ideal for multi-table transactional workflows. Read more on <a href="https://github.com/projectnessie/nessie" target="_blank" rel="noopener noreffer">GitHub</a>.</p>
<h4 id="bauplan">Bauplan</h4>
<p>Similar to LakeFS, Bauplan calls itself the programmable data lake. It is a rather new, code-native platform for versioned pipelines: a Python-first serverless lakehouse built on Apache Iceberg and initially optimized for ML. It&rsquo;s not open source.</p>
<p>Bauplan treats your data lake as a Git repository where:</p>
<ul>
<li><strong>Data branches</strong> are first-class citizens, not just pipeline configs.</li>
<li>Every pipeline execution is a commit with full lineage.</li>
<li>All tables use Apache Iceberg format (Delta Lake compatible).</li>
</ul>













  
<figure><a target="_blank" href="/blog/git-for-data-tools/bauplan2.webp" title="">

</a><figcaption class="image-caption">Architectural overview from <a href="https://www.bauplanlabs.com/" target="_blank" rel="noopener noreffer">Bauplan Website</a></figcaption>
</figure>
<p>Creating an isolated branch with new snapshots of Iceberg tables from the Python client is as simple as this:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">client</span><span class="o">.</span><span class="n">create_branch</span><span class="p">(</span><span class="s1">&#39;experiment&#39;</span><span class="p">)</span>  <span class="c1"># Instant, zero data copying</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>It supports merging, with the merge logic verified using <a href="https://alloytools.org/" target="_blank" rel="noopener noreffer">Alloy</a> model checking:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">client</span><span class="o">.</span><span class="n">merge_branch</span><span class="p">(</span><span class="n">source</span><span class="o">=</span><span class="s1">&#39;experiment&#39;</span><span class="p">,</span> <span class="n">target</span><span class="o">=</span><span class="s1">&#39;main&#39;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>A merge integrates one branch&rsquo;s commits into another branch. Bauplan uses Alloy, a lightweight model checker, to stress-test the core logic behind merging (and also behind branching and commits).</p>
<p>The merge operation tries to detect conflicts at the table level, performs three-way merges for compatible changes, and creates merge commits preserving lineage. Find more info on <a href="https://www.bauplanlabs.com/post/git-for-data-formal-semantics-of-branching-merging-and-rollbacks-part-1" target="_blank" rel="noopener noreffer">Git-for-Data Semantics: Safe Branching &amp; Merging at Scale</a> or their implementation of the <a href="https://www.bauplanlabs.com/post/write-audit-publish-ship-data-safely-move-faster" target="_blank" rel="noopener noreffer">WAP pattern</a>.</p>
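<p>To illustrate what a table-level three-way merge means, here is a small sketch in Python. It is not Bauplan&rsquo;s algorithm: the function <code>three_way_merge</code>, the table names, and the snapshot IDs are hypothetical, and each branch is reduced to a dictionary that maps a table to its current snapshot. A table only counts as a conflict when both branches changed it differently since the common base.</p>

```python
def three_way_merge(base, source, target):
    """Merge table-to-snapshot maps (hypothetical model); return (merged, conflicts)."""
    merged, conflicts = {}, []
    for table in sorted(set(base) | set(source) | set(target)):
        b, s, t = base.get(table), source.get(table), target.get(table)
        if s == t:      # both sides agree, or neither side changed the table
            winner = s
        elif s == b:    # only the target changed the table
            winner = t
        elif t == b:    # only the source changed the table
            winner = s
        else:           # both sides changed the same table differently
            conflicts.append(table)
            winner = t  # keep the target version until the conflict is resolved
        if winner is not None:
            merged[table] = winner
    return merged, conflicts


base = {"orders": "s1", "users": "s1"}
source = {"orders": "s2", "users": "s1", "events": "s1"}  # experiment branch
target = {"orders": "s1", "users": "s3"}                  # main branch
merged, conflicts = three_way_merge(base, source, target)

# Both branches rewrote the same table, so this one is reported as a conflict.
_, clash = three_way_merge({"orders": "s1"}, {"orders": "s2"}, {"orders": "s3"})
```

<p>In the first call the changes are compatible, so the merged state takes <code>orders</code> and <code>events</code> from the experiment branch and <code>users</code> from main. In the second call both sides changed <code>orders</code>, which is the case a real system has to surface instead of silently picking a winner.</p>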
<p>Bauplan brings Git&rsquo;s full semantic model with branch, merge, commit, and revert to lakehouse data while maintaining compatibility with standard Iceberg tables accessible from MotherDuck, Snowflake, Databricks, or Trino.</p>
<blockquote>
<p>[!tip] Software Modeling with Alloy</p>
<p>I hadn&rsquo;t heard of Alloy before. It&rsquo;s a tool for modeling software, not data, and it&rsquo;s used for a wide range of applications, from finding holes in security mechanisms to designing telephone switching networks. And now for Git for data with Bauplan.</p>
</blockquote>
<blockquote>
<p>[!note] New Whitepaper Out</p>
<p>After this article was written, Bauplan released a new whitepaper on <a href="https://arxiv.org/pdf/2602.02335" target="_blank" rel="noopener noreffer">Building a Correct-by-Design Lakehouse</a>. It looks at pipeline boundaries, Git-like data versioning for review and reproducibility, and transactional runs that guarantee pipeline-level atomicity.</p>
</blockquote>
<h3 id="transactional-and-oltp-databases">Transactional and OLTP Databases</h3>
<p>These are row-oriented, ACID-compliant databases. Here, Git-like versioning applies mostly to application data such as user records, orders, and schemas.</p>
<p>Supabase, Neon and Dolt are interesting because these are not data lakes, not based on object storage, and not analytical databases, but relational databases.</p>
<h4 id="supabase">Supabase</h4>
<p><a href="https://supabase.com/docs" target="_blank" rel="noopener noreffer">Supabase</a>&rsquo;s core approach is full instance branching. Each branch is a completely isolated Postgres database with the entire Supabase stack (Auth, Storage, Realtime, Edge Functions).</p>
<p>Supabase branches create <strong>separate environments</strong> that spin off from your main project, allowing you to test changes like new configurations, database schemas, or features without affecting production.</p>
<p>It works by creating a Git branch and opening a pull request. Supabase automatically launches a Preview Branch and runs migrations from the repository&rsquo;s migrations directory. Each branch gets a dedicated Postgres instance with a unique connection string and APIs, isolating them from production and other branches.</p>
<p>Creating a branch via GitHub integration:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Automatic with GitHub integration enabled</span>
</span></span><span class="line"><span class="cl">git checkout -b feature/new-reports
</span></span><span class="line"><span class="cl">git push origin feature/new-reports
</span></span><span class="line"><span class="cl"><span class="c1"># Supabase automatically creates preview branch when PR is opened</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Or via the CLI:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">supabase branches create feature-branch --project-ref your-project
</span></span></code></pre></td></tr></table>
</div>
</div><p>While the PR is open, migrations in the repository&rsquo;s migrations folder run incrementally on each commit, allowing you to verify schema changes on existing seed data. When you merge the PR, those migrations automatically apply to production.</p>
<p>As each branch is a new Postgres instance created from scratch, the approach is conceptually simple, but branches start empty: production data isn&rsquo;t copied, so each branch has to be seeded with test data. Each branch also incurs its own compute and storage costs. Read more on <a href="https://supabase.com/docs/guides/deployment/branching" target="_blank" rel="noopener noreffer">Branching Supabase Docs</a>.</p>
<p>Ideal for full-stack development where you need the entire backend stack (database + auth + storage + functions) to test features end-to-end.</p>
<h4 id="neon">Neon</h4>
<p><a href="https://neon.com/docs/" target="_blank" rel="noopener noreffer">Neon</a> is a serverless Postgres platform (now part of Databricks) whose core approach is <strong>copy-on-write storage-level branching</strong>. Unlike Supabase, which spins up a full new instance, Neon <a href="https://neon.com/docs/introduction/branching" target="_blank" rel="noopener noreffer">branches</a> at the storage layer. That makes branch creation instant regardless of database size, and each branch includes the actual data.</p>
<p>Each branch is a new timeline in Neon&rsquo;s custom storage engine. No data is physically copied. The branch simply starts from a pointer to the parent&rsquo;s state at a specific LSN (log sequence number). Pages only diverge when writes happen, so you&rsquo;re billed only for the delta.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Create a branch from the CLI</span>
</span></span><span class="line"><span class="cl">neon branches create --name feature/user-auth
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Branch from a specific point in time</span>
</span></span><span class="line"><span class="cl">neon branches create --name recovery --parent 2025-01-15T10:00:00Z
</span></span></code></pre></td></tr></table>
</div>
</div><p>Neon also supports <strong><a href="https://neon.com/docs/ai/ai-database-versioning" target="_blank" rel="noopener noreffer">snapshots</a></strong> (named, immutable point-in-time saves, like git tags) and <strong>rollback</strong> via <code>finalize_restore: true</code>, which restores a snapshot onto the active branch in-place while preserving the stable connection string. There&rsquo;s no reconfiguration needed. For safe experimentation, <code>finalize_restore: false</code> creates a temporary preview branch instead.</p>
<p>The key limitation: <strong>Neon has no merge support</strong>. Branches diverge but can&rsquo;t be reconciled automatically. Changes are applied back to production using standard migration tools.</p>
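<p>The copy-on-write idea behind these branches can be sketched as a toy model in plain Python. This is only an illustration, not Neon&rsquo;s storage engine: the <code>Timeline</code> class and its methods are hypothetical names. A branch stores nothing but a pointer to its parent plus the LSN it forked at, and only pages written afterwards take up space on the branch.</p>

```python
class Timeline:
    """Toy copy-on-write branch: a log of page writes plus a parent pointer."""

    def __init__(self, parent=None, parent_lsn=0):
        self.parent = parent          # timeline this branch was forked from
        self.parent_lsn = parent_lsn  # parent state is visible only up to this LSN
        self.lsn = parent_lsn         # last log sequence number on this timeline
        self.writes = []              # append-only list of (lsn, page_no, data)

    def write(self, page_no, data):
        self.lsn += 1
        self.writes.append((self.lsn, page_no, data))

    def read(self, page_no, as_of=None):
        as_of = self.lsn if as_of is None else as_of
        # The newest own write that is visible at as_of wins.
        for lsn, page, data in reversed(self.writes):
            if page == page_no and as_of >= lsn:
                return data
        if self.parent is None:
            raise KeyError(page_no)
        # Otherwise fall back to the parent, frozen at the fork point.
        return self.parent.read(page_no, as_of=min(as_of, self.parent_lsn))

    def branch(self, at_lsn=None):
        # No pages are copied: a branch is just a pointer and an LSN.
        return Timeline(parent=self, parent_lsn=self.lsn if at_lsn is None else at_lsn)


main = Timeline()
main.write(1, "users v1")   # main is now at LSN 1
main.write(2, "orders v1")  # LSN 2
dev = main.branch()         # fork at LSN 2
dev.write(1, "users v2")    # only this page diverges on dev
main.write(2, "orders v2")  # later writes on main stay invisible to dev

print(dev.read(1))      # users v2
print(dev.read(2))      # orders v1
print(main.read(1))     # users v1
print(len(dev.writes))  # 1
```

<p>The model also shows why merging is a separate problem: both timelines keep appending to their own logs, and nothing here reconciles them.</p>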
<p>Ideal for database-focused workflows where you want instant, full-data branches with production-like data out of the box, and don&rsquo;t need the full backend stack.</p>
<h4 id="dolt-git--mysql">Dolt: Git + MySQL</h4>
<p><a href="https://github.com/dolthub/dolt" target="_blank" rel="noopener noreffer">Dolt</a> is a SQL database that you can fork, clone, branch, merge, push, and pull just like a Git repository. It&rsquo;s a MySQL-compatible database and is fully open-source. Dolt&rsquo;s core approach is a SQL database where every row is versioned, combining Git&rsquo;s commit graph with MySQL&rsquo;s query interface.</p>
<p>Dolt stores data in a <strong>content-addressed graph</strong> using <a href="https://docs.dolthub.com/architecture/storage-engine/prolly-tree" target="_blank" rel="noopener noreffer">Prolly Trees</a>, a novel data structure that enables cell-level version history, efficient structural sharing between versions, and fast diffs and merges.</p>
<p>Every database operation can be committed with:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">employees</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Alice&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">50000</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">CALL</span><span class="w"> </span><span class="n">DOLT_COMMIT</span><span class="p">(</span><span class="s1">&#39;-am&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Add Alice to payroll&#39;</span><span class="p">);</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>The commit creates a snapshot of the entire database state at that moment, stored in the commit graph just like Git. Unlike traditional databases, you can <strong>diff any two versions</strong>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- See what changed between commits
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">DOLT_DIFF</span><span class="p">(</span><span class="s1">&#39;main&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;feature-branch&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;employees&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- Show cell-level changes
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">DOLT_COMMIT_DIFF_employees</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">from_commit</span><span class="o">=</span><span class="s1">&#39;abc123&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">to_commit</span><span class="o">=</span><span class="s1">&#39;def456&#39;</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>This enables <strong>cell-level audit trails</strong> with diffs showing which rows were added/deleted/modified, which cells changed with their before/after values, and who made the change via commit metadata.</p>
<p>Dolt implements Git commands almost literally. You can run <code>dolt</code> with any of these commands: <code>branch feature-123</code>, <code>checkout feature-123</code>, <code>add .</code>, <code>commit -m &quot;Add new customers&quot;</code>, <code>push origin feature-123</code>, <code>checkout main</code>, <code>merge feature-123</code>.</p>
<p>You can even push/pull to DoltHub (like GitHub for databases) or run Dolt as a MySQL replica for existing applications.</p>
<p>Dolt uses <strong>copy-on-write with structural sharing</strong> where unchanged rows are shared between branches via pointers, and modified rows create new leaf nodes in the Prolly Tree.</p>
<p>This means cloning isn&rsquo;t &ldquo;free&rdquo; like with lakeFS, but it provides true database semantics with ACID transactions.</p>
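<p>Here is a small sketch of content addressing and structural sharing in plain Python. It is a deliberate simplification and not a Prolly Tree: the flat <code>chunk_store</code> dictionary and the helpers <code>put_row</code>, <code>commit</code>, and <code>diff</code> are hypothetical names for illustration. The point it shows is that unchanged rows hash to the same address, so two versions share them, and a diff only needs to look at rows whose hashes differ.</p>

```python
import hashlib
import json

chunk_store = {}  # content hash -> row, shared by every version


def put_row(row):
    """Store a row under the hash of its content and return that hash."""
    encoded = json.dumps(row, sort_keys=True).encode()
    digest = hashlib.sha256(encoded).hexdigest()
    chunk_store[digest] = row  # identical rows collapse into one entry
    return digest


def commit(rows):
    """A version is just a map from primary key to row hash."""
    return {pk: put_row(row) for pk, row in rows.items()}


def diff(old, new):
    """Return (pk, column, before, after) for every changed cell."""
    changes = []
    for pk in sorted(set(old) | set(new)):
        if old.get(pk) == new.get(pk):
            continue  # same hash means same content, no need to read the row
        before = chunk_store.get(old.get(pk), {})
        after = chunk_store.get(new.get(pk), {})
        for col in sorted(set(before) | set(after)):
            if before.get(col) != after.get(col):
                changes.append((pk, col, before.get(col), after.get(col)))
    return changes


v1 = commit({1: {"name": "Alice", "salary": 50000}, 2: {"name": "Bob", "salary": 60000}})
v2 = commit({1: {"name": "Alice", "salary": 55000}, 2: {"name": "Bob", "salary": 60000}})

print(diff(v1, v2))      # [(1, 'salary', 50000, 55000)]
print(len(chunk_store))  # 3
```

<p>Two versions of a two-row table end up as three stored rows, not four, because Bob&rsquo;s unchanged row is shared.</p>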
<p>There&rsquo;s much more. Read more on their <a href="https://github.com/dolthub/dolt" target="_blank" rel="noopener noreffer">GitHub</a>.</p>
<blockquote>
<p>[!note] Worth noting</p>
<p><a href="https://docs.doltgres.com" target="_blank" rel="noopener noreffer">DoltgreSQL</a>, the Postgres-compatible version of Dolt, reached Beta in 2025 and is available on Hosted Dolt. If your stack is Postgres-based, DoltgreSQL brings the same Git-like versioning semantics without requiring a MySQL migration.</p>
</blockquote>
<h3 id="analytical-databases--warehouses">Analytical Databases &amp; Warehouses</h3>
<p>These are OLAP-style databases optimized for read-heavy analytical queries.</p>
<h4 id="motherduck">MotherDuck</h4>
<p>MotherDuck, as a cloud data warehouse, implements versioning differently from dedicated Git-for-data tools, prioritizing operational convenience over full version control semantics. With the addition of <strong><a href="https://motherduck.com/docs/concepts/snapshots/" target="_blank" rel="noopener noreffer">named snapshots</a></strong>, it gets even closer to Git-like semantics.</p>
<p>It offers two types of snapshots. <strong>Automatic snapshots</strong> are created continuously in the background (roughly every minute when no writes are active). Their retention is governed by <code>SNAPSHOT_RETENTION_DAYS</code>, which defaults to 7 days and is configurable up to 90 days on the Business plan. They provide point-in-time recovery without any manual intervention.</p>
<p><strong>Named snapshots</strong> are the ones you create explicitly with <code>CREATE SNAPSHOT</code>. They are not subject to garbage collection and persist indefinitely, even if the source database is deleted. Think of them as <strong>Git tags for your database</strong>, a permanent bookmark of a known-good state you can always return to.</p>
<p>The git analogy maps well:</p>
<ol>
<li><strong><code>CREATE SNAPSHOT</code></strong> → <code>git tag</code>: bookmark a known-good state</li>
<li><strong><code>CREATE DATABASE ... FROM</code></strong> → <code>git checkout -b</code>: isolated environment from a snapshot</li>
<li><strong><code>ALTER DATABASE SET SNAPSHOT TO</code></strong> → <code>git reset --hard</code>: roll back to a previous state</li>
<li><strong><code>UNDROP DATABASE</code></strong> → recovering a deleted branch</li>
</ol>
<p>Combined with <a href="https://motherduck.com/docs/sql-reference/motherduck-sql-reference/create-database/" target="_blank" rel="noopener noreffer">zero-copy cloning</a> and <a href="https://motherduck.com/docs/key-tasks/sharing-data/sharing-overview/" target="_blank" rel="noopener noreffer">database sharing</a>, this enables practical git-like workflows. MotherDuck doesn&rsquo;t support Git-style merging, but <a href="https://motherduck.com/docs/sql-reference/motherduck-sql-reference/copy-database-overwrite/" target="_blank" rel="noopener noreffer"><code>COPY FROM DATABASE (OVERWRITE)</code></a> acts as a replace, somewhat like a merge without conflict resolution. Together, these pieces give you a practical branch-modify-promote workflow:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- 1. Snapshot production before changes (persists indefinitely)
</span></span></span><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">SNAPSHOT</span><span class="w"> </span><span class="s1">&#39;pre_release_v2&#39;</span><span class="w"> </span><span class="k">OF</span><span class="w"> </span><span class="n">production</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- 2. Clone from that named snapshot to an isolated dev database (instant, zero-copy)
</span></span></span><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">DATABASE</span><span class="w"> </span><span class="n">dev_branch</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">production</span><span class="w"> </span><span class="p">(</span><span class="n">SNAPSHOT_NAME</span><span class="w"> </span><span class="s1">&#39;pre_release_v2&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- Or clone from a point in time: (SNAPSHOT_TIME &#39;2026-01-28 08:00:00&#39;)
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- 3. Make and validate changes on dev_branch
</span></span></span><span class="line"><span class="cl"><span class="c1">-- ... run transforms, test queries ...
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- 4. Promote: overwrite production with dev_branch (instant, metadata-only)
</span></span></span><span class="line"><span class="cl"><span class="k">COPY</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">DATABASE</span><span class="w"> </span><span class="n">dev_branch</span><span class="w"> </span><span class="p">(</span><span class="n">OVERWRITE</span><span class="p">)</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">production</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- 5. If something goes wrong, restore from snapshot
</span></span></span><span class="line"><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">DATABASE</span><span class="w"> </span><span class="n">production</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">SNAPSHOT</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="p">(</span><span class="n">SNAPSHOT_NAME</span><span class="w"> </span><span class="s1">&#39;pre_release_v2&#39;</span><span class="p">);</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>This operates purely at the metadata layer and is nearly instantaneous. It&rsquo;s not a true merge (it&rsquo;s a full replacement, not a diff-based reconciliation), but for many data workflows where you want to validate changes in isolation before promoting them, it covers the key use case.</p>
<blockquote>
<p>[!example] Deep Dive</p>
<p>If you want to know more about using named snapshots and rolling back to a certain point in time, the blog post <a href="https://motherduck.com/blog/point-in-time-restore/" target="_blank" rel="noopener noreffer">More Control, Less Hassle: Self-Serve Recovery with Point-in-Time Restore</a> goes into more detail.</p>
</blockquote>
<h4 id="ducklake">DuckLake</h4>
<p><a href="https://ducklake.select/" target="_blank" rel="noopener noreffer">DuckLake</a> is an open lakehouse format that uses a SQL database as its metadata catalog instead of JSON/Avro manifest files. DuckLake is relatively new (its first release was in May 2025, with 1.0 around the corner), so you could also use more mature open table formats like <a href="https://github.com/apache/iceberg" target="_blank" rel="noopener noreffer">Apache Iceberg</a>, <a href="https://github.com/delta-io/delta" target="_blank" rel="noopener noreffer">Delta Lake</a> or <a href="https://github.com/apache/hudi" target="_blank" rel="noopener noreffer">Apache Hudi</a>.</p>
<p>But DuckLake is relevant for git-like workflows because:</p>
<ol>
<li><strong>Snapshots are Git commits</strong>: Every DuckLake change creates a snapshot with author, commit message, and changeset tracking. This is the closest to actual Git semantics in the data lake world.</li>
<li><strong>SQL-native metadata</strong>: Uses DuckDB/PostgreSQL/MySQL as catalog, so metadata operations are standard SQL transactions. No manifest file scanning or compaction storms like Iceberg.</li>
<li><strong>Millions of snapshots</strong>: Snapshots are just a few rows in the catalog DB. No need to proactively prune snapshots (a major operational burden with Iceberg).</li>
<li><strong>Time travel + change feed</strong>: Query any table at any version, track insertions/deletions between versions.</li>
</ol>
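<p>The points above can be sketched with nothing more than Python&rsquo;s built-in <code>sqlite3</code> module. This is not DuckLake&rsquo;s actual catalog schema: the tables <code>snapshot</code> and <code>data_file</code>, their columns, and the file names are simplified assumptions for illustration. It shows why snapshots are cheap when the catalog is a SQL database: a commit is a few rows written in one transaction, and time travel is a range query.</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE snapshot (id INTEGER PRIMARY KEY, author TEXT, message TEXT);
    CREATE TABLE data_file (
        path TEXT,
        table_name TEXT,
        begin_snapshot INTEGER,  -- first snapshot that sees this file
        end_snapshot INTEGER     -- first snapshot that no longer sees it (NULL = live)
    );
""")


def commit(author, message, add=(), remove=()):
    """Record one snapshot and its file changes in a single transaction."""
    with con:  # commits on success, rolls back on error
        snap = con.execute(
            "INSERT INTO snapshot (author, message) VALUES (?, ?)", (author, message)
        ).lastrowid
        for table_name, path in add:
            con.execute(
                "INSERT INTO data_file VALUES (?, ?, ?, NULL)", (path, table_name, snap)
            )
        for path in remove:
            con.execute(
                "UPDATE data_file SET end_snapshot = ? "
                "WHERE path = ? AND end_snapshot IS NULL",
                (snap, path),
            )
    return snap


def files_at(table_name, snap):
    """Time travel: list the files a table consists of at a given snapshot."""
    rows = con.execute(
        "SELECT path FROM data_file WHERE table_name = ? AND ? >= begin_snapshot "
        "AND (end_snapshot IS NULL OR end_snapshot > ?) ORDER BY path",
        (table_name, snap, snap),
    )
    return [path for (path,) in rows]


s1 = commit("alice", "initial load", add=[("orders", "orders-001.parquet")])
s2 = commit(
    "alice", "rewrite orders",
    add=[("orders", "orders-002.parquet")], remove=["orders-001.parquet"],
)

print(files_at("orders", s1))  # ['orders-001.parquet']
print(files_at("orders", s2))  # ['orders-002.parquet']
```

<p>Old snapshots stay queryable because nothing is deleted on commit. A removed file only gets an <code>end_snapshot</code> value.</p>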
<p><strong>With MotherDuck</strong> (fully managed):</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- Fully managed DuckLake on MotherDuck
</span></span></span><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">DATABASE</span><span class="w"> </span><span class="n">my_lake</span><span class="w"> </span><span class="p">(</span><span class="k">TYPE</span><span class="w"> </span><span class="n">DUCKLAKE</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- Or bring your own S3 bucket
</span></span></span><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">DATABASE</span><span class="w"> </span><span class="n">my_lake</span><span class="w"> </span><span class="p">(</span><span class="k">TYPE</span><span class="w"> </span><span class="n">DUCKLAKE</span><span class="p">,</span><span class="w"> </span><span class="n">DATA_PATH</span><span class="w"> </span><span class="s1">&#39;s3://my-bucket/lake/&#39;</span><span class="p">);</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><blockquote>
<p>[!example] DuckLake Example</p>
<p>See valuable examples and DuckLake workflows in <a href="https://github.com/matsonj/ducklake-workshop" target="_blank" rel="noopener noreffer">DuckLake workshop</a>.</p>
</blockquote>
<h2 id="related-data-engineering-git-like-workflows">Related Data Engineering Git-like Workflows</h2>
<p>Data storage is the most important part and, because we need to deal with state, also the hardest. But it&rsquo;s not the full picture. DataOps covers the rest.</p>
<p>Data pipelines and their code also need to be deployed on a clone or branch, so how do we do this? One example is orchestration.</p>
<h3 id="orchestration-dagster-branch-deployments">Orchestration: Dagster Branch Deployments</h3>
<p>If we look at the full picture of the data engineering lifecycle, we need more than just storing data in a git-like manner. To support the full lifecycle, it would be best to run everything in a git-like style to roll back or switch branches. It&rsquo;s great to see that orchestrator tools like Dagster and others also have this functionality included.</p>
<p>This means branching applies not only to the data but also to data pipelines, and runs can be triggered automatically. Dagster does this in its cloud solution, integrating GitHub workflows with PRs and Actions.</p>
<p>Dagster&rsquo;s core approach is lightweight staging environments, created automatically with every pull request, that branch both code <em>and</em> data. <strong><a href="https://docs.dagster.io/deployment/dagster-plus/deploying-code/branch-deployments" target="_blank" rel="noopener noreffer">Branch deployments</a></strong> deploy your branch on Dagster+ as a separate deployment. Branching the data only works if your underlying technology supports cloning: with any of the tools above that do, Dagster can clone the relevant data into the new branch deployment.</p>
<figure><a target="_blank" href="/blog/git-for-data-tools/dagster.webp" title="">

</a><figcaption class="image-caption">Branch deployment workflow showing how code branches deploy to cloned schema</figcaption>
</figure>
<p>On PR creation, it will automatically create a staging environment with a branch, launch jobs to configure the test environment including cloned data(base), and allow parameterized pipelines to test. If the tests pass, you can approve the PR, and it merges and automatically deploys to production with the right CI/CD pipeline.</p>
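<p>The data side of this flow can be sketched in a few lines of Python. This is not Dagster&rsquo;s API: the helpers <code>clone_name</code>, <code>on_pr_opened</code>, and <code>on_pr_closed</code> are hypothetical, and the generated statements assume a warehouse with MotherDuck-style <code>CREATE DATABASE ... FROM</code> cloning as shown earlier. The idea is that the branch name deterministically maps to a cloned database, so setup and teardown jobs agree on what to create and drop.</p>

```python
import re


def clone_name(branch, prefix="pr"):
    """Turn a Git branch name into a safe, lowercase database identifier."""
    slug = re.sub(r"[^a-z0-9]+", "_", branch.lower()).strip("_")
    return f"{prefix}_{slug}"


def on_pr_opened(branch, source_db="production"):
    # Statement a setup job could run when the PR is opened.
    return f"CREATE DATABASE {clone_name(branch)} FROM {source_db};"


def on_pr_closed(branch):
    # Statement a teardown job could run once the PR is merged or closed.
    return f"DROP DATABASE IF EXISTS {clone_name(branch)};"


print(on_pr_opened("feature/new-reports"))
# CREATE DATABASE pr_feature_new_reports FROM production;
print(on_pr_closed("feature/new-reports"))
# DROP DATABASE IF EXISTS pr_feature_new_reports;
```

<p>In a real setup, the pipeline code in the branch deployment would be parameterized to read from and write to that cloned database instead of production.</p>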
<p>Orchestrators and other data stack tools depend on cloning support and features such as branching for a truly isolated environment. As Nick Schrock noted in the <a href="https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309/" target="_blank" rel="noopener noreffer">Data Engineering Podcast</a>, this is similar to the challenge with Apache Spark, where testing locally is nearly impossible. Branch deployments solve this by branching the entire environment.</p>
<p>This is extremely powerful as it replaces the need to copy data locally or set up complex staging environments. You get a true production-like test environment that&rsquo;s automatically created and destroyed with your git workflow. Read more on <a href="https://docs.dagster.io/dagster-plus/managing-deployments/branch-deployments" target="_blank" rel="noopener noreffer">Dagster Branch Deployments</a>.</p>
<h3 id="ai-agents-a-branch-for-testing">AI Agents: A Branch for Testing</h3>
<p>Lastly, this also works well in the realm of AI agents that help us test based on a branch or snapshot. It&rsquo;s similar to <a href="https://git-scm.com/docs/git-worktree" target="_blank" rel="noopener noreffer">git worktree</a> for small Git repos with code, where each branch lives in a separate folder and we can work on different branches simultaneously without breaking any of the other branches or data.</p>
<p>Once we have a working branch with data <strong>included in isolation</strong>, we can send off an agent autonomously, and let it open a PR to review. This way we have a clear gateway before it goes to production, we can test it on that branch, including its data, and merge when all looks good.</p>
<p>Because each agent works on its own fork, we avoid collisions and can instantly roll back or delete a branch and start again. We also get perfect consistency, as the data is frozen and locked for the agent to work on, and clean debugging, as no other ETL data pipelines interfere.</p>
<h2 id="conclusion">Conclusion</h2>
<p>So where does this leave us? In <a href="/blog/git-for-data-theory" rel="">Part 1</a>, we established that Git for data is fundamentally harder than versioning code because we&rsquo;re managing state at massive scale. We learned about the efficiency spectrum, from metadata pointers to full copies, and why zero-copy operations matter.</p>
<p>Now, having explored the actual tools and their approaches to git-like workflows (lakeFS, Dolt, Nessie, MotherDuck, and others in production today), we know a little more about how it all works. Each tool makes different trade-offs, but they all solve the same core problem: how do you version data without copying petabytes?</p>
<p>The answer, to me: <strong>separate metadata from data</strong>. Whether it&rsquo;s lakeFS&rsquo;s random physical addresses, Dolt&rsquo;s Prolly Trees, Nessie&rsquo;s catalog pointers, MotherDuck&rsquo;s zero-copy clones, or Neon&rsquo;s branching feature, they all use clever tricks to make branching instant. Some focus on data lakes, others on databases. Some support full merge workflows, others prioritize instant forking. Your choice depends on your stack:</p>
<ul>
<li>lakeFS and Nessie excel at data lake branching with zero-copy efficiency</li>
<li>Dolt brings true Git semantics to SQL databases</li>
<li>MotherDuck offers named snapshots and zero-copy clones for cloud data warehousing, with DuckLake adding SQL-native time travel</li>
<li>Bauplan focuses on versioned pipelines and ML experiment reproducibility</li>
<li>Neon and Supabase provide branch/fork-based workflows for isolated testing</li>
</ul>
<p>The ecosystem is still evolving. Maturity varies across tools, and each works around its limitations differently to fit data into a git-like workflow. Some trade merge capabilities for instant forking. Others require infrastructure changes. The key is picking what fits your workflow and scale.</p>
<p><strong>Start small.</strong> You don&rsquo;t need to instrument your entire stack overnight. Look at your recent production incidents: which pipelines caused them? Those are your highest-risk areas. Add branching there first. Test changes on prod-like data before deploying. Build confidence through small wins, then expand.</p>
<p>We want to bring the same <strong>confidence</strong> we have with code versioning to the stateful world of data. And with tools like Dagster&rsquo;s branch deployments and emerging AI agent workflows, we&rsquo;re seeing Git-like patterns extend beyond just data storage into the full data engineering lifecycle.</p>
<p>Git-like workflows are becoming table stakes. Maybe not today or tomorrow, but with the right tools and changes in workflow we can achieve significantly better change management, testing on production data, fast rollbacks, isolated experiments, and most importantly, peace of mind when deploying changes.</p>
<p>That&rsquo;s the promise. What&rsquo;s your experience? Have you tried it? Do you run any of the above in production? I&rsquo;m curious to hear more.</p>
<h2 id="appendix">Appendix</h2>
<p>While I was writing this article back in November 2025, Tigris was an interesting database contender with Supabase-like features such as forked buckets and zero-copy clones. But by the time of publishing, the <a href="https://github.com/tigrisdata-archive/tigris" target="_blank" rel="noopener noreffer">GitHub repo</a> had been archived, so I removed it from the comparison in this article.</p>
<hr>
<pre class=""><em>Full article published at <a href="https://motherduck.com/blog/git-for-data-part-2/" target="_blank" rel="noopener noreferrer">MotherDuck.com</a> - written as part of <a href="/services">my services</a></em></pre>
]]></description>
</item>
<item>
    <title>Building an Obsidian RAG with DuckDB and MotherDuck</title>
    <link>https://www.ssp.sh/blog/obsidian-rag-duckdb-sql/</link>
    <pubDate>Fri, 13 Feb 2026 00:00:08 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/obsidian-rag-duckdb-sql/</guid><enclosure url="https://www.ssp.sh/blog/obsidian-rag-duckdb-sql/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>I always wanted a personal knowledge assistant based on my notes. One that uses Obsidian&rsquo;s backlinks and connections to surface ideas I&rsquo;ve forgotten or never thought to link together.</p>
<p>So I built one. A RAG system that runs locally with DuckDB as a <a href="/blog/vector-technologies-ai-data-stack/" rel="">vector database</a>, then syncs to MotherDuck for a serverless web app running entirely in the browser via WASM. Think of it like J.A.R.V.I.S<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> for your markdown files: search about a topic, and it shows connected notes up to two hops away, semantically similar content, and hidden connections between ideas that share no direct links.</p>
<p>In this article, I walk through how I built this and how it works, from using DuckDB&rsquo;s vector extension locally to serving embeddings through MotherDuck&rsquo;s WASM client. Along the way, you&rsquo;ll see how data engineering skills can put a large pile of Markdown notes to good use. If you want to dive straight into the code, it&rsquo;s all on GitHub at <a href="https://github.com/sspaeti/obsidian-note-taking-assistant" target="_blank" rel="noopener noreffer">Obsidian-note-taking-assistant</a>, and you can try the web app on my public notes at <a href="https://explore.ssp.sh" target="_blank" rel="noopener noreffer">Explore RAG</a>.</p>
<p>For building the web app I used Claude Code, and it came together in a few hours using the <code>plan mode</code>. This approach is powerful for any data engineer building pipelines or related work, especially when you have a clear vision of what you want. In my opinion, the big productivity boost didn&rsquo;t come only from the model getting smarter, but from something else. More on that later in the article.</p>
<p>This is how it looks. Let&rsquo;s talk about how I built it and some behind the scenes.<br>
<figure><a target="_blank" href="/blog/obsidian-rag-duckdb-sql/output3.gif" title="">

</a><figcaption class="image-caption">Short showcase of the web app, working locally or as shown here published on Vercel</figcaption>
</figure></p>
<h2 id="vision--why-i-built-this">Vision &amp; Why I Built This</h2>
<p>I have 8963 local notes (according to <code>find . -type f -name '*.md' | wc -l</code>) in my Obsidian vault. Some are very long, and many have images and PDFs attached. Wouldn&rsquo;t it be nice to resurface an insight from my own thinking a while back, a quote I forgot<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>, or something I never thought of?</p>
<p>The requirements I set myself: use Obsidian backlinks, as these are already curated and form a well-structured graph. I wanted to see notes that are multiple hops away and hard to spot without a tool, to search for non-obvious neighbors or similarities, and to surface hidden connections, both locally and online. These are especially helpful in the initial brainstorming phase of an article or a note, giving me new ideas from notes I wrote at some point in my life.</p>
<p>Examples could look like this:</p>
<blockquote>
<p>Show me my notes on Functional Data Engineering that relate to my current article (one or two hops)</p>
</blockquote>
<blockquote>
<p>Notes that are relevant from my vault. Or related ideas</p>
</blockquote>
<blockquote>
<p>Highlight any disagreements between the notes</p>
</blockquote>
<blockquote>
<p>Give me all notes I took on these matters and related, and give me the source note from my Obsidian vault</p>
</blockquote>
<p>Such a tool is especially helpful during brainstorming when writing my articles, or when I journal some ideas or when solving a hard problem. All of this should be local, but also available as a web app, so I can share it with you and connect it to my public second brain.</p>
<h3 id="starting-position">Starting Position</h3>
<p>There are many Obsidian plugins, such as <a href="https://github.com/SkepticMystic/graph-analysis" target="_blank" rel="noopener noreffer">Graph Analysis</a>, <a href="https://github.com/brianpetro/obsidian-smart-connections" target="_blank" rel="noopener noreffer">Obsidian Smart Connections</a> and many more, that let you do similar things. But some require hooking up a public AI provider, don&rsquo;t work very well anymore, or don&rsquo;t do exactly what I wanted.</p>
<p>The easiest would be to use Claude Code or any other agent, as it&rsquo;s just Markdown files, but then again you <strong>give away all your sensitive, potentially insightful notes</strong> and thoughts. That&rsquo;s why I wanted to build an Obsidian knowledge assistant that is grounded in my own data. I started with a simple Retrieval-Augmented Generation (RAG) system that stores its vectors in DuckDB through the <a href="https://duckdb.org/docs/stable/core_extensions/vss" target="_blank" rel="noopener noreffer">Vector Similarity Search Extension</a>, and did a couple of tests with Claude Code.</p>
<p>I shared it online and got <a href="https://www.linkedin.com/feed/update/urn:li:activity:7417544619158171648?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7417544619158171648%2C7417588137956245506%29&amp;replyUrn=urn%3Ali%3Acomment%3A%28activity%3A7417544619158171648%2C7417601077690351616%29&amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287417588137956245506%2Curn%3Ali%3Aactivity%3A7417544619158171648%29&amp;dashReplyUrn=urn%3Ali%3Afsd_comment%3A%287417601077690351616%2Curn%3Ali%3Aactivity%3A7417544619158171648%29" target="_blank" rel="noopener noreffer">helpful feedback</a> to use a specific model, <a href="https://huggingface.co/BAAI/bge-m3" target="_blank" rel="noopener noreffer">bge-m3</a>, which I integrated as much as possible with the help of agents. I also added the requirements above: it should use Obsidian-native links and build on my vault.</p>
<p>This was my first round: building a job that creates chunks and ingests them into DuckDB with the <a href="https://duckdb.org/docs/stable/core_extensions/vss" target="_blank" rel="noopener noreffer">Vector Similarity Search Extension</a>.</p>
<p>I used two different modes, because BGE-M3 takes considerably longer to generate embeddings. I ran BGE-M3 overnight and it finished after ~2 hours, not on all my notes, but on my 584 public notes.</p>
<figure><a target="_blank" href="/blog/obsidian-rag-duckdb-sql/btop.webp" title="">

</a><figcaption class="image-caption">Running btop as activity overview while running the ingestion and creating embeddings on my laptop - Using mostly CPU at 45%</figcaption>
</figure>
<h3 id="local-first">Local-First</h3>
<p>I started with the local-first approach because I want to be independent, and also I have sensitive or valuable notes that I don&rsquo;t just want to give away or upload to the cloud.</p>
<p>But there are also other reasons why you might want to use a local model. Some say:</p>
<blockquote>
<p>A.I. research done by a cloud service will hallucinate because you have <strong>no control over the weights or limits of the LLM</strong>. This is why anyone who wants to do A.I. should run their projects locally including Deep Research. <a href="https://bsky.app/profile/gostack.bsky.social/post/3mdcvdzglus2a" target="_blank" rel="noopener noreffer">Bsky</a></p>
</blockquote>
<p>Additionally, a local model with lots of your own context to draw on will be better suited to your use case. That doesn&rsquo;t mean it won&rsquo;t hallucinate, but what I find most useful is that suggestions and ideas are based on my own notes, some of which I had forgotten, and that new ideas are combinations of my own research.</p>
<h3 id="web-app">Web App</h3>
<p>I added a web app that uploads the generated embeddings to MotherDuck and uses <a href="https://duckdb.org/docs/stable/clients/wasm/overview" target="_blank" rel="noopener noreffer">DuckDB WASM</a> to serve in the client (web browser), so I could share the findings easily with anyone interested in my second brain notes.</p>
<p>This went really well, and I share all the details at the end of this article, with some lessons learned and how you can do it for yourself too.</p>
<h2 id="knowledge-assistant-building-a-rag-for-data-engineers">Knowledge Assistant: Building a RAG for Data Engineers</h2>
<p>Now let&rsquo;s get to the building part. As explained at the start, this project turns data engineering knowledge into a searchable tool, hopefully surfacing new insights and related topics, and teaching me something new along the way.</p>
<p>For now this runs on top of my <a href="https://www.ssp.sh/brain" target="_blank" rel="noopener noreffer">public (mostly) data engineering notes</a>, but we might add code snippets, interesting quotes, etc. To me, all of these are just text files, mostly markdown, and that&rsquo;s why a system based on text files is so powerful. We can use it as context to help us more.</p>
<p>The outcome and connected web app looks like this:</p>
<p>






</p>
<h3 id="what-we-built-retrieval-without-the-llm">What We Built: Retrieval Without the LLM</h3>
<p>A <a href="https://motherduck.com/blog/search-using-duckdb-part-2/" target="_blank" rel="noopener noreffer">Retrieval-Augmented Generation (RAG)</a> system built on the notes we already have, written in Markdown. More specifically Obsidian Markdown, which has the advantage of links and backlinks that give us additional clues to work with.</p>
<p>RAG in particular is a technique that can provide more accurate results to queries than a generative large language model on its own, because it draws on knowledge external to what the Large Language Model (LLM) already contains.</p>
<p>So what we built is only the Retrieval and Augmented part. We don&rsquo;t use an LLM yet, only retrieval of relevant and hidden notes based on a search: notes, code snippets within notes, and other relevant ideas.</p>
<h3 id="architecture-with-embed-model-motherduck-and-nextjs">Architecture with Embed Model, MotherDuck and Next.js</h3>
<p>First I had to split my notes into separate chunks and connect relevant links.<br>
Each chunk then goes through an embedding model that converts text into numerical vectors, so we can compare meaning rather than just keywords.</p>
<p>This runs locally and two models can be used: <strong>all-MiniLM-L6-v2</strong> (384 dimensions, fast for testing) and <strong>BAAI/bge-m3</strong> (1024 dimensions, production quality). This is the top-level Python code in the GitHub repo. It <strong>provides a CLI and DuckDB database</strong> where we can search semantically, discover hidden notes, or traverse connected notes up to two hops away.</p>
<p>The chunking is markdown-aware: it respects heading boundaries, preserves code blocks intact, and splits on paragraph breaks. Each chunk stays around <strong>512 characters</strong> and carries its heading context along. Before embedding, I prepend the note title and section heading to each chunk (e.g., <code>&quot;Title: DuckDB | Section: Installation | actual content...&quot;</code>).</p>
<p>This acts as a semantic anchor and noticeably improves retrieval quality.</p>
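<p>To make the chunking rules concrete, here is a minimal sketch in plain Python. It is not the code from the repo: the function names (<code>split_blocks</code>, <code>pack_blocks</code>, <code>chunk_markdown</code>), the <code>Intro</code> fallback heading, and the greedy packing strategy are assumptions for illustration. Only the 512-character target and the <code>Title | Section</code> prefix come from the description above.</p>
<pre><code class="language-python">import re

MAX_CHARS = 512  # soft target taken from the article; oversized blocks stay whole
FENCE = "```"


def split_blocks(body):
    """Split on blank lines, but never inside a fenced code block."""
    blocks, current, in_fence = [], [], False
    for line in body.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
        if line.strip() == "" and not in_fence:
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def pack_blocks(blocks, max_chars=MAX_CHARS):
    """Greedily merge neighbouring blocks while the result fits the budget."""
    chunks, buffer = [], ""
    for block in blocks:
        candidate = buffer + "\n\n" + block if buffer else block
        if buffer and len(candidate) not in range(max_chars + 1):
            chunks.append(buffer)  # adding this block would overflow: flush first
            buffer = block
        else:
            buffer = candidate
    if buffer:
        chunks.append(buffer)
    return chunks


def chunk_markdown(title, text):
    """Return chunks prefixed with the note title and section heading."""
    sections, heading, lines, in_fence = [], "Intro", [], False
    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else re.match(r"#{1,6}\s+(.+)", line)
        if match:  # a heading closes the previous section
            sections.append((heading, "\n".join(lines)))
            heading, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    sections.append((heading, "\n".join(lines)))
    return [
        f"Title: {title} | Section: {heading} | {chunk}"
        for heading, body in sections
        for chunk in pack_blocks(split_blocks(body))
    ]
</code></pre>
<p>Calling <code>chunk_markdown("DuckDB", note_text)</code> returns a list of strings ready for the embedding model. A single code block longer than the budget is kept in one piece rather than cut in half, which trades a few oversized chunks for intact snippets.</p>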
<p>Disclaimer: I don&rsquo;t have deep expertise in building RAG systems and semantic search, so this is built to the best of my knowledge and around what helps me most in my daily work.</p>
<p>The ingestion pipeline creates these tables with relevant information:</p>
<ul>
<li>notes: Note metadata, content, frontmatter</li>
<li>links: Wikilink graph edges</li>
<li>chunks: Chunked content for RAG retrieval</li>
<li>embeddings: 1024-dim vectors (BAAI/bge-m3)</li>
<li>hyperedges: Multiway relations (tags, folders)</li>
<li>hyperedge_members: Note membership in hyperedges</li>
</ul>
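<p>The <code>links</code> table is what makes the two-hop traversal possible. As a rough sketch of the idea (plain Python with hypothetical helper names, not the SQL or parser used in the repo), wikilinks such as <code>[[Note]]</code>, <code>[[Note|alias]]</code> and <code>[[Note#Heading]]</code> can be turned into graph edges and walked like this:</p>
<pre><code class="language-python">import re
from collections import defaultdict

# Captures the note name and skips an optional #heading and |alias part.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def extract_links(notes):
    """Return (source, target) edges for every [[wikilink]] in each note body."""
    edges = set()
    for source, body in notes.items():
        for target in WIKILINK.findall(body):
            edges.add((source, target.strip()))
    return edges


def neighbors_within_two_hops(edges, start):
    """Map each reachable note to its hop distance (1 or 2), ignoring direction."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
        graph[target].add(source)  # backlinks count as connections too
    hops = {note: 1 for note in graph[start]}
    for first in list(hops):
        for second in graph[first]:
            if second != start:
                hops.setdefault(second, 2)  # keep distance 1 if already known
    return hops
</code></pre>
<p>Here <code>notes</code> is assumed to be a dictionary of note name to Markdown body. Treating links as undirected is a design choice of this sketch, so that backlinks and outgoing links both count as one hop.</p>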
<p>The second part is a <strong>web app</strong> served via a Next.js UI and a MotherDuck WASM client that connects directly to the MotherDuck cloud database from the browser.</p>
<p>This means no database server to set up or maintain. I added a FastAPI service on Railway to serve the BGE-M3 embedding model, which avoids API costs from Hugging Face (and also makes it reliable, since Hugging Face&rsquo;s inference API kept timing out with the BGE-M3 model).</p>
<p>The architecture uses mostly serverless components:<br>
<figure>
<a target="_blank" href="/blog/obsidian-rag-duckdb-sql/mermaid.png" title="Simple Architecture of this Project">

</a><figcaption class="image-caption">Simple Architecture of this Project</figcaption>
</figure></p>
<p>Semantic search matches <strong>meaning</strong>, not keywords. When I search for &ldquo;how to model data in a warehouse,&rdquo; I want notes about dimensional modeling or dbt transformations to show up, even if they never use those exact words.</p>
<p>The BGE-M3 model converts each chunk into a 1024-dimensional vector, and we rank results by <strong>cosine similarity</strong> between the query and stored embeddings. Locally, DuckDB&rsquo;s VSS extension handles this with an HNSW index.</p>
<p>In the web app, MotherDuck&rsquo;s WASM client <a href="https://motherduck.com/docs/key-tasks/ai-and-motherduck/text-search-in-motherduck/#embedding-based-search" target="_blank" rel="noopener noreffer">doesn&rsquo;t have VSS</a>, so I compute cosine similarity manually with DuckDB&rsquo;s list functions. I was surprised how well DuckDB handles this without a dedicated vector database: one file holds relational data and vectors together.</p>
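<p>For readers who haven&rsquo;t computed it by hand before, cosine similarity is just the dot product of two vectors divided by the product of their lengths. The sketch below uses plain Python as a stand-in for the SQL. The function names and the tiny two-dimensional vectors are made up for illustration, while the real embeddings have 1024 dimensions.</p>
<pre><code class="language-python">import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, embeddings, k=3):
    """Rank chunk ids by similarity to the query vector, best first."""
    scored = [
        (cosine_similarity(query, vector), chunk_id)
        for chunk_id, vector in embeddings.items()
    ]
    return sorted(scored, reverse=True)[:k]
</code></pre>
<p>This is a brute-force scan over every stored vector. That is the same trade-off as computing it manually in the browser: no index to maintain, at the cost of touching every row per query.</p>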
<p>The &ldquo;graph-boosted search&rdquo; mode multiplies similarity by 1.2 for notes that are also graph-connected. Simple, but it surfaces better results because your link structure encodes intent that embeddings alone miss.</p>
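<p>A minimal sketch of that boost, assuming the search already returned <code>(score, note)</code> pairs and that <code>linked_notes</code> is the set of notes connected to the note you started from. Both names and that input shape are assumptions of this example, and only the 1.2 factor comes from the text above.</p>
<pre><code class="language-python">GRAPH_BOOST = 1.2  # multiplier quoted in the article


def graph_boosted(results, linked_notes):
    """Multiply the score of graph-connected notes by GRAPH_BOOST, then re-rank."""
    boosted = [
        (score * GRAPH_BOOST if note in linked_notes else score, note)
        for score, note in results
    ]
    return sorted(boosted, reverse=True)
</code></pre>
<p>With a score of 0.70 for a linked note and 0.80 for an unlinked one, the linked note moves to the top after boosting (0.84 versus 0.80), which is exactly the kind of reordering the mode is meant to produce.</p>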
<p>And the hidden connections feature, finding semantically close notes with no direct wikilink, turned out to be the most useful discovery tool.</p>
<p>It found links between notes I&rsquo;d written months apart and never thought to connect.</p>
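<p>The idea behind hidden connections can be sketched in a few lines: compare every pair of notes, drop the pairs that already share a wikilink, and keep the most similar of the rest. This is an illustrative sketch, not the repo&rsquo;s implementation. It assumes one vector per note (for example an average of its chunk vectors), and <code>hidden_connections</code> and its parameters are hypothetical names.</p>
<pre><code class="language-python">import math
from itertools import combinations


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hidden_connections(note_vectors, links, k=5):
    """Return the k most similar note pairs that share no direct link."""
    linked = {frozenset(edge) for edge in links}  # direction does not matter
    pairs = [
        (cosine_similarity(note_vectors[a], note_vectors[b]), a, b)
        for a, b in combinations(sorted(note_vectors), 2)
        if frozenset((a, b)) not in linked
    ]
    return sorted(pairs, reverse=True)[:k]
</code></pre>
<p>Comparing all pairs grows quadratically with the number of notes, which is fine for a few hundred notes but would need an index or a pre-filter for a much larger vault.</p>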
<h3 id="running-it-on-your-own-vault">Running It on Your Own Vault</h3>
<p>Because we constantly add to and improve our &ldquo;second brain&rdquo;, being able to simply rerun the ingestion and get an updated index is very powerful.</p>
<p>This is built on my data, but you can use the <a href="https://github.com/sspaeti/obsidian-note-taking-assistant" target="_blank" rel="noopener noreffer">provided GitHub repo</a> and run the local <code>make ingest</code> job to run it on your own Obsidian vault or Markdown files. You&rsquo;ll get the same UI and CLI to ask questions about your notes out of the box.</p>
<p>The results are tailored to our interests and needs, as we are the ones who wrote the notes down. The same holds if you collected a lot of highlights via web clippers such as Readwise read-it-later or the Obsidian Web Clipper: they come from other authors, but they are still snippets that you chose to store.</p>
<p>To run it on your own notes, clone the repo, set <code>VAULT_PATH</code> in the <code>.env</code> file to your Obsidian vault (or any folder of Markdown files), and run <code>make ingest</code>.</p>
<p>The ingestion parses all <code>.md</code> files, chunks them, generates embeddings with the BGE-M3 model, and stores everything in a local DuckDB file. From there you have the full CLI with semantic search, backlinks, connections, and hidden link discovery.</p>
<p>If you want the web UI too, sync to MotherDuck with <code>make sync-motherduck</code> and deploy the Next.js app.</p>
<h3 id="the-final-result">The Final Result</h3>
<p>The result of this exercise is two parts with sub-components like this:</p>
<ul>
<li><strong>Ingestion pipeline</strong>: A local job that parses Obsidian markdown, chunks it, and generates embeddings using the BGE-M3 model. Run <code>make ingest</code> and the local DuckDB file is ready to query.</li>
<li><strong>Web app</strong> at <a href="https://explore.ssp.sh" target="_blank" rel="noopener noreffer">explore.ssp.sh</a>, composed of three services:
<ul>
<li><strong>Frontend</strong> on Vercel: Next.js app with MotherDuck WASM client running DuckDB queries directly in the browser.</li>
<li><strong>Database on MotherDuck</strong>: Cloud-hosted DuckDB, synced from local via <code>make sync-motherduck</code>. No server to manage.</li>
<li><strong>Embedding microservice on Railway</strong>: A FastAPI endpoint that hosts the BGE-M3 model and converts search queries into vectors on demand. The browser sends your search text, gets back a 1024-dim embedding, and uses it to query MotherDuck for similar chunks. This avoids running a ~1.8GB model in the browser and sidesteps Hugging Face API rate limits.</li>
</ul>
</li>
</ul>
<p>Here you can see backlinks and hops that go over two notes. The hops are interesting as we don&rsquo;t see this easily on a graph, or it&rsquo;s harder to showcase. That&rsquo;s why I added them besides the normal backlinks and outgoing links.<br>







</p>
<p>Find hidden connections. Here we see that AT Protocol, the protocol behind social media platform Bluesky and others, is connected to DuckLake. Something I wouldn&rsquo;t have associated myself:<br>
</p>
<p>Now we can compare notes, think about why this could be, and ask what connection and insight we can gain from it. This is exactly why I built this: to get such insights.</p>
<blockquote>
<p>[!info] Clickable Links</p>
<p>Each note on <a href="https://explore.ssp.sh" target="_blank" rel="noopener noreffer">explore.ssp.sh</a> has a clickable link to my public brain at <code>ssp.sh/brain/[note-name]</code>.</p>
</blockquote>
<h2 id="lessons-learned-ai-agents-for-data-engineers">Lessons Learned: AI Agents for Data Engineers</h2>
<p>As you have probably noticed, since the Christmas break the AI hype, or enthusiasm, around agents has become very loud. One reason is that many people had a good amount of time to actually test the latest tools. Second, the models got better, and third, the AI companies shipped new features such as Skills, Cowork, and many more.</p>
<p>I also took some time to think about how we can leverage agents for data engineering, especially Claude Code. But contrary to the many who credit much better models, I think the key to the productivity boost lies elsewhere. With <a href="https://getnao.io/" target="_blank" rel="noopener noreffer">nao</a>, ChatGPT, Claude, and probably others, we have had AI agents and models for a while already, but the most powerful at the moment are agents in <code>plan mode</code>. It&rsquo;s the key to building for longer stretches while keeping us more in the loop.</p>
<p>But what is &ldquo;Plan Mode&rdquo; you might ask? The definition:</p>
<blockquote>
<p>Claude Plan Mode is a read-only state in Claude Code, an AI coding assistant, that lets it analyze a codebase, ask clarifying questions, and generate detailed implementation plans without making any actual file changes or executing commands, ensuring safety and structure before development begins. It&rsquo;s activated by cycling modes (often Shift+Tab) and is great for exploring, planning complex changes, and building context, allowing developers to approve the AI&rsquo;s strategy before actual coding starts. More on <a href="https://lucumr.pocoo.org/2025/12/17/what-is-plan-mode/" target="_blank" rel="noopener noreffer">What Actually Is Claude Code’s Plan Mode?</a></p>
</blockquote>
<p>With that, it&rsquo;s amazing what you can build. All the open todos we add to our backlog, we can now quickly build and test or solve, and think through the problem by actually laying out the step-by-step instructions. After it&rsquo;s built we get a feel for it quickly and can give better feedback on whatever job we have at hand right now.</p>
<p>Still, we need to be careful not to jump into building every little thing just because we can. Spending hours on something we don&rsquo;t need is still a waste of precious time.</p>
<p>I have experienced this myself often. I get the perception of being super productive, but after a couple of hours, or sometimes days, I haven&rsquo;t actually achieved what I needed. The idea I thought was cool didn&rsquo;t go anywhere, and I am mentally more exhausted because I didn&rsquo;t really do the heavy lifting, meaning I don&rsquo;t really understand what was generated. And potentially I didn&rsquo;t learn anything new either.</p>
<p>With that in mind, we need to be deliberate about when to use the new tools: certainly not always, but there are many good uses. So how else should we use agents and AI as data engineers and knowledge workers?</p>
<blockquote>
<p>[!note] Plan Mode Support</p>
<p>Besides Claude, Plan Mode is widely adopted across AI coding assistants including <a href="https://cursor.com/blog/plan-mode" target="_blank" rel="noopener noreffer">Cursor</a> (October 2025), <a href="https://windsurf.com/blog/windsurf-wave-10-planning-mode" target="_blank" rel="noopener noreffer">Windsurf</a> (June 2025), <a href="https://github.blog/changelog/2025-11-18-plan-mode-in-github-copilot-now-in-public-preview-in-jetbrains-eclipse-and-xcode/" target="_blank" rel="noopener noreffer">GitHub Copilot</a> (VS Code, Visual Studio, JetBrains, Eclipse, Xcode), <a href="https://docs.lovable.dev/features/plan-mode" target="_blank" rel="noopener noreffer">Lovable</a>, <a href="https://support.bolt.new/best-practices/discussion-mode" target="_blank" rel="noopener noreffer">Bolt.new</a>, and <a href="https://blog.replit.com/introducing-plan-mode-a-safer-way-to-vibe-code" target="_blank" rel="noopener noreffer">Replit</a> (September 2025). Everyone is following a similar pattern of letting AI analyze, ask clarifying questions, and propose structured implementation plans before writing any code.</p>
</blockquote>
<h3 id="plan-mode-and-how-we-work-best-with-ai-agents">Plan Mode: And How We Work Best with AI Agents</h3>
<p>This is how we humans work best as well. We make a plan, and then execute it and adjust along the way. But it&rsquo;s also a great way to work with juniors, and in that sense, AI agents.</p>
<p>We say what we want in an abstract manner, the agent lays out what it would do as a plan (just a markdown file, markdown runs the world these days), and then we, as the <strong>senior, or the designer or architect</strong>, can see whether it misread our intent (as language is not precise) and refine the plan with all the details. This way we know it does what we expect it to do. Then it goes off and does it autonomously, with access to the terminal and all command line tools.</p>
<p>But there&rsquo;s one more factor: the human factor. Whatever it builds, it builds on training data, so it will use what most people use. That might be ok for most cases, but maybe not if you want to build something unique or innovative. That&rsquo;s one reason why I think letting it write for us is not the right tool for most writers. Even more so, the character and soul of the person gets stripped away: the quirky things someone does, which make them who they are. And it <strong>takes away from the fun</strong> of writing.</p>
<p>Obviously in coding, this is not the same. Except if you are another programmer and need to read the code, no? Any data engineer would rather read code written by a human than by an AI, which tends to be kind of boring. But maybe it just needs to do the job, and not all human code is beautiful either, right?</p>
<blockquote>
<p>[!note] See the Prompt for the Web App</p>
<p>If you want to know how I built the web app without having experience in Next.js, I am sharing the <a href="https://github.com/sspaeti/obsidian-note-taking-assistant/blob/main/web-app/prompts/agents-webapp.md" target="_blank" rel="noopener noreffer">initial prompt</a> with plan mode that could be interesting. The summary of the full session (ca. 3-4 hours) is at <a href="https://github.com/sspaeti/obsidian-note-taking-assistant/blob/main/web-app/prompts/build-summary.md" target="_blank" rel="noopener noreffer">build-summary.md</a>.</p>
</blockquote>
<h3 id="where-are-we-heading">Where Are We Heading?</h3>
<p>So what about data engineering? Where are we today?</p>
<p>As I have written extensively about at <a href="https://www.rilldata.com/blog/has-self-serve-bi-finally-arrived-thanks-to-ai?" target="_blank" rel="noopener noreffer">Self-serve BI thanks to AI</a> or using it for <a href="https://www.rilldata.com/blog/data-modeling-for-the-agentic-era-semantics-speed-and-stewardship" target="_blank" rel="noopener noreffer">data modeling along with semantics, speed, and stewardship</a>, humans still need to be in the loop, and we need to be careful to not generate too much (ingestion logic, business logic, general code, or dashboards) that is unmaintainable or never needed in the first place.</p>
<p>On the other hand, there&rsquo;s no definitive answer right now; we are all just figuring it out. That&rsquo;s why some say these are the most exciting times, because everything is supposedly going to change. <a href="https://x.com/karpathy/status/2004607146781278521" target="_blank" rel="noopener noreffer">Andrej Karpathy</a> said:</p>
<blockquote>
<p>Clearly some powerful alien tool was handed around except it comes with no manual and everyone has to figure out how to hold it and operate it, while the resulting magnitude 9 earthquake is rocking the profession.</p>
</blockquote>
<p>As a writer but also data engineer, I find it most useful when it suggests notes and ideas I have forgotten about that are relevant to my current task at hand. Or a <strong>snippet of code</strong>.</p>
<h3 id="repeating-code-snippets-over-and-over">Repeating Code Snippets over and over</h3>
<p>How many times have we written an ingestion pipeline that does the same thing just for a different source? Or written an incremental update pipeline, a full load, or an implementation of Slowly Changing Dimensions (Type 2)?</p>
<p>Wouldn&rsquo;t it be great to have a tool that helps us remember and suggest code that worked for a problem at hand? No wonder Windows has a built-in <a href="https://support.microsoft.com/en-us/windows/retrace-your-steps-with-recall-aa03f8a0-a78b-4b3e-b0a1-2eb8ac48701c" target="_blank" rel="noopener noreffer">Windows Recall</a> feature that takes snapshots of everything we do, so we can see and remember what we did. Google traces where we went on <a href="https://www.google.com/maps/timeline" target="_blank" rel="noopener noreffer">Google Maps Timeline</a>, and so on. Not saying all of these are good, but clearly there&rsquo;s a need for it.</p>
<h3 id="vibe-coding">Vibe Coding</h3>
<p>Mostly these tasks are called <strong>vibe coding</strong> these days. I believe that vibe coding is best when you have an existing framework present and it can extend it. E.g. your website skeleton that already has a pre-existing structure is much better than starting from scratch, especially maintainability-wise.</p>
<p>Also, the more it has to predict on its own, the more likely it will introduce errors, compared to you providing a big skeleton with all the needed files and having it just extend the functionality.</p>
<p>This is the same for data engineering too. The Declarative Data Stack, or the YAML Engineer, is exactly that. A well-designed YAML that has a powerful system in the backend can go a long way with an agentic and vibe-coded approach.</p>
<p>It&rsquo;s similar to <a href="https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html" target="_blank" rel="noopener noreffer">Spec Driven Development (SDD)</a>, which is when we write our instructions in <code>claude.md</code> and Claude or any AI agents implement this. Also what <a href="https://www.linkedin.com/posts/escoo_ive-been-writing-99-of-my-code-at-airbnb-activity-7419777912096120832-f4fh?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAABkA2pgBYM4xDO0z2ChYuxFhBfu4h7jp4Lo" target="_blank" rel="noopener noreffer">Esco Obong</a> said about what they do at Airbnb: the hard part is coming up with the spec, talking to business, etc. The coding part is the small part.</p>
<p>And this is also where the human is still dearly needed in my opinion. Human in the seat and config-driven development is what it comes down to with AI agents. Plus, AI models have a context limit. Sure, we humans do too, but we can think more across domains and understand intuitive things that might not work for a statistical model.</p>
<p>This shows how that works, and why Markdown is in the middle of everything. Not only for the knowledge, but also to build and develop things.</p>
<blockquote>
<p>[!tip] Lifehack for Prompting</p>
<p>Always keep it simple, because <strong>it&rsquo;s easy to make it complex</strong>. The true beauty lies in making it simple, which is something agents are not good at.</p>
</blockquote>
<h3 id="use-mcp">Use MCP</h3>
<p>A key was using MotherDuck MCP with a direct connection from Claude Code to the database while prompting the initial version. Claude could directly query the database and its columns to implement the actual web app (see the initial prompt <a href="https://github.com/sspaeti/obsidian-note-taking-assistant/blob/main/web-app/prompts/agents-webapp.md" target="_blank" rel="noopener noreffer">here</a>).</p>
<p>This means Claude (in my case) could just query the database, run <code>SHOW TABLES</code>, select from the tables, and extract their data types. It could also learn about the content and the graph relationships that I had built in the first part.</p>
<p>So Claude could easily build a first version based on my instructions and the existing DuckDB database. I also shared the great docs on building customer-facing analytics, the <a href="https://motherduck.com/docs/key-tasks/customer-facing-analytics/3-tier-cfa-guide/" target="_blank" rel="noopener noreffer">Customer-Facing Analytics Guide (3-tier Architecture)</a>.</p>
<p>With that, I almost had my web app ready with a single <code>plan mode</code> prompt.</p>
<blockquote>
<p>[!example] Claude supports LSP now</p>
<p>As code editors do, Claude also supports LSP (Language Server Protocol). This lets Claude read code more efficiently, doing lookups by jumping to references or definitions instead of searching its way through the code. It might also understand the code better, as it has a language server to rely on.</p>
</blockquote>
<h2 id="conclusion">Conclusion</h2>
<p>Building this tool reminded me again how powerful DuckDB and MotherDuck are. Together they are a Swiss Army knife of a database that can handle unique tasks and simplify my note-taking by providing a serverless database for querying my embeddings.</p>
<p>Now I have a powerful tool to search for related notes when I need to solve a problem, or to find relevant notes in my own second brain. The hidden connections this tool surfaces are valuable only because they&rsquo;re my connections, my thinking, not just crawled information on the internet. And not only that, I can even provide a minimal but useful web app for you to search my public notes, too.</p>
<p>As for the AI agents that helped build it: they got me there faster, but only because I stayed in the loop. Let them run without direction, and you&rsquo;ll get a thousand lines solving the wrong problem. To me, the &ldquo;human&rdquo; architect is still needed.</p>
<hr>
<p><strong>Other implementations</strong> that I have collected over the years or came across while building this might be helpful if you want to build something similar.</p>
<p>If you have many more files and embeddings that need to be created, follow the <a href="https://blog.brunk.io/posts/similarity-search-with-duckdb/" target="_blank" rel="noopener noreffer">Using DuckDB for Embeddings and Vector Search</a> article, which runs on the GPU and creates embeddings for 2.85M Wikipedia articles. The author used GPU acceleration and batch inserts via Arrow.</p>
<p>Some more links and repos I found interesting:</p>
<ul>
<li><strong>Scalable Embeddings &amp; Vector Search</strong>
<ul>
<li><a href="https://blog.brunk.io/posts/similarity-search-with-duckdb/" target="_blank" rel="noopener noreffer">Using DuckDB for Embeddings and Vector Search</a>: Tutorial on GPU-accelerated vector search that created embeddings for 2.85M Wikipedia articles using Arrow batch inserts and HNSW indexing.</li>
</ul>
</li>
<li><strong>Local-First Search Tools for Markdown</strong>
<ul>
<li><a href="https://github.com/tobi/qmd" target="_blank" rel="noopener noreffer">qmd</a>: Tobias Lütke&rsquo;s CLI search engine combining BM25, vector search, and LLM re-ranking—all local via Ollama, works with plain markdown (no wikilinks needed).</li>
</ul>
</li>
<li><strong>Obsidian AI Assistants</strong>
<ul>
<li><a href="https://github.com/logancyang/obsidian-copilot" target="_blank" rel="noopener noreffer">Obsidian Copilot</a>: A popular Obsidian AI plugin (6.1k+ stars) with vault chat, agent mode, and image/PDF/web processing—no index required for basic search.</li>
<li><a href="https://www.youtube.com/watch?v=NSoKRYNlOls" target="_blank" rel="noopener noreffer">Chat with Your ENTIRE Obsidian Vault OFFLINE (YouTube)</a>: Video walkthrough of offline Obsidian vault chat with Claude 3 integration.</li>
</ul>
</li>
<li><strong>RAG Frameworks &amp; Libraries</strong>
<ul>
<li><a href="https://github.com/QuivrHQ/quivr" target="_blank" rel="noopener noreffer">Quivr</a>: YC-backed opinionated RAG framework (38.6k+ stars) supporting any LLM, any vectorstore, and any file type with YAML-configured workflows.</li>
<li><a href="https://github.com/traversaal-ai/lennyhub-rag" target="_blank" rel="noopener noreffer">LennyHub RAG</a>: Complete RAG implementation on 297 podcast transcripts with knowledge graph extraction, Qdrant storage, and interactive network visualization.</li>
</ul>
</li>
<li><strong>AI-Assisted Development in Production</strong>
<ul>
<li><a href="https://www.linkedin.com/posts/escoo_ive-been-writing-99-of-my-code-at-airbnb-activity-7419777912096120832-f4fh" target="_blank" rel="noopener noreffer">Esco Obong on AI Coding at Airbnb (LinkedIn)</a>: Airbnb engineer shares writing 99% of code with LLMs, noting that code is &ldquo;only a small part of the actual work.&rdquo;</li>
</ul>
</li>
<li><strong>My List of Obsidian Related RAGs</strong>: <a href="https://www.ssp.sh/brain/second-brain-assistant-with-obsidian-notegpt" target="_blank" rel="noopener noreffer">Second Brain Assistant with Obsidian</a></li>
</ul>
<hr>
<pre class=""><em>Full article published at <a href="https://motherduck.com/blog/obsidian-rag-duckdb-motherduck/" target="_blank" rel="noopener noreferrer">MotherDuck.com</a> - written as part of <a href="/services">my services</a></em></pre>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Just a Really Very Intelligent System from Iron Man&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Also check out <a href="https://www.spicytakes.org/" target="_blank" rel="noopener noreffer">Spicy Takes</a> with lots of quotes from popular blogs, that get rated by their spiciness.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
</item>
<item>
    <title>Arch Linux (Omarchy) — 8 Months Later: The Good, the Bad, and the Fixable</title>
    <link>https://www.ssp.sh/blog/linux-omarchy-the-good-bad-and-fixable/</link>
    <pubDate>Tue, 10 Feb 2026 21:31:17 &#43;0100</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/linux-omarchy-the-good-bad-and-fixable/</guid><enclosure url="https://www.ssp.sh/blog/linux-omarchy-the-good-bad-and-fixable/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>This is a follow-up to my part 1 of <a href="https://www.ssp.sh/blog/macbook-to-arch-linux-omarchy/" target="_blank" rel="noopener noreffer">Switching macOS to Arch Linux with Omarchy</a>, where I documented my first months with Arch Linux and Omarchy, after switching from 15 years of using macOS and Windows on and off at work since 2003.</p>
<p>Back then, I had a checklist of basics I needed before I could commit to Linux as a daily driver: Obsidian, a Raycast-like launcher for fuzzy finding files and folders, screenshots (Snagit), daylight adjustment (f.lux), calendar events in the top bar. Those were quick wins.</p>
<p>Eight months later, I&rsquo;ve gone through many more challenges and learnings. In this post, I&rsquo;ll share which apps replaced my heavily integrated <a href="https://www.youtube.com/watch?v=sStKFOwNaSM" target="_blank" rel="noopener noreffer">macOS workflow</a>, what my <a href="https://www.youtube.com/watch?v=XOp8lngtmPg" target="_blank" rel="noopener noreffer">Omarchy workflow</a> looks like now, and — honestly — what still doesn&rsquo;t quite work.</p>
<h2 id="apps-that-replaced-my-macos-apps-on-linux">Apps that Replaced My macOS Apps on Linux</h2>
<p>Let&rsquo;s start with the apps I switched to and how my workflow changed on Linux.</p>
<p>The list below starts with the Raycast replacement that was integrated into my whole workflow (file search, calculator, emojis), then covers the calendar, daylight gamma correction for night sessions, a PDF tool that replaces macOS Preview, screen sharing with the Linux window picker, and much more.</p>
<p>It continues with running Windows on Linux with a simple install toggle and finding the right hardware, before I wrap up with my conclusion after these first months of using Linux full time for my business and also privately.</p>
<h3 id="app-launcher-and-raycast-replacement-fuzzy-search-file-search-clipboard-math-and-so-on">App Launcher and Raycast Replacement: Fuzzy Search, File Search, Clipboard, Math, and so on</h3>
<p>One of the first apps I had to replace, and one that many people rely on, is <strong>Raycast</strong>. It&rsquo;s an app I couldn&rsquo;t live without, not only for the fuzzy finder but also for quick calculations, file search, and the clipboard manager.</p>
<p>With <strong>Walker Launcher</strong> I found the perfect replacement, which has all of this included and works like a charm.</p>













  
<figure><a target="_blank" href="/blog/linux-omarchy-the-good-bad-and-fixable/img_Walker_launcher_1760944046142.webp" title="">

</a><figcaption class="image-caption">Functions of Walker available with <code>/</code> | See my <a href="https://x.com/sspaeti/status/1979916427583742344" target="_blank" rel="noopener noreffer">Tweet</a> for more information.</figcaption>
</figure>
<p><strong>Search file content</strong>, Spotlight-style: Walker finds files and shows a built-in preview.</p>
<p>From there I open the file with its default program or jump to its containing folder. This is how I search and find anything, instead of manually browsing through a file explorer: Walker&rsquo;s built-in search finds any file within seconds (before I found Walker, I built <a href="https://github.com/sspaeti/dotfiles/blob/master/hypr/.config/hypr/sspaeti/fuzzy-file-content.sh" target="_blank" rel="noopener noreffer">my own one</a>).</p>
<p><strong>Emojis</strong> quick search. It comes built into Walker too, but I have my own script so I can find emojis faster, as I can change the search term. Very <a href="https://github.com/sspaeti/dotfiles/blob/master/hypr/.config/hypr/sspaeti/emoji-fuzzy.sh" target="_blank" rel="noopener noreffer">simple, but powerful</a>.</p>
<p><strong>Clipboard managers</strong>, of which there <a href="https://github.com/savedra1/clipse" target="_blank" rel="noopener noreffer">are</a> <a href="https://github.com/sentriz/cliphist" target="_blank" rel="noopener noreffer">several</a>, but Walker comes with one built-in too. Including <strong>search</strong> and <strong>image preview</strong>:</p>













  
<figure><a target="_blank" href="/blog/linux-omarchy-the-good-bad-and-fixable/img_Switched%20from%20macOS%20to%20Linux-%206%20months%20in_1770742102558.webp" title="">

</a><figcaption class="image-caption">Clipboard on opening, with search and image preview.</figcaption>
</figure>
<p>Other dedicated clipboard managers are <a href="https://github.com/sentriz/cliphist" target="_blank" rel="noopener noreffer">cliphist</a> or <a href="https://github.com/savedra1/clipse" target="_blank" rel="noopener noreffer">Clipse</a>. There are also other Raycast-compatible launchers for Linux such as <a href="https://github.com/ByteAtATime/flare" target="_blank" rel="noopener noreffer">flare</a>, Rofi, and many more.</p>
<h3 id="keyboard-shortcuts-and-quick-symbols">Keyboard Shortcuts and Quick Symbols</h3>
<p>I use <a href="https://github.com/jtroo/kanata" target="_blank" rel="noopener noreffer">Kanata</a> to switch between my keyboards and for advanced use cases such as using CAPS LOCK for vim-like movements. I use <code>caps + hjkl</code> to move left, down, up and right via the respective arrow keys, as almost all programs work with arrow keys. The same goes for the F1-F12 keys, e.g. <code>caps+1</code> for F1.</p>
<p>For simple replacements, I use XCompose to write umlauts (<code>äöü</code>), special symbols (<code>—«»</code>), and more. On macOS I used Karabiner-Elements heavily, and Kanata replaced it for me; see my configs at <a href="https://github.com/sspaeti/dotfiles/blob/master/kanata/.config/kanata/kinesis.kbd" target="_blank" rel="noopener noreffer">dotfiles.ssp.sh/kanata</a>.</p>
<h3 id="backups-and-data-sync">Backups and Data Sync</h3>
<p>Time Machine on macOS was great. I also used sync.com for Dropbox-like sync on macOS. Neither worked on Linux. So I switched to <a href="https://filen.io/" target="_blank" rel="noopener noreffer">Filen</a>, which has a similar setup, stores the data encrypted, and is hosted in Germany. I&rsquo;m using Stow for all my dotfiles stored in Git. It works great; check them out at <a href="https://dotfiles.ssp.sh" target="_blank" rel="noopener noreffer">dotfiles.ssp.sh</a>.</p>
<p>I also back up my images, personal documents, and scripts with rsync scripts to my homeserver and an encrypted drive on Vultr. See more in <a href="/blog/self-host-self-independence/" rel="">Tech Independence</a>.</p>
<p>I also looked at Nextcloud for hosting it myself, but for now I just need something that works. As Filen is an Electron app, it just works everywhere.</p>
<h3 id="calendar">Calendar</h3>
<p>A calendar is one thing everyone uses. I used Cron Calendar (later acquired by Notion) a lot and wanted a good replacement for Linux, though I often use <a href="https://calendar.google.com" target="_blank" rel="noopener noreffer">calendar.google.com</a> on the web.</p>
<p>The best replacement I found is <a href="https://morgen.so/" target="_blank" rel="noopener noreffer">Morgen</a> (built in Switzerland), which is made for Linux first. It has a great preview inside the top bar and built-in time zones.</p>
<p>Time zones can also be shown by hovering over the time on the left.</p>
<h3 id="daylight-and-gamma-light-adjustment">Daylight and Gamma light Adjustment</h3>
<p>For sunlight adjustment like <a href="https://justgetflux.com/" target="_blank" rel="noopener noreffer">f:lux</a>, Omarchy now comes with a tool included, but I also used <code>wlsunset</code> with <code>wlsunset -l 47.4095 -L 8.5514 -t 3500 -T 6500</code>, which does the job well.</p>
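<p>To make the flags in that command easier to read, here is a minimal Python sketch that assembles the same call. In <code>wlsunset</code>, <code>-l</code>/<code>-L</code> are latitude and longitude, and <code>-t</code>/<code>-T</code> are the night and day colour temperatures in Kelvin. The function name and the sanity checks are my own illustration, not part of wlsunset or Omarchy.</p>

```python
import shlex


def wlsunset_command(lat, lon, night_k=3500, day_k=6500):
    """Build the argv list for wlsunset (hypothetical helper, for illustration).

    -l / -L: latitude / longitude of your location
    -t / -T: night / day colour temperature in Kelvin
    """
    # Sanity checks of this sketch, so typos fail early.
    if abs(lat) > 90 or abs(lon) > 180:
        raise ValueError("latitude or longitude out of range")
    if night_k >= day_k:
        raise ValueError("night temperature should be lower than day temperature")
    return [
        "wlsunset",
        "-l", str(lat),
        "-L", str(lon),
        "-t", str(night_k),
        "-T", str(day_k),
    ]


# Prints the exact command used in the post.
print(shlex.join(wlsunset_command(47.4095, 8.5514)))
```

<p>The returned list can be handed to <code>subprocess.Popen</code> from an autostart script if you prefer Python over a shell one-liner.</p>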
<h3 id="hibernation-and-suspending-computer">Hibernation and Suspending Computer</h3>
<p>Hibernation and suspending are something you take for granted on other operating systems. On Linux it&rsquo;s trickier, and it didn&rsquo;t work out of the box. In the meantime, it comes built into Omarchy.</p>
<h3 id="presenting-with-external-projectors-and-screens">Presenting with External Projectors and Screens</h3>
<p>Presenting means external projectors and screens need to be recognized. I only had one presentation, but tried many external monitors, and Hyprland (which is responsible for recognizing screens, see <code>hyprctl monitors</code>) works just like macOS by auto-recognizing them.</p>
<p>Even better, I have shortcuts to make them automatically align at the right position. Or I use <a href="https://github.com/erans/hyprmon" target="_blank" rel="noopener noreffer">hyprmon</a> (one of the great TUIs) when I need to do it manually.</p>
<h3 id="pdf-merger">PDF Merger</h3>
<p>I use <a href="https://github.com/pdfarranger/pdfarranger" target="_blank" rel="noopener noreffer">PDF Arranger</a> for merging multiple PDFs into one or rotating pages of a PDF. It&rsquo;s open source and better than macOS Preview.</p>
<h3 id="need-anything-more-just-build-vibe-code-it-yourself">Need Anything More? Just Build (vibe code) it Yourself</h3>
<p>If you need something, you just build it with Claude Code and integrate it into your laptop.</p>
<p>No need to ask Mr. Bill Gates or Tim Cook to integrate it. For example, I needed an <strong>edge light for video calls</strong>, or saw someone who had this. I liked it, not that I really needed it (but one day it might be helpful 😀). I&rsquo;ve built a <a href="https://github.com/sspaeti/wayland-edge-light-videocalls" target="_blank" rel="noopener noreffer">small custom tool</a> that works out of the box with Hyprland for my future video calls.</p>













  
<figure><a target="_blank" href="/blog/linux-omarchy-the-good-bad-and-fixable/img_Switched%20from%20macOS%20to%20Linux-%206%20months%20in_1770735368009.webp" title="">

</a><figcaption class="image-caption">Check it out at <a href="https://github.com/sspaeti/wayland-edge-light-videocalls" target="_blank" rel="noopener noreffer">wayland-edge-light-videocalls</a></figcaption>
</figure>
<h3 id="screen-sharing-works-differently">Screen Sharing Works Differently</h3>
<p>Screen sharing is not as straightforward because you get a very dated-looking dialog to pick your windows, outputs, or regions. On top of that, you usually need to pick twice and only the second pick will count. This was very confusing, and I documented a fix and how it looks in my note Screen Sharing on Wayland (hyprland) with Chrome.</p>
<p>But with the latest updates of Omarchy, that has also been solved and it works out of the box and looks beautiful now:</p>













  
<figure><a target="_blank" href="/blog/linux-omarchy-the-good-bad-and-fixable/img_Switched%20from%20macOS%20to%20Linux-%206%20months%20in_1770754480836.webp" title="">

</a><figcaption class="image-caption">Compared to the default DOS screen picker, this is beautiful, or just modern.</figcaption>
</figure>
<h3 id="others-virtual-envs-remote-desktop-and-adding-printers">Others: Virtual Envs, Remote Desktop and Adding Printers</h3>
<p>For <strong>virtual environments</strong>, I&rsquo;m using Mise, as it comes pre-installed on Omarchy. Before, I used <code>asdf</code>.</p>
<p>Remote desktop to virtual desktops works great with <code>xfreerdp3</code>, which connects well to the Windows VM.</p>
<p>Need to import images from a camera? Not as UI-driven as on Mac or Windows, but amazingly simple and fast with gphoto; see my note Import Files on Arch Linux (gphoto).</p>
<p>Adding printers might be needed at some point. This can be done UI-driven with system-config-printer, the CUPS configuration tool, or the terminal way with <code>lpadmin</code>; see my note Adding Printer on Linux for more information.</p>
<h2 id="running-microsoft-windows-inside-linux">Running Microsoft Windows inside Linux</h2>
<p>A big one is to run another operating system, in this case Windows, as part of your OS. E.g. I use Microsoft Office often, so I can quickly start up Windows with Office when needed.</p>
<p>The best part is that it uses only Docker, meaning easy setup, separated from my configs. It&rsquo;s a single-click setup taking 15 seconds.</p>
<p>Installing and integrating Windows seamlessly in a Docker VM works <a href="https://learn.omacom.io/2/the-omarchy-manual/100/windows-vm" target="_blank" rel="noopener noreffer">superbly with Omarchy</a>. I submitted a <a href="https://github.com/basecamp/omarchy/pull/1333" target="_blank" rel="noopener noreffer">PR to Omarchy</a> that became the first built-in version; it was merged into core and is now available to everyone.</p>













  
<figure><a target="_blank" href="/blog/linux-omarchy-the-good-bad-and-fixable/windows-omarchy-vm.webp" title="">

</a><figcaption class="image-caption"><a href="https://x.com/sspaeti/status/1978823118270390642" target="_blank" rel="noopener noreffer">Tweet</a> and thanks from <a href="https://x.com/dhh/status/1978826791792918724" target="_blank" rel="noopener noreffer">DHH</a> himself.</figcaption>
</figure>
<p>It&rsquo;s now easier to run Windows on Linux than natively on a Windows machine 😉.</p>
<div class="details admonition info open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-info"></i>Many ways of integrating: Omarchy uses Dockur<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content"><p>There are many options:</p>
<ul>
<li><a href="https://github.com/dockur/windows" target="_blank" rel="noopener noreffer">dockur/windows</a>: Windows inside a Docker container (used in Omarchy).</li>
<li><a href="https://github.com/winapps-org/winapps" target="_blank" rel="noopener noreffer">Winapps</a>: Run Windows apps such as Microsoft Office/Adobe in Linux (Ubuntu/Fedora) and GNOME/KDE as if they were a part of the native OS, including Nautilus integration.</li>
<li><a href="http://winboat.app/" target="_blank" rel="noopener noreffer">WinBoat</a>: an easier alternative to Winapps that runs Windows apps on Linux with seamless integration.</li>
</ul></div>
        </div>
    </div>
<h2 id="finding-the-right-hardware-the-reasons-why-not-to-switch">Finding the Right Hardware: The Reasons why not to Switch</h2>
<p>Everyone knows the stereotypes about Linux. WiFi won&rsquo;t work, Bluetooth won&rsquo;t connect, constant interruptions. And beyond that, there&rsquo;s the hardware fear, that you simply can&rsquo;t match what Apple offers. A common sentiment:</p>
<blockquote>
<p>I&rsquo;m currently with this dilemma. I&rsquo;m an experienced Linux user, but over the years gravitated towards Macs (especially M-series) and unfortunately they do make better hardware, at least for my use 🙈 I just <em>can&rsquo;t</em> move to a machine with a much worse battery life, display, webcam, speakers etc. I know some good Linux-friendly laptops exist, but it&rsquo;s still a downgrade, for me. If someone made better hardware, I&rsquo;d probably jump over right away. <a href="https://x.com/DenLoginoff/status/2021079777608614290" target="_blank" rel="noopener noreffer">Tweet</a></p>
</blockquote>
<p>I thought the same. The great keyboard, camera, speakers, trackpad, battery. Apple just nails the whole package. But what I found is that I didn&rsquo;t actually have to downgrade.</p>
<p>I started with a <strong>Lenovo ThinkBook 14 G7 ARP (AMD)</strong> with 32 GB RAM. Great build quality, beautiful look, and the keyboard surprised me, with much more travel and grip than the MacBook. See more on <a href="https://www.ssp.sh/blog/macbook-to-arch-linux-omarchy/#choosing-the-hardware" target="_blank" rel="noopener noreffer">Part 1</a>.</p>
<p>Once I realized this would become my daily driver, I searched for something more powerful for data engineering work and landed on a <strong>Tuxedo InfinityBook Pro 14 Gen10 AMD</strong> with 128 GB (!!) RAM, an AMD Ryzen AI 9 HX 370, and AMD Radeon 890M.</p>













  
<figure><a target="_blank" href="/blog/linux-omarchy-the-good-bad-and-fixable/img_Switched%20from%20macOS%20to%20Linux-%206%20months%20in_1770739390246.webp" title="">

</a><figcaption class="image-caption">My Tuxedo InfinityBook Pro 14 Gen10 AMD, with 128 GB (!!) RAM, AMD Ryzen AI 9 HX 370 , and AMD Radeon 890M.</figcaption>
</figure>
<p>First impressions: super smooth, even snappier than the Lenovo. Obsidian and other apps feel a tiny bit faster. The 3K 500-nit display is stunning. Crisp, bright, better than my external 4K monitor. And it&rsquo;s <strong>matte</strong>, which I&rsquo;d forgotten I actually prefer. It works outside, no glare. The Lenovo&rsquo;s anti-glare screen was equally great in that regard.</p>
<p>The keyboard has less travel than the Lenovo ThinkBook (which I really loved), more like a MacBook, which is fine but feels slightly cheap. I mostly use external keyboards anyway. The fingerprint reader is also missing, which I&rsquo;d grown to love on both MacBooks and the Lenovo, where it worked flawlessly on Linux. The trackpad is smooth and great to work with daily, though palm detection caused some cursor jumping in the first days, not as good as Apple&rsquo;s, but perfectly usable.</p>
<p>Battery life was a pleasant surprise. My Tuxedo (80Wh) delivered battery life comparable to my M1 Max MacBook. I spent a whole afternoon in the library and it was still above 70%.</p>
<p>With Omarchy, everything just worked out of the box. No WiFi or Bluetooth issues, speakers with sound, all good.</p>
<p>But it&rsquo;s not perfect, by far. I get some <a href="https://www.reddit.com/r/tuxedocomputers/comments/17pzcet/strange_popping_sounds_coming_from_my_laptop/" target="_blank" rel="noopener noreffer">strange popping sounds from the laptop</a>, mostly after hibernating once or twice. Not sure why, and it probably shouldn&rsquo;t happen. There are many other notebooks and desktops for Linux to choose from, Framework being one, but choosing is still tricky, as chipset and GPU support on Linux matters, and you want something state-of-the-art.</p>
<p>Another side effect of not having an expensive MacBook, as <a href="https://x.com/KevinNaughtonJr/status/2021009900097483120" target="_blank" rel="noopener noreffer">Kevin says</a>:</p>
<blockquote>
<p>My favorite part aside from customization is just that i don&rsquo;t care about my machine at all: it gets lost? breaks? stolen? i get a new machine, run 1 command and everything is back exactly as i left it. Macs on the other hand are expensive to buy and repair which makes people worry and worry = less peace of mind.</p>
</blockquote>
<blockquote>
<p><strong>Follow the evolution on Social Media</strong></p>
<p>I documented the whole story in threads on <a href="https://x.com/sspaeti/status/1942502383923134464" target="_blank" rel="noopener noreffer">Twitter</a> and on <a href="https://bsky.app/profile/ssp.sh/post/3lug5oijnjc22" target="_blank" rel="noopener noreffer">Bluesky</a>; follow these to see the history and events in the order they happened.</p>
</blockquote>
<h2 id="conclusion-of-using-linux-for-8-months">Conclusion of Using Linux for 8+ Months</h2>
<p>After using Windows since 2003 and macOS for more than 15 years, how do I feel after 8 months on Linux?</p>
<p><strong>Things mostly work great</strong>, but need a little tinkering to begin with, or work differently. The biggest difference, which I like a lot, is a more terminal-native workflow. Closer to the command line. Using lots of TUIs.</p>
<p><strong>When I started</strong>, I just wanted the same as I had on macOS. After getting familiar with the new environment, with all the small utilities, tools and programs Linux has, I got many more tools to choose from. Sometimes much better, though terminal-based, but fast and direct. Sometimes you obviously miss a tool that has no replacement (for me still Snagit).</p>
<p>Besides the obvious (terminal-native, best-in-class Tiling Window Manager with Hyprland, no-latency navigation), there&rsquo;s something harder to put into words. The OS is what we use every day, so when you can quickly fix or change a small thing to give you more joy or more productivity, it might just put a smile on your face whenever you use that feature. At least it still does for me. And since all my configs live in <a href="https://dotfiles.ssp.sh" target="_blank" rel="noopener noreffer">dotfiles</a> and my data syncs externally via Filen and Obsidian, setting up a new machine is a single command.</p>
<h3 id="what-i-thought-id-miss-vs-what-i-actually-miss">What I Thought I&rsquo;d Miss vs. What I Actually Miss</h3>
<p>Before I switched, I thought I&rsquo;d miss all my <a href="https://setapp.com/" target="_blank" rel="noopener noreffer">Setapp</a> apps, my MacBook hardware, and the stability of everything just working. Most of my apps work on my new machine, some even better software-wise, so I&rsquo;m still quite happy to have made the switch, even more so watching macOS get slower with each install without real benefit (looking at the Liquid Glass update) and Windows stuffing Copilot into Notepad and recording your screen with Recall.</p>
<p>What I <strong>actually</strong> miss are simpler things: the <strong>stability</strong> of having calls everywhere all the time with Apple reliability and the built-in mic/speaker/camera. On Linux, I have had a crash because the GPU fails<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> after hibernation, right before an important meeting. Hibernation and suspending were quite a battle to get working, but they seem to just work now.</p>
<p>What I like about Linux: it might not work out of the box for every laptop or every program, but you can actually fix it, and from that moment you know the problem, you learned something about computers, and the error will not appear again. Unlike other operating systems that change stuff you set in settings for a reason, only to learn that certain updates turned that checkmark back on.</p>
<h3 id="tinkering-and-troubleshooting-not-everything-just-works">Tinkering and Troubleshooting: Not Everything Just Works</h3>
<p>Without Claude Code, I probably wouldn&rsquo;t have made the switch, or I would have made it but wouldn&rsquo;t have stayed. When something happens, e.g., a crash out of nowhere, I just open Claude and say: I had a crash, I am running Arch Linux, please check the logs for what went wrong. And what I get is a full analysis of what went wrong, some fixes and suggestions. Knowing that Linux has 100 different setups, different drivers for every hardware, this is a non-negotiable lifesaver.</p>
<p>With Linux you also have to troubleshoot bugs, but at least it&rsquo;s free software and open source, and honestly, they seem even less frequent than with commercial, paid products these days.</p>
<p>If you&rsquo;ve read this far, thank you. What&rsquo;s your experience, are you thinking about switching, or already on Linux? Let me know anywhere on <a href="https://bsky.app/profile/ssp.sh/post/3melam6gxrf2m" target="_blank" rel="noopener noreffer">Bluesky</a>, or <a href="https://x.com/sspaeti/status/2021528330324086934" target="_blank" rel="noopener noreffer">Twitter</a>.</p>
<hr>
<p>Again, if you want to watch a full video workflow, check my short video about it: <a href="https://www.youtube.com/watch?v=XOp8lngtmPg" target="_blank" rel="noopener noreffer">Omarchy Arch Tiling Window Workflow (macOS comparison) - YouTube</a>. Or <a href="https://www.youtube.com/watch?v=sStKFOwNaSM" target="_blank" rel="noopener noreffer">my macOS workflow</a> as a comparison and what I switched from.</p>
<h2 id="appendix-troubleshooting-and-things-that-didnt-work-so-well-or-i-had-already-fixed">Appendix: Troubleshooting and Things That Didn&rsquo;t Work So Well, or I Had Already Fixed</h2>
<p><strong>GPU Crashes (AMD Radeon 890M).</strong> This was the biggest recurring issue. The GPU&rsquo;s MES (Micro Engine Scheduler) would become unresponsive and crash the entire system, triggered by Brave browser, Google Meet video calls, and Kdenlive video encoding. The root cause is that the Radeon 890M (gfx1150/RDNA 3.5) is still very new, and driver support on bleeding-edge kernels (6.17–6.18) is immature. Solutions included disabling hardware acceleration in Brave (<code>brave://settings/system</code>), adding kernel parameters (<code>amdgpu.gpu_recovery=1 amdgpu.noretry=0 amdgpu.ip_block_mask=0xfffff7ff</code>), and considering the LTS kernel as fallback. The community is tracking this on <a href="https://community.frame.work/t/amd-gpu-mes-timeouts-causing-system-hangs-on-framework-laptop-13-amd-ai-300-series/71364" target="_blank" rel="noopener noreffer">Framework forums</a> and <a href="https://gitlab.freedesktop.org/drm/amd/-/issues/3067" target="_blank" rel="noopener noreffer">AMD&rsquo;s GitLab</a>.</p>
<p><strong>Keyboard Freezing After Suspend (Tuxedo).</strong> The internal keyboard would stop working after suspend/resume cycles due to a firmware bug in the keyboard controller (i8042). Fixed by adding <code>i8042.nomux=1 i8042.reset=1 i8042.noloop=1 i8042.nopnp=1</code> to the kernel command line in <code>/etc/default/limine</code> and regenerating the UKI with <code>sudo mkinitcpio -P</code>. Shared the solution on <a href="https://sh.reddit.com/r/tuxedocomputers/comments/1ndq7vw/comment/ne5kjob/" target="_blank" rel="noopener noreffer">Reddit</a>.</p>
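<p>Adding kernel parameters by hand is easy to get wrong, for example by ending up with the same key twice. A minimal Python sketch of an idempotent merge, working on a plain command-line string. The function name is hypothetical, and the sketch deliberately does not touch <code>/etc/default/limine</code> itself, since I don&rsquo;t want to assume that file&rsquo;s exact format here.</p>

```python
def merge_kernel_params(cmdline, new_params):
    """Return cmdline with new_params applied (hypothetical helper).

    An existing parameter with the same key (the part before '=') is
    replaced, so running the merge twice gives the same result.
    Assumptions of this sketch: no quoted values containing spaces, and
    no parameters that are meant to appear several times (e.g. console=).
    """
    def key(param):
        return param.split("=", 1)[0]

    new_keys = {key(p) for p in new_params}
    kept = [p for p in cmdline.split() if key(p) not in new_keys]
    return " ".join(kept + list(new_params))


# The i8042 parameters from the keyboard fix above.
I8042_FIX = ["i8042.nomux=1", "i8042.reset=1", "i8042.noloop=1", "i8042.nopnp=1"]

once = merge_kernel_params("quiet splash", I8042_FIX)
twice = merge_kernel_params(once, I8042_FIX)
print(once)
print(once == twice)
```

<p>The same helper works for the <code>amdgpu.*</code> parameters from the GPU section; the resulting string still has to be written into your bootloader config and the boot image regenerated.</p>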
<p><strong>Hibernation Not Resuming.</strong> After suspend-then-hibernate (triggered by closing the lid), the laptop wouldn&rsquo;t resume and required a fresh boot. The cause was missing <code>resume=</code> and <code>resume_offset=</code> kernel parameters. Omarchy&rsquo;s hibernation setup script added the mkinitcpio hook but never added the actual kernel parameters. Fixed by calculating the swap offset (<code>sudo btrfs inspect-internal map-swapfile -r /swap/swapfile</code>) and adding <code>resume=/dev/mapper/root resume_offset=&lt;offset&gt;</code> to <code>/etc/default/limine</code>. Documented the fix in <a href="https://github.com/basecamp/omarchy/issues/4259#issuecomment-3804954054" target="_blank" rel="noopener noreffer">this Omarchy issue</a>.</p>
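<p>Since the symptom only shows up when resuming fails, it helps to check up front whether the running kernel actually received both parameters. A small Python sketch under the assumption that you hibernate to a swap file (so <code>resume_offset=</code> is needed as well); the function names are made up for illustration, and <code>/proc/cmdline</code> is the standard Linux file holding the boot command line.</p>

```python
# Both parameters are needed when hibernating to a swap file.
REQUIRED_PARAMS = ("resume", "resume_offset")


def missing_resume_params(cmdline):
    """Return the required hibernation parameters absent from cmdline."""
    present = {p.split("=", 1)[0] for p in cmdline.split() if "=" in p}
    return [name for name in REQUIRED_PARAMS if name not in present]


def read_kernel_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()


# Illustrative command lines; the offset value is a placeholder.
print(missing_resume_params("root=/dev/mapper/root rw quiet"))
print(missing_resume_params("rw resume=/dev/mapper/root resume_offset=12345"))
```

<p>On a real machine, call <code>missing_resume_params(read_kernel_cmdline())</code> after a reboot; an empty list means both parameters reached the kernel.</p>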
<p><strong>Thermal Throttling (Lenovo).</strong> The Lenovo ThinkBook would hit 99°C and become unusable during video calls. Turned out the bottom intake vents were blocked when the laptop sat flat on a desk. Simply elevating the laptop dropped temps to 73–77°C and performance was completely fine, even running stress tests while screen sharing. A laptop stand solved it permanently.</p>
<p><strong>WiFi Speed Drops (Tuxedo, Intel AX210).</strong> Speeds dropped to 2–72 Mbps after a system update. Root cause was a bug in <code>linux-firmware-intel</code> version 20251125 that caused the Intel AX210 card to negotiate very low RX bitrates. Fixed by downgrading to the October firmware (<code>sudo pacman -U /var/cache/pacman/pkg/linux-firmware-intel-20251021-1-any.pkg.tar.zst</code>), disabling WiFi power save permanently via a systemd service, and adding <code>IgnorePkg = linux-firmware-intel</code> to <code>/etc/pacman.conf</code> until a proper fix ships.</p>
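<p>To see which cached firmware versions are candidates for a downgrade, the package file names already contain everything needed, because pacman packages are named <code>name-pkgver-pkgrel-arch.pkg.tar.zst</code>. A minimal Python sketch that parses such a name and flags the version mentioned above. The helper names and the <code>KNOWN_BAD</code> set are my own illustration, and the <code>-1</code> release suffix of the affected package is an assumption.</p>

```python
# Version reported as problematic in this post; extend as needed.
KNOWN_BAD = {"20251125"}


def parse_pkg_filename(filename):
    """Split a pacman package file name into its four components."""
    base = filename.rsplit("/", 1)[-1]       # drop any directory part
    stem = base.split(".pkg.tar", 1)[0]      # drop .pkg.tar.zst / .pkg.tar.xz
    # The package name itself may contain dashes, so split from the right.
    name, pkgver, pkgrel, arch = stem.rsplit("-", 3)
    return {"name": name, "pkgver": pkgver, "pkgrel": pkgrel, "arch": arch}


def is_affected(filename):
    """True if the file is a linux-firmware-intel build listed in KNOWN_BAD."""
    info = parse_pkg_filename(filename)
    return info["name"] == "linux-firmware-intel" and info["pkgver"] in KNOWN_BAD


good = "/var/cache/pacman/pkg/linux-firmware-intel-20251021-1-any.pkg.tar.zst"
bad = "linux-firmware-intel-20251125-1-any.pkg.tar.zst"  # pkgrel assumed
print(parse_pkg_filename(good))
print(is_affected(good), is_affected(bad))
```

<p>Listing <code>/var/cache/pacman/pkg/</code> and running each name through <code>is_affected</code> shows quickly whether an older, unaffected build is still in the cache.</p>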
<p><strong>Keyring/Brave Re-login on Every Boot.</strong> Brave asked for login credentials after every reboot because the gnome-keyring file kept getting corrupted. This was caused by SDDM autologin. Without entering a password at login, PAM can&rsquo;t unlock the keyring. The ultimate fix was launching Brave with <code>--password-store=basic</code> in the autostart config. Documented in <a href="https://github.com/basecamp/omarchy/discussions/3523#discussioncomment-15286162" target="_blank" rel="noopener noreffer">this Omarchy discussion</a>.</p>
<p><strong>Sudoers Misconfiguration.</strong> While adding a NOPASSWD rule for a keyboard-switching script, I accidentally broke sudo access entirely. Had to boot into recovery/single-user mode to fix <code>/etc/sudoers</code>. Lesson learned: always have <code>sspaeti ALL=(ALL) ALL</code> as a separate line and be very careful with <code>visudo</code>. Documented the recovery process in an emergency recovery guide.</p>
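<p>One way to lower the risk is to never edit <code>/etc/sudoers</code> directly: generate the rule, write it to a separate file, and let <code>visudo -c -f</code> check the syntax before the file goes anywhere near <code>/etc/sudoers.d/</code>. A minimal Python sketch of that idea; the helper names and the script path are hypothetical, and the validation in <code>nopasswd_rule</code> is intentionally strict rather than complete.</p>

```python
import re
import subprocess


def nopasswd_rule(user, command):
    """Build a single NOPASSWD sudoers line for one absolute command path."""
    if not re.fullmatch(r"[a-z_][a-z0-9_-]*", user):
        raise ValueError("unexpected user name: " + repr(user))
    # Require an absolute path without whitespace to keep the rule narrow.
    if not command.startswith("/") or any(ch.isspace() for ch in command):
        raise ValueError("command must be an absolute path without spaces")
    return f"{user} ALL=(ALL) NOPASSWD: {command}\n"


def sudoers_file_is_valid(path):
    """Syntax-check a sudoers file with visudo (-c: check only, -f: file)."""
    result = subprocess.run(["visudo", "-c", "-f", path], capture_output=True)
    return result.returncode == 0


# Hypothetical script path, for illustration only.
print(nopasswd_rule("sspaeti", "/usr/local/bin/switch-keyboard"), end="")
```

<p>Write the rule to a temporary file, call <code>sudoers_file_is_valid</code> on it, and only then copy it into place. Keeping a root shell open in a second terminal while testing is a cheap additional safety net.</p>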
<p><strong>Screen Recording VFR Issues.</strong> Omarchy&rsquo;s screen recorder (<code>gpu-screen-recorder</code>) produces variable frame rate videos by default, which Kdenlive can&rsquo;t edit properly. The fix is adding <code>-fm cfr</code> to the recording command. Additionally, Kdenlive&rsquo;s VAAPI hardware transcoding crashed the GPU (same MES issue), so software encoding (<code>libx264</code>) is needed for now.</p>
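<p>To check whether an existing recording is affected, one rough heuristic is to compare the two frame rates that <code>ffprobe</code> reports for the video stream, <code>r_frame_rate</code> and <code>avg_frame_rate</code>; if they differ noticeably, the file is likely variable frame rate. A Python sketch of that comparison. The function name and the 1% tolerance are assumptions of mine, and this is not how Kdenlive itself detects VFR.</p>

```python
from fractions import Fraction

# Command that yields the two values, one per line (run it yourself):
# ffprobe -v error -select_streams v:0 \
#   -show_entries stream=r_frame_rate,avg_frame_rate \
#   -of default=noprint_wrappers=1:nokey=1 recording.mp4


def looks_variable(r_frame_rate, avg_frame_rate, tolerance=0.01):
    """Heuristic: True if the two reported rates differ by more than tolerance.

    Both arguments are ffprobe-style fraction strings such as "60/1".
    Unparseable or zero rates are treated as "cannot confirm constant".
    """
    try:
        nominal = Fraction(r_frame_rate)
        average = Fraction(avg_frame_rate)
    except (ZeroDivisionError, ValueError):
        return True  # e.g. "0/0" when ffprobe cannot determine a rate
    if nominal == 0:
        return True
    return abs(nominal - average) / nominal > tolerance


print(looks_variable("60/1", "60/1"))
print(looks_variable("60/1", "45/1"))
```

<p>If the check flags a file, re-recording with <code>-fm cfr</code> as described above avoids the problem at the source.</p>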
<p><strong>CPU Fan Noise When Plugged In.</strong> The system switched to the &ldquo;performance&rdquo; CPU governor when plugged in, which kept the fans at full speed even at low load. Fixed via <code>powerprofilesctl set balanced</code> or through Omarchy&rsquo;s built-in power settings menu.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>In my case, I have a newer GPU, which suddenly shuts down because not all fixes have been released in the drivers and software yet. Depending on your hardware, you might be more or less lucky.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
</item>
</channel>
</rss>
