<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
    <channel>
        <title>Data Engineering Blog</title>
        <link>https://www.ssp.sh/</link>
        <description>Genuine News About the Data Ecosystem</description>
        <generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>hello@sspaeti.com (Simon Späti)</managingEditor>
            <webMaster>hello@sspaeti.com (Simon Späti)</webMaster><copyright>All rights reserved. Sharing of excerpts with proper attribution is encouraged for non-commercial purposes. For commercial use or republication, please contact hello@sspaeti.com.</copyright><lastBuildDate>Thu, 14 May 2026 00:08:08 &#43;0200</lastBuildDate>
            <atom:link href="https://www.ssp.sh/index.xml" rel="self" type="application/rss+xml" />
        <item>
    <title>Internal vs. External Storage? What&#39;s the Limit of External Tables</title>
    <link>https://www.ssp.sh/blog/modern-external-tables-and-evolution/</link>
    <pubDate>Thu, 14 May 2026 00:08:08 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/modern-external-tables-and-evolution/</guid><enclosure url="https://www.ssp.sh/blog/modern-external-tables-and-evolution/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>When I started my career as a data warehouse engineer and business intelligence engineer in 2003, external tables with materialized views were the standard. We used external tables to integrate CSV files and other data not already in Oracle databases. Oracle External Tables have existed since 2001, and that&rsquo;s where I first used them. If the Lindy Effect continues to hold, we&rsquo;ll use external tables even longer. But why have they survived for so long?</p>
<p>The core question is: &ldquo;When should you store data internally in your warehouse versus externally in object storage?&rdquo; Hot data that is queried frequently goes inside. Cold archival data stays external, where it&rsquo;s cheaper but slower. Interestingly, Databricks and BigQuery recently added external table features. Why? Not because they&rsquo;re trendy, but because the economics still work.</p>
<p>This article offers an inside look at external tables, their 25-year history, how they evolved from CSV parsers to ACID lakehouse tables, and whether you need to know about them today.</p>
<h2 id="what-are-external-tables">What Are External Tables?</h2>
<p>So what are external tables, and why have we been using them for so long? Why don&rsquo;t we just use the internal storage of a database?</p>
<p>In Oracle, where I first used them in 2008, they allowed you, and still do, to access data in external sources. External tables are defined as <strong>tables that do not reside in the database</strong>, and can be in any format for which an access driver<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> is provided. All of this is declared via the <a href="https://en.wikipedia.org/wiki/Data_definition_language" target="_blank" rel="noopener noreffer">DDL</a> (Data Definition Language) of the database, which describes an external table with all its columns, data types, etc., and exposes the data as if it were residing in a regular database table.</p>
<p>The external data can be queried in parallel and <strong>queried directly using SQL</strong>. Essentially, it&rsquo;s read-only access to data stored outside of our database, making it available in a tabular, easy-to-work-with format that fits existing tooling and languages. In 2008, this was through procedural languages such as PL/SQL in Oracle or T-SQL in SQL Server.</p>
<p>Today, external tables have evolved. The biggest change is that they can read more formats including semi-structured data such as Parquet, JSON, Avro, and ORC. While CSV was readable in 2008, the difference today is the columnar formats and nested formats that enable faster analytics. These are available for downstream processes and dashboards, but mostly accessed through SQL queries in one form or another.</p>
<p>A modern definition comes from <a href="https://research.google/pubs/biglake-bigquerys-evolution-toward-a-multi-cloud-lakehouse/" target="_blank" rel="noopener noreffer">BigLake</a>, an evolution of BigQuery toward a multi-cloud lakehouse that tries to solve key customer requirements around the unification of data lake and enterprise data warehousing workloads, and which <a href="https://docs.cloud.google.com/bigquery/docs/external-tables" target="_blank" rel="noopener noreffer">introduced</a> external tables in 2015 as part of it<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>:</p>
<blockquote>
<p>External tables are stored outside of BigQuery storage and refer to data that&rsquo;s stored outside of BigQuery. [..] Google Non-BigLake external tables let you query structured data in external data stores. To query a non-BigLake external table, you must have permissions to both the external table and the external data source.</p>
</blockquote>
<p>Snowflake <a href="https://docs.snowflake.com/en/sql-reference/sql/create-external-table" target="_blank" rel="noopener noreffer">defines</a> them as:</p>
<blockquote>
<p>[&hellip;]  When queried, an external table reads data from a set of one or more files in a specified external stage, and then outputs the data in a single VARIANT column. Additional columns can be defined, with each column definition consisting of a name, data type, and optionally whether the column requires a value (NOT NULL) or has any referential integrity constraints.</p>
</blockquote>
<p>External tables were <a href="https://www.snowflake.com/en/blog/external-tables-are-now-generally-available-on-snowflake/" target="_blank" rel="noopener noreffer">added in 2021</a>, and Snowflake described their benefits as follows:</p>
<blockquote>External Tables Address Key Data Lake Challenges:
<ol>
<li>To <strong>augment an existing data lake</strong>. [..] augment their existing data lake, rather than replace it. The External Tables feature enables that use case. Customers can use external tables to query the data in their data lake without ingesting it into Snowflake. (side note: MVs<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>)</li>
<li>Ad-hoc analytics. Customers often use external tables to <strong>run ad-hoc queries directly on raw data before ingesting the data</strong> into Snowflake. Ad-hoc queries help them evaluate data sets and determine further actions.</li>
</ol></blockquote>
<div class="mermaid" id="id-7"></div>
<h3 id="just-a-pointer-symlink">Just a Pointer (Symlink)?</h3>
<p>A simple analogy is a <strong>symlink in Linux</strong>, where you point from your current directory to another directory without moving data. You just add a pointer. If you read that file from that symlink, all it does is read it from the location the symlink points to.</p>
<p>An external table is the same, just a <strong>pointer</strong> to external data, bringing that data into the current data warehouse or cloud solution, hence the word external. You define the source format such as XML, CSV, etc., and its structure, and then you can query it at any time. It&rsquo;s similar to a SQL view in that sense, but pointing to non-internal data.</p>
<p>Running <code>DROP TABLE</code> on an external table is a metadata-only operation. No data is removed, only the table definition from the internal data catalog. The same is true for a symlink. Almost any relational database today has support for this, even if it&rsquo;s not called an external table. Everyone occasionally needs to read data outside of their warehouse or database.</p>
<h2 id="recap-in-the-history-of-external-tables">Recap in the History of External Tables</h2>
<p>Looking back at their evolution, external tables have been a <strong>recurring pattern</strong> across every generation of database technology since the early 2000s, and arguably longer if you count IBM&rsquo;s federated database concepts from the late 1990s.</p>
<div class="mermaid" id="id-8"></div>
<h3 id="the-origin-story-iso-in-2001">The Origin Story: ISO in 2001</h3>
<p>The history starts with <a href="https://www.iso.org/standard/31370.html" target="_blank" rel="noopener noreffer">ISO/IEC 9075-9</a>, published in 2001. Part 9 of the SQL standard defined foreign-data wrappers and datalink types for managing external data from within SQL. The work was completed in late 2000 and published alongside SQL:1999, with full integration in SQL:2003 (it was later <a href="https://www.iso.org/standard/84804.html" target="_blank" rel="noopener noreffer">updated in 2023</a>).</p>
<p>It was the initial definition and extensions to database language SQL to support management of external data <strong>through the use of foreign-data wrappers and datalink types</strong>.</p>
<p>My first encounter was with Oracle external tables, but according to <a href="https://en.wikipedia.org/wiki/Open_Database_Connectivity" target="_blank" rel="noopener noreffer">Wikipedia</a> there were earlier implementations. <strong>Microsoft Access linked tables (~1992)</strong> were the earliest consumer-facing implementation, where users could link dBASE, Paradox, text files, and ODBC sources as if they were Access tables. <strong>ODBC 1.0 (1992)</strong> itself established the first standard for heterogeneous data access across databases, though it didn&rsquo;t create table abstractions.</p>
<p>Further, <strong><a href="https://www.mcpressonline.com/analytics-cognitive/db2/the-as400-and-ibms-db2-datajoiner" target="_blank" rel="noopener noreffer">IBM&rsquo;s DB2 DataJoiner</a> (~1995)</strong> was more ambitious: a middleware product enabling SQL queries across Oracle, Sybase, SQL Server, Informix, Teradata, and even VSAM files through a unified interface. With <strong>SQL Server 7.0&rsquo;s Linked Servers (1998)</strong>, federated querying came to Microsoft&rsquo;s ecosystem via <strong>OLE DB</strong>, supporting cross-database joins with four-part naming conventions.</p>
<p>Most of these implementations shared a common limitation that Oracle (<a href="https://oracle-base.com/articles/9i/sql-new-features-9i" target="_blank" rel="noopener noreffer">9i Release 1 - 9.0.1</a> in 2001) solved: they focused on querying <em>other databases</em> or required middleware. Oracle&rsquo;s abstraction treated local flat files as first-class read-only table objects using the familiar <code>CREATE TABLE ... ORGANIZATION EXTERNAL</code> DDL syntax, providing a simple way to define external files as part of normal table creation and allowing ORACLE_LOADER access to query flat files (CSV, fixed-width, delimited) through DBAs.</p>
<p>It was an early way of separating declaration from compute (the Oracle loaders).</p>
<h2 id="why-external-tables-what-are-their-benefits">Why External Tables? What Are Their Benefits?</h2>
<p>But why use external tables? What makes them so useful that they persisted? Why have they <strong>survived so long</strong>, and why are they getting added to Databricks and other major platforms?</p>
<p>For that, we need to look at external tables&rsquo; benefits. The first reason is that external tables can simplify data access to <strong>avoid developing ETL pipelines</strong>, moving data out of the source, and re-ingesting it in our data warehouse. They make external data accessible easily, defined in a tabular form by a database schema with column types. Typical cloud data warehouses like Snowflake and Azure use them to link existing data from object storage easily without moving data. This makes the object storage files accessible for almost any downstream tool or query language in a simple and cost-effective way.</p>
<p>Another use is to keep some data on <strong>cheaper storage</strong> (e.g., object storage instead of data warehouse storage) and only link it in. It&rsquo;s slower to fetch, but more affordable to keep. For large data sets, the cost savings can be immense, as this article <a href="https://medium.com/@abhidutty/optimize-data-storage-costs-by-70-using-databricks-snowflake-aws-s3-332f44949e93" target="_blank" rel="noopener noreffer">shows</a>: from ~$23/TB/month for Snowflake internal storage down to ~$12.50/TB on S3 Infrequent Access, or only ~$1/TB on S3 Glacier Deep Archive.</p>
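<p>To make those numbers tangible, here is a small worked example for a hypothetical 100 TB archive. The per-TB prices are the approximate figures quoted above, the 100 TB volume is an assumption for illustration, and request or retrieval fees are ignored:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Monthly storage cost for a hypothetical 100 TB archive (Oracle syntax).
-- Prices are the approximate per-TB figures quoted above; retrieval fees are ignored.
SELECT tier,
       usd_per_tb_month,
       100 * usd_per_tb_month AS usd_per_month_100tb
FROM (
    SELECT 'Snowflake internal storage' AS tier, 23.00 AS usd_per_tb_month FROM dual
    UNION ALL
    SELECT 'S3 Infrequent Access', 12.50 FROM dual
    UNION ALL
    SELECT 'S3 Glacier Deep Archive', 1.00 FROM dual
)
ORDER BY usd_per_tb_month DESC;
</code></pre></div>
<p>That works out to roughly $2,300 versus $1,250 versus $100 per month for storage alone. The colder the tier, the more retrieval time and retrieval fees matter, which is why this only pays off for data that is rarely read.</p>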
<p>Another handy side effect as the consumer of external table data is that the <strong>data is always up to date</strong>, because no refresh or update is needed. It goes without saying that this has its own downsides and can be a problem for the owner of the data if it&rsquo;s used in production and the ETL process reads large amounts of data through external tables. This will affect upstream apps running or owning this data.</p>
<p>That&rsquo;s why many use external tables in combination with materialized views (MVs) to truncate and recreate a daily snapshot (or similar) of this data during off-peak hours (mostly at night). This avoids putting load on production systems and lets you optimize downstream query performance with added indexes.</p>
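<p>A minimal sketch of that snapshot pattern in Oracle could look like the following. It assumes an external table named <code>admin_ext_employees</code> (the one defined later in this article); the materialized view and index names are hypothetical, and scheduling the nightly refresh is left out:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Snapshot of the external data, stored inside the database.
-- mv_employees_snapshot and ix_mv_employees_dept are hypothetical names.
CREATE MATERIALIZED VIEW mv_employees_snapshot
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT * FROM admin_ext_employees;

-- Index the snapshot for downstream queries.
CREATE INDEX ix_mv_employees_dept
  ON mv_employees_snapshot (department_id);

-- Run during off-peak hours: a complete, non-atomic refresh
-- truncates the snapshot and reloads it from the external files.
BEGIN
  DBMS_MVIEW.REFRESH(
    list           =&gt; 'MV_EMPLOYEES_SNAPSHOT',
    method         =&gt; 'C',
    atomic_refresh =&gt; FALSE
  );
END;
/
</code></pre></div>
<p>Downstream queries then read from the snapshot, so the source files are only touched once per refresh.</p>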
<h3 id="when-internal-and-when-external-data-whats-the-limit-of-external">When Internal and When External Data? What&rsquo;s the Limit of External?</h3>
<p>The tradeoffs to consider come down to how often the data is queried, i.e. the hot versus cold question. The table below shows this in more detail:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th><strong>Internal Storage</strong></th>
          <th><strong>External Tables</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Temperature</td>
          <td><strong>Hot</strong>: recent data, lasts weeks to months</td>
          <td><strong>Cold</strong>: archival or infrequently touched</td>
      </tr>
      <tr>
          <td>Typical use case</td>
          <td>Dashboards, frequent queries, sub-second latency</td>
          <td>Archival, ad-hoc exploration, augmenting a data lake</td>
      </tr>
      <tr>
          <td>Query speed</td>
          <td>Fast, optimized for repeated access</td>
          <td>Slower (a 1.3×–1.7× tax in the below dashboard benchmark)</td>
      </tr>
      <tr>
          <td>Storage cost</td>
          <td>Higher (warehouse-managed, ~$23/TB on Snowflake capacity)</td>
          <td>Lower: up to ca. 20× cheaper on S3 Glacier Deep Archive (~$1/TB)</td>
      </tr>
      <tr>
          <td>Data freshness</td>
          <td>Can go stale between ETL refreshes</td>
          <td>Always up to date, no refresh needed</td>
      </tr>
      <tr>
          <td>Setup effort</td>
          <td>Requires ETL pipelines, scripts or re-ingestion</td>
          <td>Simple DDL-only definition, data stays in place</td>
      </tr>
      <tr>
          <td>Scaling concern</td>
          <td>Disk grows faster than compute needs</td>
          <td>Heavy reads can affect upstream apps owning the source files</td>
      </tr>
      <tr>
          <td>Operational overhead</td>
          <td>Predictable, managed by the warehouse</td>
          <td>Small-file problem and manifest management for tiny or streaming datasets</td>
      </tr>
  </tbody>
</table>
<p>In the era of data lake and lakehouse architectures, this is an important consideration. VSCO <a href="https://eng.vsco.co/querying-s3-data-with-redshift-spectrum/" target="_blank" rel="noopener noreffer">says</a>: &ldquo;disk space was growing more quickly than our compute needs,&rdquo; which is what triggered the adoption of external tables.</p>
<p>If you need to do analytics across various sources, with joins and augmentation of your data at enterprise scale, you probably want to focus on loading data into your database or data warehouse, an architectural pattern that has survived more than 30 years. But if you have data that is external and small and you want to join it with existing data, or you always need fresh data and can live with a slower response time (maybe because it runs during the night), you might use external tables.</p>
<p>In any case, external tables are a good approach to keep in mind and a valuable addition to your <a href="https://motherduck.com/blog/data-engineering-toolkit-essential-tools/" target="_blank" rel="noopener noreffer">toolkit</a>.</p>
<h3 id="they-work-well-with-existing-tech-and-common-patterns">They Work Well with Existing Tech and Common Patterns</h3>
<p>Obviously, today&rsquo;s external tables are not the same as the earliest ones in Microsoft Access, but the principle of accessing data outside your system is still the same. Nowadays we have broader support and new formats besides CSV and JSON, such as Parquet or open table formats.</p>
<p>As mentioned, they work well with related long-lasting data warehouse patterns and applications such as materialized views and stored procedures. The recurring pattern is to access external data with your data management system, similar to the pattern of materialized views that refresh complex SQL statements and make them fast, and stored procedures that run glue code within your database.</p>
<p>Temporary tables are similar, but only available during a transaction or session. They all follow the same Lindy effect: Databricks <a href="https://www.databricks.com/blog/introducing-temporary-tables-databricks-sql" target="_blank" rel="noopener noreffer">announced temporary table support</a> on December 9th, 2025, and SQL stored procedures a <a href="https://www.databricks.com/blog/introducing-sql-stored-procedures-databricks" target="_blank" rel="noopener noreffer">little earlier</a>, on August 14th, 2025, for reusing existing SQL statements.</p>
<p>Again and again, <strong>everything that is old will be new again</strong>. Exactly what the Lindy Effect is all about. We can clearly say that the Lindy effect over the last 33 years applies here. The longer something is in place, the more likely it is to be around for at least that long.</p>
<blockquote>
<p>[!info] External vs. Temporary Table</p>
<p>In contrast: temp table = session-scoped, writable, fast, invisible to others, auto-dropped. External table = persistent metadata, read-only, infinite size, visible to all, optimized for cost.</p>
<p>A common chain in practice is going from: <code>external table → temp/transient table → permanent managed table</code>.</p>
</blockquote>
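<p>The chain from the note above can be sketched in Oracle SQL as follows. This is an illustration under assumptions: <code>admin_ext_employees</code> is the external table defined later in this article, while <code>stg_employees</code> and <code>employees</code> are hypothetical staging and target tables. Note that in Oracle the definition of a global temporary table persists, and only its rows are private to the session:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- 1) External to temporary: stage the file contents in a session-private table.
CREATE GLOBAL TEMPORARY TABLE stg_employees
  ON COMMIT PRESERVE ROWS
AS
SELECT * FROM admin_ext_employees;

-- 2) Clean up in the staging table, where writes are allowed.
DELETE FROM stg_employees
WHERE  employee_id IS NULL;

-- 3) Temporary to permanent: publish into the managed target table.
--    employees is a hypothetical internal table with matching columns.
INSERT INTO employees
       (employee_id, first_name, last_name, job_id, manager_id,
        hire_date, salary, commission_pct, department_id, email)
SELECT employee_id, first_name, last_name, job_id, manager_id,
       hire_date, salary, commission_pct, department_id, email
FROM   stg_employees;

COMMIT;
</code></pre></div>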
<h3 id="how-a-classical-external-table-works">How a Classical External Table Works</h3>
<p>To understand how traditional external tables work, let&rsquo;s first look at Oracle, which has built an extensive syntax around them and where they still work this way today.</p>
<p>First, we create a place for external data, a <code>DIRECTORY</code> object, which is simply a pointer or alias to a file system location where external files already exist:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">OR</span><span class="w"> </span><span class="k">REPLACE</span><span class="w"> </span><span class="n">DIRECTORY</span><span class="w"> </span><span class="n">admin_dat_dir</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">AS</span><span class="w"> </span><span class="s1">&#39;/flatfiles/data&#39;</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>This directory can point to local file systems, NFS mounts, or even cloud object storage today (with the <code>ORACLE_BIGDATA</code> driver for S3, OCI, Azure). A <code>DIRECTORY</code> doesn&rsquo;t require moving data, though you could prepare those files via ETL pipelines or third-party tools, or they can be generated directly by applications.</p>
<p>We can now create an external table on top of this directory, for example over log files, rejected rows that we store externally, or JSON files, and make that data visible in the database catalog (Oracle&rsquo;s data dictionary, its counterpart to the <a href="https://en.wikipedia.org/wiki/Information_schema" target="_blank" rel="noopener noreffer">INFORMATION_SCHEMA</a>) and queryable with plain SQL, as if it were internal.</p>
<p>Creating an external table:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">admin_ext_employees</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                   </span><span class="p">(</span><span class="n">employee_id</span><span class="w">       </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">4</span><span class="p">),</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">first_name</span><span class="w">        </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">20</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">last_name</span><span class="w">         </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">25</span><span class="p">),</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">job_id</span><span class="w">            </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">manager_id</span><span class="w">        </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">4</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">hire_date</span><span class="w">         </span><span class="nb">DATE</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">salary</span><span class="w">            </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">commission_pct</span><span class="w">    </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">department_id</span><span class="w">     </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">4</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="n">email</span><span class="w">             </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">25</span><span class="p">)</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">                   </span><span class="p">)</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">     </span><span class="n">ORGANIZATION</span><span class="w"> </span><span class="k">EXTERNAL</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">     </span><span class="p">(</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="k">TYPE</span><span class="w"> </span><span class="n">ORACLE_LOADER</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="k">DEFAULT</span><span class="w"> </span><span class="n">DIRECTORY</span><span class="w"> </span><span class="n">admin_dat_dir</span><span class="w">  </span><span class="c1">-- the directory object created above
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="k">ACCESS</span><span class="w"> </span><span class="k">PARAMETERS</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="p">(</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="n">records</span><span class="w"> </span><span class="n">delimited</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="n">newline</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="n">badfile</span><span class="w"> </span><span class="n">admin_bad_dir</span><span class="p">:</span><span class="s1">&#39;empxt%a_%p.bad&#39;</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="n">logfile</span><span class="w"> </span><span class="n">admin_log_dir</span><span class="p">:</span><span class="s1">&#39;empxt%a_%p.log&#39;</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="n">fields</span><span class="w"> </span><span class="n">terminated</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="s1">&#39;,&#39;</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="n">missing</span><span class="w"> </span><span class="n">field</span><span class="w"> </span><span class="k">values</span><span class="w"> </span><span class="k">are</span><span class="w"> </span><span class="k">null</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="p">(</span><span class="w"> </span><span class="n">employee_id</span><span class="p">,</span><span class="w"> </span><span class="n">first_name</span><span class="p">,</span><span class="w"> </span><span class="n">last_name</span><span class="p">,</span><span class="w"> </span><span class="n">job_id</span><span class="p">,</span><span class="w"> </span><span class="n">manager_id</span><span class="p">,</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">           </span><span class="n">hire_date</span><span class="w"> </span><span class="nb">char</span><span class="w"> </span><span class="n">date_format</span><span class="w"> </span><span class="nb">date</span><span class="w"> </span><span class="n">mask</span><span class="w"> </span><span class="s2">&#34;dd-mon-yyyy&#34;</span><span class="p">,</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">           </span><span class="n">salary</span><span class="p">,</span><span class="w"> </span><span class="n">commission_pct</span><span class="p">,</span><span class="w"> </span><span class="n">department_id</span><span class="p">,</span><span class="w"> </span><span class="n">email</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="p">)</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="p">)</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="k">LOCATION</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;empxt1.dat&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;empxt2.dat&#39;</span><span class="p">)</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">     </span><span class="p">)</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">     </span><span class="n">PARALLEL</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">     </span><span class="n">REJECT</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="n">UNLIMITED</span><span class="p">;</span><span class="w"> 
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>The first and most important choice is <code>TYPE</code>, which determines the access driver and what kind of files you can read: <code>ORACLE_LOADER</code> for plain text files like CSV or logs (read-only), <code>ORACLE_DATAPUMP</code> for Oracle binary dump files, <code>ORACLE_BIGDATA</code> for cloud object stores like S3 or OCI in formats like Parquet or Avro, and <code>ORACLE_HIVE</code> for Hadoop/Hive data. The <code>DEFAULT DIRECTORY</code> points to a server-side path alias, and <code>LOCATION</code> names the actual file(s), with wildcard support (<code>*.dat</code>) so you can load a whole batch at once.</p>
<p>The <code>ACCESS PARAMETERS</code> block is where you control parsing: row and field delimiters, null handling, custom date format masks, and where to write bad rows (<code>badfile</code>) and parse logs (<code>logfile</code>). On top of that, <code>PARALLEL</code> lets Oracle split file reading across multiple processes for large files, and <code>REJECT LIMIT</code> controls fault tolerance. Set it to <code>UNLIMITED</code> to skip bad rows silently, or <code>0</code> to fail immediately on the first error.</p>
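<p>Note that the <code>badfile</code> and <code>logfile</code> clauses in the example refer to two more directory objects, <code>admin_bad_dir</code> and <code>admin_log_dir</code>, which also have to exist. A minimal sketch of that setup, assuming hypothetical file system paths and a hypothetical <code>hr</code> user that owns the external table:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Directory objects for rejected rows and parse logs.
-- The paths and the hr user are assumptions for illustration.
CREATE OR REPLACE DIRECTORY admin_bad_dir AS '/flatfiles/bad';
CREATE OR REPLACE DIRECTORY admin_log_dir AS '/flatfiles/log';

-- The table owner reads the data files and writes the bad and log files.
GRANT READ  ON DIRECTORY admin_dat_dir TO hr;
GRANT WRITE ON DIRECTORY admin_bad_dir TO hr;
GRANT WRITE ON DIRECTORY admin_log_dir TO hr;
</code></pre></div>
<p>Keeping data, bad, and log files in separate directories lets you grant read-only access to the source files while still allowing the access driver to write its diagnostics.</p>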
<p>That is a lot of built-in functionality compared to building a full-fledged data pipeline. Instead of exporting and importing CSVs from the source databases or developing a complex CDC pipeline that traditionally looked something like <code>source OLTP --&gt; CSVs --&gt; IDW (reports on yesterday) --&gt; ingest into DWH for long-term analytics</code>, we can just define a table based on external data and access it as part of our pipeline.</p>
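<p>Once the external table exists, using it looks like using any other table. The sketch below assumes the <code>admin_ext_employees</code> table from the example above and a hypothetical internal <code>employees</code> table with matching columns:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Query the flat files directly, as if they were a regular table.
SELECT department_id,
       COUNT(*)    AS employee_count,
       AVG(salary) AS avg_salary
FROM   admin_ext_employees
GROUP  BY department_id
ORDER  BY department_id;

-- Or load them into an internal table in a single statement.
-- employees is a hypothetical target table with matching columns.
INSERT INTO employees
       (employee_id, first_name, last_name, job_id, manager_id,
        hire_date, salary, commission_pct, department_id, email)
SELECT employee_id, first_name, last_name, job_id, manager_id,
       hire_date, salary, commission_pct, department_id, email
FROM   admin_ext_employees;

COMMIT;
</code></pre></div>
<p>Every query re-reads and re-parses the files, which is the price for always seeing the current file contents.</p>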
<blockquote>
<p>[!tip] The INFORMATION_SCHEMA analogy</p>
<p>You are probably familiar with the INFORMATION_SCHEMA of a database. It&rsquo;s the <strong>internal data catalog</strong> that most databases provide, and it contains a <strong>list of all tables and all metadata</strong> such as columns, data types, etc. The neat thing is that external tables show up there just like internal tables once defined.</p>
</blockquote>
<h2 id="whats-the-modern-version-of-external-tables-today">What&rsquo;s the Modern Version of External Tables Today?</h2>
<p>To preface: the previous Oracle example shows the <code>CREATE EXTERNAL TABLE</code> syntax, where the external table is a first-class DDL object in the data catalog. What follows in this chapter is the next evolution, where external tables are not necessarily created with DDL but still achieve the same outcome: querying data in place without loading it. Let&rsquo;s see what these are.</p>
<h3 id="integrated-into-warehouses">Integrated into Warehouses</h3>
<p>Most modern warehouses - Snowflake, Redshift Spectrum, BigQuery, Athena, Synapse - come with a simplified version of <code>CREATE EXTERNAL TABLE</code>. Compared to the Oracle example, the schema is usually inferred from the file format (especially Parquet), S3 or another object store is the default backing location, and the parsing ceremony disappears. The pseudo-code looks roughly like this across engines:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- Pseudo-code: modern external table over Parquet on S3
</span></span></span><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">EXTERNAL</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">sales</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">LOCATION</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;s3://my-bucket/sales/&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">FORMAT</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;PARQUET&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">);</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Object storage like S3, GCS, and Azure Blob has become the first-class citizen for external data. From here, the ecosystem layers on: dbt wraps this in YAML, DuckDB skips the DDL entirely in favor of schema-on-read, and open table formats add transactional guarantees on top.</p>
<h3 id="external-tables-with-dbt">External Tables with dbt?</h3>
<p>On top of this base SQL form, dbt adds a YAML layer through its own package, <a href="https://github.com/dbt-labs/dbt-external-tables" target="_blank" rel="noopener noreffer"><code>dbt-external-tables</code></a>. It&rsquo;s one of the most-used dbt packages, though it seems less actively maintained now.</p>
<p>The external table is defined via YAML, and there are lots of options to set, with the most important being <code>external</code> and its <code>location</code>, but also defining <code>columns</code> in different ways such as inference or the <code>meta</code> tag:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="m">2</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">sources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">snowplow</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">tables</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">event</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="p">&gt;</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">            This source table is actually a set of files in external storage.
</span></span></span><span class="line"><span class="cl"><span class="sd">            The dbt-external-tables package provides handy macros for getting
</span></span></span><span class="line"><span class="cl"><span class="sd">            those files queryable, just in time for modeling.
</span></span></span><span class="line"><span class="cl"><span class="sd">                            </span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">external</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">location:         # required</span><span class="p">:</span><span class="w"> </span><span class="l">S3 file path, GCS file path, Snowflake stage, Synapse data source</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="l">...              </span><span class="w"> </span><span class="c"># database-specific properties of external table</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">partitions</span><span class="p">:</span><span class="w">       </span><span class="c"># optional</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">collector_date</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">              </span><span class="nt">data_type</span><span class="p">:</span><span class="w"> </span><span class="l">date</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">              </span><span class="l">...          </span><span class="w"> </span><span class="c"># database-specific properties</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="c"># Specify ALL column names + datatypes.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="c"># Column order must match for CSVs, column names must match for other formats.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="c"># Some databases support schema inference.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">columns</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">app_id</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">data_type</span><span class="p">:</span><span class="w"> </span><span class="l">varchar(255)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Application ID&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">platform</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">data_type</span><span class="p">:</span><span class="w"> </span><span class="l">varchar(255)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Platform&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="l">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="c"># Use `meta` to pass custom column properties (e.g. alias, expression)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">columns</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">raw_timestamp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">data_type</span><span class="p">:</span><span class="w"> </span><span class="l">timestamp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">config</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">              </span><span class="nt">meta</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="nt">alias</span><span class="p">:</span><span class="w"> </span><span class="l">event_timestamp      </span><span class="w"> </span><span class="c"># rename the column in the external table</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="nt">expression</span><span class="p">:</span><span class="w"> </span><span class="l">TO_TIMESTAMP(...)</span><span class="w"> </span><span class="c"># custom SQL expression instead of default value extraction</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>This is a nice improvement over the ODBC GUI. It&rsquo;s not exactly an apples-to-apples comparison, as dbt itself is not a database, but its supported destinations, such as Redshift (Spectrum), Snowflake, BigQuery, Spark, Synapse, and Azure SQL, show where the external table definition ends up being persisted: mostly in data warehouses.</p>
<h3 id="duckdb-with-dbt">DuckDB with dbt</h3>
<p>If you use dbt, you can also use DuckDB with dbt via <a href="https://github.com/duckdb/dbt-duckdb" target="_blank" rel="noopener noreffer">dbt-duckdb</a>, which is more up-to-date. But DuckDB is not an external table, right?</p>
<p>Right, DuckDB doesn&rsquo;t have <code>CREATE EXTERNAL TABLE</code> syntax <a href="https://github.com/duckdb/duckdb/discussions/14422" target="_blank" rel="noopener noreffer">yet</a>, mostly because it is an in-process database, but you can achieve the same functionality through other means. DuckDB can be used not only as a database but also as a zero-copy SQL connector (see all categories at <a href="/blog/enterprise-case-duckdb-key-categories/" rel="">5 Key Categories</a>). We can just point it to an external source, as shown above with dbt. The difference is that DuckDB is both a database and a compute engine, which makes ad-hoc reads possible directly without a DDL definition, while the result is similar to an external table with Oracle loaders. With dbt, we can nicely declare this in dbt configs.</p>
<p>With DuckDB, you can query &ldquo;external data&rdquo; extremely fast over HTTPS or locally in formats such as Parquet, CSV, and <a href="https://duckdb.org/docs/current/data/data_sources" target="_blank" rel="noopener noreffer">many more</a>, so the need for formal external tables is reduced since DuckDB does <strong>schema on read</strong>.</p>
<p>If you want to define the database schema ahead of time, you&rsquo;d use external tables to do that and effectively have <strong>schema on write</strong> (though nothing is written, you just define the DDL table structure and data types), which is closer to the classical ETL approach.</p>
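<p>The contrast between the two approaches can be sketched in a few lines of Python. This is a toy illustration over an in-memory CSV, not DuckDB&rsquo;s inference logic: the helper names, the two type labels, and the sample data are assumptions.</p>

```python
import csv
import io

RAW = "app_id,events\nshop,12\nblog,7\n"


def infer_type(values):
    """Schema on read: guess a column type from the values at query time."""
    try:
        [int(v) for v in values]
        return "BIGINT"
    except ValueError:
        return "VARCHAR"


def read_inferred(text):
    """Read the file first, then derive the schema from what was found."""
    rows = list(csv.DictReader(io.StringIO(text)))
    schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
    return schema, rows


def read_declared(text, schema):
    """Declared schema: cast every value to the type fixed ahead of time."""
    casts = {"BIGINT": int, "VARCHAR": str}
    return [
        {col: casts[typ](row[col]) for col, typ in schema.items()}
        for row in csv.DictReader(io.StringIO(text))
    ]


inferred_schema, _ = read_inferred(RAW)
declared_rows = read_declared(RAW, {"app_id": "VARCHAR", "events": "BIGINT"})
print(inferred_schema)
print(declared_rows)
```

<p>With inference, a stray text value quietly turns the whole column into <code>VARCHAR</code>; with a declared schema, the same value raises an error at read time. That trade-off is the practical difference between the two styles.</p>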
<p>Here&rsquo;s an example with <code>external_location</code> to read external data with dbt:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">sources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">external_source</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">config</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">external_location</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;s3://my-bucket/my-sources/{name}.parquet&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">tables</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">source1</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Read more at <a href="https://duckdb.org/2025/04/04/dbt-duckdb" target="_blank" rel="noopener noreffer">Fully Local Data Transformation with dbt and DuckDB</a>.</p>
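<p>The <code>{name}</code> placeholder in <code>external_location</code> is filled in per source table. The Python sketch below only mimics that substitution to show what the resolved path and query look like; the functions <code>resolve_location</code> and <code>select_from_source</code> are hypothetical and not part of dbt-duckdb.</p>

```python
# Template copied from the YAML above; the resolver itself is hypothetical.
EXTERNAL_LOCATION = "s3://my-bucket/my-sources/{name}.parquet"


def resolve_location(template, table_name):
    """Fill the {name} placeholder with the source table's name."""
    return template.format(name=table_name)


def select_from_source(template, table_name):
    """Build the kind of query that reads the resolved file in place."""
    return f"SELECT * FROM '{resolve_location(template, table_name)}'"


print(resolve_location(EXTERNAL_LOCATION, "source1"))
print(select_from_source(EXTERNAL_LOCATION, "source1"))
```

<p>One template therefore covers every table listed under the source, as long as the file names follow the table names.</p>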
<p>Another option is a database view: DuckDB supports <strong><code>CREATE VIEW</code> over <code>read_parquet()</code></strong>. You can ship a .duckdb file to clients with pre-defined views over S3 data, so clients don&rsquo;t need to know about the underlying data, Hive partitioning, or even glob patterns, which is very similar to what a formal <code>CREATE EXTERNAL TABLE</code> would do.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">read_parquet</span><span class="p">(</span><span class="s1">&#39;s3://lake/events/*.parquet&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">hive_partitioning</span><span class="o">=</span><span class="k">true</span><span class="p">);</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Or similarly use <code>ATTACH</code> to directly point to Postgres, MySQL, SQLite, S3, and others:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- Postgres (binary wire protocol, predicate + projection pushdown, read+write)
</span></span></span><span class="line"><span class="cl"><span class="n">INSTALL</span><span class="w"> </span><span class="n">postgres</span><span class="p">;</span><span class="w"> </span><span class="k">LOAD</span><span class="w"> </span><span class="n">postgres</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">ATTACH</span><span class="w"> </span><span class="s1">&#39;dbname=postgres user=postgres host=127.0.0.1&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">pg</span><span class="w"> </span><span class="p">(</span><span class="k">TYPE</span><span class="w"> </span><span class="n">postgres</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">ATTACH</span><span class="w"> </span><span class="s1">&#39;postgresql://user@host/db&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">pg_ro</span><span class="w"> </span><span class="p">(</span><span class="k">TYPE</span><span class="w"> </span><span class="n">postgres</span><span class="p">,</span><span class="w"> </span><span class="n">READ_ONLY</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- MySQL (via MariaDB Connector/C; Postgres-style keyvalue string even for MySQL — easy trap)
</span></span></span><span class="line"><span class="cl"><span class="n">INSTALL</span><span class="w"> </span><span class="n">mysql</span><span class="p">;</span><span class="w"> </span><span class="k">LOAD</span><span class="w"> </span><span class="n">mysql</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">ATTACH</span><span class="w"> </span><span class="s1">&#39;host=localhost user=root port=0 database=mysql&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">mdb</span><span class="w"> </span><span class="p">(</span><span class="k">TYPE</span><span class="w"> </span><span class="n">mysql</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- SQLite (file opens directly; multi-reader single-writer by SQLite file locks)
</span></span></span><span class="line"><span class="cl"><span class="n">INSTALL</span><span class="w"> </span><span class="n">sqlite</span><span class="p">;</span><span class="w"> </span><span class="k">LOAD</span><span class="w"> </span><span class="n">sqlite</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">ATTACH</span><span class="w"> </span><span class="s1">&#39;sakila.db&#39;</span><span class="w"> </span><span class="p">(</span><span class="k">TYPE</span><span class="w"> </span><span class="n">sqlite</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- Generic remote DuckDB file
</span></span></span><span class="line"><span class="cl"><span class="n">ATTACH</span><span class="w"> </span><span class="s1">&#39;s3://duckdb-blobs/databases/stations.duckdb&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">stations_db</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><h3 id="open-table-formats-and-lakehouse-architecture">Open Table Formats and Lakehouse Architecture</h3>
<p>That raises the question of whether <a href="https://motherduck.com/blog/open-lakehouse-stack-duckdb-table-formats/" target="_blank" rel="noopener noreffer">Open Table Formats</a> are the next evolution and the modern form of external tables. These table formats allow almost any SQL compute engine to use them as external tables, and to read, compute, and aggregate as a database would.</p>
<p>If we look at what table formats consist of, they&rsquo;re built on object storage with a file format like Parquet, plus a manifest file that lists the data files and <strong>unifies multiple single files into a &ldquo;single&rdquo; table</strong> when viewed from the outside.</p>
<p>So again, the manifest file is our pointer or fancier symlink, but it lives next to the data, unlike external tables. There&rsquo;s much more going on in table formats, but if we have a <strong>data lake with open table format tables</strong>, we can see how we define tables in DDL and the <strong>pointers are to different files</strong> (Parquet, ORC, Avro), in most cases Parquet.</p>
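<p>The manifest-as-pointer idea can be sketched with nothing but the standard library. This is a deliberately simplified model, not the Iceberg or Delta layout: CSV files stand in for Parquet, and the <code>manifest.json</code> name and its <code>files</code> key are assumptions for illustration.</p>

```python
import csv
import json
import tempfile
from pathlib import Path


def write_part(path, rows):
    """Write one data file (CSV here, standing in for Parquet)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "amount"])
        writer.writerows(rows)


def read_table(manifest_path):
    """Resolve the manifest and return all rows as if it were one table."""
    manifest = json.loads(Path(manifest_path).read_text())
    base = Path(manifest_path).parent
    rows = []
    for name in manifest["files"]:
        with open(base / name, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_part(root / "part-0.csv", [(1, "9.50"), (2, "4.25")])
    write_part(root / "part-1.csv", [(3, "12.00")])
    # This file sits in the same folder but is not listed in the manifest.
    write_part(root / "orphan.csv", [(99, "0.00")])
    # The manifest lives next to the data files it points to.
    (root / "manifest.json").write_text(
        json.dumps({"files": ["part-0.csv", "part-1.csv"]})
    )
    table = read_table(root / "manifest.json")

print([row["order_id"] for row in table])
```

<p>Only the files named in the manifest belong to the table, so <code>orphan.csv</code> is ignored even though it sits in the same directory. Real table formats add snapshots, statistics, and transactions on top of this pointer list.</p>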
<p>More broadly, we can say external tables decouple storage from compute. Open table formats decouple the table itself (schema, history, transactions, statistics) from any single engine.</p>
<h3 id="lakehouse-and-connecting-to-ducklake">Lakehouse and Connecting to DuckLake</h3>
<p>One step further is obviously a lakehouse architecture, with the shift from <em>format-agnostic file reading</em> to <em>governed, transactional, multi-engine open table formats</em>.</p>
<p>If you extend the external table idea to a <a href="https://motherduck.com/blog/from-data-lake-to-lakehouse-duckdb-portable-catalog/" target="_blank" rel="noopener noreffer">lakehouse architecture</a>, these external tables with open table formats provide essentially what databases provide, but for files: ACID guarantees, time travel, schema evolution, partition evolution, and fine-grained access control.</p>
<p>The difference is that the data stays in the open Parquet file format on customer-owned cloud storage. If you like this analogy, the external table, once a humble workaround for avoiding data loads, has become the architectural foundation of the data lakehouse.</p>
<p>With <a href="https://ducklake.select/" target="_blank" rel="noopener noreffer">DuckLake</a>, we have the next evolution just around the corner, bringing back exactly that missing database, especially to handle all the metadata of such a lakehouse and all its files. This means having durable and consistent database storage for our <a href="https://iceberg.apache.org/spec/#manifests" target="_blank" rel="noopener noreffer">manifest files</a>.</p>
<h4 id="open-data-catalog-to-complete-the-picture-the-odbc-glue">Open Data Catalog to Complete the Picture: The ODBC Glue</h4>
<p>With all these evolutions, we&rsquo;ve come far. When adding an <a href="https://www.ssp.sh/brain/open-table-format-catalogs" target="_blank" rel="noopener noreffer">Open Data Catalog</a>, we are exactly where we started: having an INFORMATION_SCHEMA, a dictionary with all our tables, in this case the open table format tables.</p>
<p>It&rsquo;s the <strong>glue that ODBC provided when connecting a BI tool to the underlying database</strong>. Now you&rsquo;d like to have an open data catalog that, in the best-case scenario, gives you all the tables and ways to connect.</p>
<p>But then again, the syntax of <code>EXTERNAL TABLES</code> still gets added, and <a href="https://arrow.apache.org/docs/format/ADBC.html" target="_blank" rel="noopener noreffer">ADBC</a> and DuckDB are doing a great job of using external data without needing a data lake and its technology stack altogether. For example, DuckDB has support for <a href="https://duckdb.org/docs/current/core_extensions/odbc/overview" target="_blank" rel="noopener noreffer">ODBC</a>, <a href="https://duckdb.org/docs/current/clients/adbc" target="_blank" rel="noopener noreffer">ADBC</a> and even <a href="https://duckdb.org/docs/current/clients/java" target="_blank" rel="noopener noreffer">JDBC</a>. That matters especially for 3rd-party tools: ADBC streams Apache Arrow end-to-end instead of serializing row-by-row, so BI tools and notebooks can pull millions of rows directly from external Parquet tables at speeds that previously required keeping data &ldquo;hot&rdquo; in a cloud data warehouse. 🙂</p>
<blockquote>
<p>[!note] ADBC, what is that?<br>
ODBC is 30+ years old, and there is now a newer, faster alternative called <a href="https://arrow.apache.org/docs/format/ADBC.html" target="_blank" rel="noopener noreffer">ADBC</a>. It connects to databases through a column-oriented API instead of <strong>row-by-row serialization</strong>, making heavy use of Apache Arrow.</p>
<p>While ADBC is newer, it aims to cover the same databases as ODBC, with drivers that are faster and easier to install. For example, the handy <a href="https://github.com/columnar-tech/dbc" target="_blank" rel="noopener noreffer">dbc</a> CLI installs drivers for use from almost any programming language, so there is no more manual, error-prone downloading of ODBC drivers and definitions through a Windows GUI, just one CLI command.</p>
</blockquote>
<blockquote>
<p>[!tip] Using MotherDuck<br>
If you want a data warehouse that just works, integrates well with DuckDB, and has support for DuckLake, you can always use managed MotherDuck. You can build a classical data warehouse with plain SQL, you can read external data easily with DuckDB or dbt-duckdb, or <a href="https://motherduck.com/blog/announcing-ducklake-1-0-on-motherduck/" target="_blank" rel="noopener noreffer">integrate with DuckLake</a>.</p>
<p>It works great <a href="https://motherduck.com/blog/motherduck-agent-skills/" target="_blank" rel="noopener noreffer">with agents</a>. Check out MotherDuck&rsquo;s <a href="https://github.com/motherduckdb/agent-skills/" target="_blank" rel="noopener noreffer">agent-skills</a> for opinionated AI skills for building applications with MotherDuck. And <a href="https://motherduck.com/product/dives/" target="_blank" rel="noopener noreffer">visualize with Dives</a> with one prompt.</p>
</blockquote>
<h2 id="which-is-faster-a-quick-benchmark">Which Is Faster? A Quick Benchmark</h2>
<p>To put numbers behind the hot/cold decision, I ran a simple benchmark on the TPC-H SF=1 <code>lineitem</code> table (6M rows, ~150 MB), stored four ways: inside a DuckDB file (internal), as raw Parquet, as an Iceberg table, and as a DuckLake table. Full code: <a href="https://github.com/sspaeti/external-table-benchmark/blob/main/bench2.py" target="_blank" rel="noopener noreffer"><code>bench2.py</code></a> and <a href="https://github.com/sspaeti/external-table-benchmark/blob/main/metadata_bench.py" target="_blank" rel="noopener noreffer"><code>metadata_bench.py</code></a>.</p>
<p><strong>Dashboard workload (hot path)</strong>: 3 queries × 10 repeats:</p>
<table>
  <thead>
      <tr>
          <th>Backend</th>
          <th>Tier</th>
          <th>Median</th>
          <th>p95</th>
          <th>vs internal</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Internal (DuckDB)</td>
          <td>hot</td>
          <td>23.8 ms</td>
          <td>235 ms</td>
          <td><strong>1.0×</strong></td>
      </tr>
      <tr>
          <td>DuckLake</td>
          <td>cold</td>
          <td>45.1 ms</td>
          <td>269 ms</td>
          <td>1.3×</td>
      </tr>
      <tr>
          <td>External Parquet</td>
          <td>cold</td>
          <td>41.3 ms</td>
          <td>271 ms</td>
          <td>1.4×</td>
      </tr>
      <tr>
          <td>External Iceberg</td>
          <td>cold</td>
          <td>56.1 ms</td>
          <td>377 ms</td>
          <td>1.7×</td>
      </tr>
  </tbody>
</table>
<p>Internal is fastest; external pays a 1.3×–1.7× tax. But for <strong>cold/archival queries</strong> (one-off, no warmup), all four backends answered in under 150 ms. The speed difference effectively vanishes for data you query once a week.</p>
<p><strong>Storage cost</strong> is where external tables shine. Columnar Parquet is ~40% smaller than the native DuckDB format. Ten TB of archive data costs roughly $125/month on S3 Infrequent Access or $10/month on Glacier Deep Archive, versus roughly $230/month inside Snowflake on capacity pricing. This is the economic case external tables were invented for, and it still holds.</p>
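<p>The arithmetic behind these figures is easy to reproduce. The per-GB prices below are assumed list prices chosen to match the monthly totals quoted above; they vary by region and change over time, and the sketch ignores retrieval and request fees.</p>

```python
# Assumed list prices in USD; check current pricing for your region.
PRICE_PER_GB_MONTH = {
    "S3 Infrequent Access": 0.0125,
    "Glacier Deep Archive": 0.00099,
    "Snowflake capacity": 23.0 / 1000,  # assumed 23 USD per TB per month
}


def monthly_cost(terabytes, price_per_gb):
    """Monthly storage cost, using 1 TB = 1,000 GB for simplicity."""
    return terabytes * 1000 * price_per_gb


costs = {tier: monthly_cost(10, price) for tier, price in PRICE_PER_GB_MONTH.items()}
for tier, usd in costs.items():
    print(f"{tier}: {usd:.2f} USD/month")

ratio = costs["Snowflake capacity"] / costs["Glacier Deep Archive"]
print(f"warehouse vs. deep archive: {ratio:.1f}x")
```

<p>Swapping in your own volumes and negotiated prices is usually enough to see whether the gap is large enough to justify moving cold data out of the warehouse.</p>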
<p><strong>Metadata workload</strong> is where DuckLake stands out. Fifty single-row inserts showed DuckLake creating <strong>zero data files</strong> (rows inlined in the catalog) versus Iceberg&rsquo;s <strong>352 files</strong> (201 data + 151 metadata). That&rsquo;s the &ldquo;small file problem&rdquo; made concrete: at one write per second, Iceberg creates ~86,400 files per day needing compaction. DuckLake creates zero until you checkpoint. DuckDB Labs&rsquo; own benchmarks report up to <a href="https://ducklake.select/2026/04/02/data-inlining-in-ducklake/" target="_blank" rel="noopener noreffer">926× faster queries</a> on streaming workloads.</p>
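<p>Why inlining avoids the small file problem can be shown with a toy model. The two classes below are illustrative assumptions, not the DuckLake or Iceberg implementations, and they do not reproduce the exact file counts from the benchmark (a real Iceberg commit also writes metadata files). They only capture the difference between writing a file per commit and buffering rows in the catalog until a checkpoint.</p>

```python
class FilePerCommitTable:
    """Toy model: every commit writes its rows to a new data file."""

    def __init__(self):
        self.files = []

    def insert(self, row):
        self.files.append([row])


class InliningTable:
    """Toy model: small inserts stay in the catalog until a checkpoint."""

    def __init__(self):
        self.inlined = []  # rows held in the catalog database
        self.files = []

    def insert(self, row):
        self.inlined.append(row)

    def checkpoint(self):
        if self.inlined:
            self.files.append(self.inlined)  # one file for all buffered rows
            self.inlined = []


per_commit, inlining = FilePerCommitTable(), InliningTable()
for i in range(50):
    per_commit.insert({"id": i})
    inlining.insert({"id": i})

print(len(per_commit.files), len(inlining.files))
inlining.checkpoint()
print(len(inlining.files))
print(60 * 60 * 24)  # commits per day at one write per second
```

<p>In this model, fifty single-row inserts leave fifty files in one table and none in the other until the checkpoint folds them into a single file. The last line is the daily commit count at one write per second, which is where the 86,400 figure comes from.</p>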
<h2 id="so-should-you-use-external-tables">So Should You Use External Tables?</h2>
<p>So after all this, should you use external tables today? After seeing how sticky they&rsquo;ve been since Oracle 9i in 2001, how they keep getting re-added to newer tools (Snowflake in 2021, Databricks Unity Catalog, BigLake in 2022), and how their core benefit, accessing data where it lives without moving it via a simple DDL statement, has only grown more valuable as formats have evolved from CSV to Parquet, JSON, Avro, and now open table formats, I&rsquo;d say yes. But choose wisely based on your data&rsquo;s temperature: use internal storage for hot data, such as dashboards and frequently used queries.</p>
<p>Use external tables for cold data, archival workloads, and ad-hoc exploration, where the speed gap vanishes and storage costs plummet (up to 20× cheaper on Glacier Deep Archive vs. warehouse-managed storage). And if you already use dbt, DuckDB, or a lakehouse stack, the modern versions are right there. Where they&rsquo;re the <em>wrong</em> choice is the inverse: transactional workloads, queries that need sub-second latency on every run, or data so small that the operational overhead of an external stage outweighs the benefit of not loading it.</p>
<p>The evolution is worth naming explicitly: &ldquo;read CSVs on disk&rdquo; → &ldquo;read Parquet on HDFS&rdquo; → &ldquo;read Parquet on S3 via a metastore&rdquo; → &ldquo;read Iceberg/Delta tables with ACID on S3&rdquo; → &ldquo;the Iceberg table <em>is</em> the warehouse table&rdquo;. Each step kept the core idea (data stays where it lives, metadata describes it, SQL queries it) and added database semantics back in. With open data catalogs, the warehouse becomes a <strong>stateless rental over a bucket you own</strong>, and external tables are increasingly managed. DuckLake demonstrates this best: when the catalog has SQL-DB-like guarantees, the distinction between &ldquo;external&rdquo; and &ldquo;internal&rdquo; dissolves. The metadata benchmark made this concrete by reading a single indexed row rather than walking a manifest tree.</p>
<p>The <strong>database semantics are returning</strong> with DuckLake, managed Iceberg, and predictive optimization, all of which reintroduce RDBMS-style guarantees to the lake. The cycle from &ldquo;external table for cheap storage&rdquo; to &ldquo;external table as a full ACID database on S3&rdquo; took 25 years, completing the journey back to database principles while maintaining the separation of storage and compute. You could say <strong>the modern external table isn&rsquo;t external anymore</strong>. DuckDB reads them directly, and DuckLake handles the metadata that multifile lakehouse architectures would otherwise drown in. The lesson from history is that whenever someone tries to replace the pattern, reading data in place still wins over moving it. And the Lindy Effect suggests that if external tables have lasted 25 years and keep getting re-added, they&rsquo;ll persist another 25. They&rsquo;re probably not going anywhere. 🙂</p>
<hr>
<pre class=""><em>Full article published at <a href="https://motherduck.com/blog/internal-vs-external-storage-whats-the-limit-of-external-tables/" target="_blank" rel="noopener noreferrer">MotherDuck.com</a> - written as part of <a href="/services">my services</a></em></pre>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>A so-called loader that lets you access the data via a driver: see the ORACLE_LOADER Access Driver example: <a href="https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sutil/oracle_loader-access-driver.html" target="_blank" rel="noopener noreffer">https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sutil/oracle_loader-access-driver.html</a>&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Also see the latest release notes of BigQuery from April 2026; lots of it has to do with &ldquo;external catalogs&rdquo; and also BigQuery Apache Iceberg external tables now support Iceberg version 3: <a href="https://docs.cloud.google.com/bigquery/docs/release-notes#April_21_2026" target="_blank" rel="noopener noreffer">https://docs.cloud.google.com/bigquery/docs/release-notes#April_21_2026</a>&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>Customers can also choose to create materialized views on external tables to speed up the query performance significantly.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
</item>
<item>
    <title>AI Reveals Why BI Still Matters</title>
    <link>https://www.ssp.sh/blog/bi-is-not-dead-2026/</link>
    <pubDate>Tue, 21 Apr 2026 08:41:06 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/bi-is-not-dead-2026/</guid><enclosure url="https://www.ssp.sh/blog/bi-is-not-dead-2026/featured-image.jpg" type="image/jpeg" length="0" /><description><![CDATA[<p>Ask a BI engineer what they actually spend their time on: it&rsquo;s not building dashboards. More often: fixing the join that broke in the overnight pipeline, untangling the metric definition that means three different things to three different teams, or getting last week&rsquo;s numbers into an Excel by Monday morning. The dashboard was always the easy part.</p>
<p>This article looks at how BI evolved, how dashboards are actually used today, and what survives when AI enters the picture — starting with the foundation that was never really about dashboards in the first place, and ending with the problem nobody in the AI hype cycle wants to talk about: who maintains it all.</p>
<h2 id="the-verdict-of-people-in-the-field-bi-is-dead-again">The Verdict of People in the Field: BI is Dead (Again)</h2>
<p>We&rsquo;ve heard it all: business intelligence (BI), and especially dashboards, are dead. Yet every time, we rediscover its power, and it is resurrected whenever an enterprise or startup needs grounded data analysis. It&rsquo;s the same way Excel never dies; it is arguably still the most used BI tool.</p>
<p>If we look at what others from the data world say, it does sound similar. Hex says that <a href="https://hex.tech/blog/dashboards-were-never-the-destination/" target="_blank" rel="noopener noreffer"><strong>dashboards were never the destination</strong></a> and specifies:</p>
<blockquote>
<p>Static reporting surfaces were always a workaround for messy data and limited tooling — agentic analytics finally closes the gap between visibility and genuine insight.</p>
</blockquote>
<p>In other words, dashboards were a workaround created because data was messy, tooling was restrictive, and asking open-ended questions of the warehouse wasn’t possible without coding. Hex continues that dashboards create more questions than they answer.</p>
<p>But they also say: respect dashboards, but stop treating them as the goal:</p>
<blockquote>
<p>Dashboards still matter. They’re <strong>excellent for reporting KPIs</strong>, surfacing <strong>operational signals</strong>, and <strong>aligning leaders</strong> around shared metrics. They allow data teams to be creative in how they display these metrics and will remain a primary surface for need-to-know numbers.</p>
</blockquote>
<p>Dashboards don&rsquo;t reason or explain why something happens, as the Hex article says. Hex argues that <strong>reliable, timely, context-aware answers</strong> are the destination.</p>
<p>Mike, CEO of Rill, <a href="https://www.linkedin.com/posts/medriscoll_agents-are-blind-they-cant-see-dashboards-activity-7442685222657277952-LiRw/" target="_blank" rel="noopener noreffer">says</a>:</p>
<blockquote>
<p><strong>Agents are blind</strong>—they can’t see dashboards. But they do need access to the primitives behind them. AI, and agentic working won’t kill BI, but it will make dimensional modeling, metrics, OLAP cubes, query performance, and governance more important than pretty charts. Agents will reveal what many of us have known for a long time: <strong>BI was never about dashboards</strong>.</p>
</blockquote>
<p>Benn Stancil was already saying in 2021 that <a href="https://benn.substack.com/p/is-bi-dead" target="_blank" rel="noopener noreffer">BI is dead</a>, where he drew parallels to the Salesforce &ldquo;End of Software&rdquo; declaration in 2000. This is interesting because it&rsquo;s almost the same statement we hear today with AI: no more software developers needed. Benn argued that the original BI stack was one tool covering every layer, but that by 2021, tools were unbundling into <strong><a href="https://benn.substack.com/p/is-bi-dead" target="_blank" rel="noopener noreffer">dedicated and specialized tools for each layer</a></strong>.</p>
<p>The proposed future of BI should focus <em>only</em> on data consumption for humans, integrating both self-serve applications and deep ad-hoc analysis. This new BI would be &ldquo;legless&rdquo; (the opposite of headless, he argued back then), relying on global governance layers rather than proprietary semantic layers, and fostering cross-functional collaboration.</p>
<p>The aim is a universal tool for all data consumption, moving beyond its current diminished state of &ldquo;visualization and reporting&rdquo;. Again, very similar to today&rsquo;s discussion, where everything is about semantics and context. Benn concluded that &ldquo;<strong>BI is dead, long live BI</strong>&rdquo; 🙂.</p>
<p>Take any modern data and analytics discipline, and you’ll probably find it has its roots in the work that has been historically carried out by Business Intelligence developers, the OG jack-of-all-trades of the data industry.</p>
<blockquote>
<p>[!note] Find more Opinions and Articles</p>
<p>Other opinions from around the web include <a href="https://www.strategy.com/software/blog/bi-is-dead-long-live-business-intelligence" target="_blank" rel="noopener noreffer">BI is dead. Long live business intelligence.</a>, and Reddit discussions such as <a href="https://sh.reddit.com/r/analytics/comments/1nkc9gt/will_business_intelligence_skills_bi_be/" target="_blank" rel="noopener noreffer">Will Business Intelligence skills (BI) be irrelevant in like 3-4 years? : r/analytics</a> and <a href="https://sh.reddit.com/r/dataengineering/comments/1s9gd8f/ai_kill_bi/" target="_blank" rel="noopener noreffer">AI kill BI?</a>.</p>
</blockquote>
<h2 id="bi-was-never-just-about-dashboards">BI Was Never Just About Dashboards</h2>
<p>With respect to Benn&rsquo;s article in 2021, have we come full circle five years later, with the end of software engineering again, and everyone demanding semantics and metrics layers?</p>
<p>At least on the surface, it seems we are at the same point, but today we&rsquo;re going back to one &ldquo;Original BI&rdquo; stack as drawn in the image in the article, to a fully encompassing data platform. Maybe it does not need to be one single platform, but at least to the user it needs to be a <strong>single chat or AI interface</strong> that goes end-to-end through all the layers of discovery, visualization, transformation, storage, and ingestion, or in other words, the full data engineering lifecycle.</p>
<p>So what about dashboards then? Many declare the death of many things, and dashboards are a popular target too. That&rsquo;s even more true when AI, with its generative capabilities, can just one-shot your whole dashboard and create a full-blown web app or custom HTML page. And dashboards might die, as many <strong>don&rsquo;t actually want the dashboard</strong> itself, but the extracts from your large SAP, linked to the right customers from the CRM, enhancing the decision-making process even more.</p>
<p>They want the insights from the combination of all source data and business insights that your company has over its competitor, for example. It&rsquo;s the <strong>primitives behind the dashboards</strong> that matter more.</p>
<h4 id="when-you-still-need-a-dashboard-ai-chat-is-not-enough">When You Still Need a Dashboard (AI Chat is Not Enough)</h4>
<p>Even though a chat interface or an agent can provide you with <strong>dashboard information tailored to your question</strong> and in an explanatory written form, there&rsquo;s still a need for dashboards in certain situations.</p>
<p>The obvious one is the well-crafted <strong>operational dashboard</strong>, where you can see your whole company performing in a split second by looking at a single, highly dense dashboard with individual charts and visualizations tailored to convey information about each sub-area in the best possible way. It&rsquo;s the same way a map is still needed in self-driving cars: for quick verification, to get an overview, or in case the car gets lost.</p>
<figure><a target="_blank" href="/blog/bi-is-not-dead-2026/car-cockpit.webp" title="">

</a><figcaption class="image-caption">A good example of an operational dashboard in a car, where you need the numbers at all times | <a href="https://x.com/kyleanthony/status/2042572700468511133" target="_blank" rel="noopener noreffer">Tweet</a></figcaption>
</figure>
<p>The other obvious benefit of the operational dashboard, or any general dashboard, is that people can agree on numbers, as they&rsquo;re looking at the same agreed-upon dashboard. The calculations are the same for everyone, unlike individual Excel sheets that each compute things differently.</p>
<p>The easy creation of multiple dashboards on the fly or chatting with AI to get insights resembles the <strong>old way of using local Excel files</strong>. Everyone is doing their own thing, with no alignment, governance, or broader verification.</p>
<p>Maps are a different type of dashboard that can&rsquo;t be replaced. Geospatial data shown on a map still beats text. The same goes for a bar chart, where you immediately see the different proportions of stock in inventory across certain regions instead of first having to analyze all the numbers in a text-only chat reply.</p>
<h4 id="dashboards-as-a-sanity-check">Dashboards as a Sanity Check</h4>
<p>There are also some less obvious reasons. One is <strong>serendipitous discovery</strong>: anomalies and outliers that accidentally pop up in dashboards are easy to spot visually, but harder to see in chats. The same goes for <strong>ad-hoc BI</strong> when drilling down into more grain and dimensions for self-service users. A pivot table is a REPL for BI; that&rsquo;s not possible with chats.</p>
<p>Dashboards are also a lifeline to quickly check if the AI is hallucinating. What about <strong>determinism</strong>? Chat responses are not deterministic, and you might get different responses, hopefully with the same correct answer, but visualized differently because the model made different decisions the second time, or for a different user. AI agents are non-deterministic statistical models and probably always will be, so we need to bring more context and definitions to make outputs more consistent. One way is with a spec-driven development (SDD) first approach that helps define it more accurately each time.</p>
<p>The same applies to <strong><a href="https://www.rilldata.com/blog/data-modeling-for-the-agentic-era-semantics-speed-and-stewardship" target="_blank" rel="noopener noreffer">stewardship</a></strong>, reviewing and checking outcomes. As an example, consider self-driving cars such as Waymos: you don&rsquo;t need a map, but if the car is wrong or stuck, the first thing you&rsquo;d reach for is one.</p>
<p>But what does that mean for the future of BI, and what keeps BI afloat?</p>
<h2 id="the-primitives-of-a-dashboard-and-bi">The Primitives of a Dashboard (and BI)</h2>
<p>BI was and is never (only) about dashboards. BI is used for a combination of reasons. One of the most important is the <strong>metrics</strong> themselves, the business KPIs: defining your profit, defining what has to be deducted, how much taxes and salary payments are. All of it is ingrained in a single metric, or better, all of it including the hierarchy, the full <a href="https://www.youtube.com/watch?v=Dbr8jmtfZ7Q" target="_blank" rel="noopener noreffer">tree of metrics</a>, as it&rsquo;s never just one metric.</p>
<p>But the metrics tree with all its hierarchy is not all. We need <strong>speed to crunch</strong> the data extracted from the source SAP and CRM, cleansed and joined, and aggregated to the exact grain of the metric. No one has a single place to view their data end-to-end, which is what makes BI work so necessary: integrating all sources into one API or database/warehouse for the business (or now an AI chat) to pull from. Queries over CSV are easy. Fast queries over 1 TB of Parquet are hard.</p>
<p>This is where we need optimal data modeling and compute power to do it in seconds, best case sub-second. And when we have this, we need a good data architecture with the right tool to compress and compute that data. That&rsquo;s where we need some kind of <strong>OLAP</strong> or analytical database, optimized for analytical queries and doing aggregation on the fly based on different dimensions and grains.</p>
<p>All of these primitives are still needed, maybe even more so in times of generative AI, even if dashboards were to go away.</p>
<p>Lastly, all of this is under the umbrella of <strong>context</strong> and <strong>semantics</strong>. It&rsquo;s encoding the business-related processes into data-like artefacts to make reusable governed definitions of metrics like revenue, MAU (Monthly Active Users), ROAS (Return on Ad Spend), etc. And the right medium for that is metrics and a so-called metrics layer (or semantic layer).</p>
<h2 id="everyone-wants-to-build-dashboards-nobody-wants-to-maintain-them">Everyone Wants to Build Dashboards. Nobody Wants to Maintain Them.</h2>
<p>The elephant in the room that nobody talks about is that <strong>everyone wants to build. Nobody wants to maintain</strong>. This is the current AI phase we&rsquo;re in. Even more, nobody even wants to review tons of AI-generated code.</p>
<p>According to the story in the book <a href="https://www.amazon.com/Maintenance-Everything-Part-One/dp/1953953492" target="_blank" rel="noopener noreffer">Maintenance of Everything</a> by Stewart Brand, what makes people do maintenance is when they <strong>find joy</strong> in it and care for it.</p>
<p>This is well illustrated in the book&rsquo;s opening with the 1968 Golden Globe solo sailboat race, a dramatic contest of maintenance styles under life-or-death conditions, contrasting Robin Knox-Johnston&rsquo;s meticulous upkeep with Donald Crowhurst&rsquo;s neglect. Knox-Johnston, who truly <strong>enjoyed</strong> his boat <em>Suhaili</em>, reported mid-race that despite the brutal Southern Ocean ordeal, he was &ldquo;thoroughly enjoying himself&rdquo;. Decades later, he personally restored her, replacing every one of her 1,400 fastenings.</p>
<p>Stewart Brand also argues that you need to <strong>care</strong> about your product, and that&rsquo;s the key to good maintenance — a lesson drawn from <a href="https://en.wikipedia.org/wiki/Zen_and_the_Art_of_Motorcycle_Maintenance" target="_blank" rel="noopener noreffer">Zen and the Art of Motorcycle Maintenance</a> by Robert Pirsig (1974), a philosophical classic celebrated as bending the culture of the day toward honoring maintenance.</p>
<p>And one way to ease maintenance is a <strong>maintenance-friendly design</strong>, like the <a href="https://en.wikipedia.org/wiki/Rolls-Royce_Silver_Ghost" target="_blank" rel="noopener noreffer">Rolls-Royce Silver Ghost</a>, which was engineered from the ground up for reliability and ease of upkeep.</p>
<p>So how do we apply these principles to BI? We need to create dashboards that are easy to maintain and have clear ownership, someone who cares. Otherwise, they lose long-term value, and we invest precious time in something short-term.</p>
<blockquote>
<p>[!quote] Quote by <a href="https://www.linkedin.com/posts/mehd-io_you-just-burned-1k-in-tokens-to-rebuild-share-7439331001383800832-ZVhN?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAABkA2pgBYM4xDO0z2ChYuxFhBfu4h7jp4Lo" target="_blank" rel="noopener noreffer">Mehdi</a><br>
You just burned $1k in tokens to rebuild the SaaS you refused to pay $29/month for.<br>
Great for the ego. Now maintain it for 3 years to break even.<br>
Yup, maintenance is a luxury these days.</p>
</blockquote>
<p>Generating hundreds of cool but unused dashboards with AI is clearly working in the opposite direction. We can save a couple of bucks while no longer needing to maintain each custom-created dashboard. Indeed, <strong>maintenance is a luxury</strong> these days.</p>
<h4 id="self-maintaining-agents">Self-Maintaining Agents?</h4>
<p>The question is: do we need agents for the maintenance, then? That way we could create new, innovative solutions, do the creative things, and keep the hard parts of BI, such as gathering requirements and verifying with the business, for us humans, while agents do the maintenance. But what does maintenance even mean? What does <strong>changing the oil</strong> and checking the brakes mean for BI, or data pipelines?</p>
<p>It&rsquo;s not troubleshooting in case of an error. That&rsquo;s what agents already do and help a lot with, ideally fixing the errors themselves in a self-healing process. Maintenance is different: <strong>keeping up with the updates</strong> of your software, security, and the right integration into your data platform with upgraded, better-performing glue code, so that your code doesn&rsquo;t become legacy code.</p>
<p>All of it means maintenance, and outweighs the work of just creating the data pipeline or BI dashboard by far, probably somewhere on the order of 8:2.</p>
<p>As Mike put it well when we were discussing this:</p>
<blockquote>
<p>Few can build a <strong>maintainable</strong>, scalable data infrastructure for surfacing trusted metrics. E.g. a digital platform like Coinbase isn&rsquo;t going to YOLO its internal reporting over billions of transactions. Even Claude has a usage-based billing portal, consumption metrics need to be precise, deterministic, and fast.</p>
</blockquote>
<p>The Basecamp owner similarly <a href="https://youtu.be/otvGsbeOdfc?si=2p9X7ILxJSsuJzpA&amp;t=950" target="_blank" rel="noopener noreffer">said</a>: AI is pushing back too little. Maybe it will be solved in the future with better models, but we need to live in the present, and in that present:</p>
<blockquote>
<p><strong>Agents don&rsquo;t finish beautiful, ergonomic, desirable software.</strong> They just don&rsquo;t. That human finishing at the end is not just necessary, it&rsquo;s essential.</p>
</blockquote>
<p>So the future is soon, but not yet. Back to dashboards and their maintenance: the hard part is not generating visualizations, but having metrics and a strong BI backend. What&rsquo;s needed is almost a unified data interface, with an agent that has access to sources, ETL, and dashboards.</p>
<h2 id="bi-as-code-one-solution-to-maintenance">BI-as-Code: One Solution to Maintenance?</h2>
<p>Is the solution to maintenance-friendly design maybe BI-as-Code?</p>
<p>BI-as-Code comes into play because declarative configs can be versioned and maintained, thereby avoiding the limitations of BI-as-clicks. Sure, it will not solve all problems, but having that descriptive state of (a) your data infrastructure and (b) your data pipelines and BI dashboards helps tremendously. In the event of an error or incorrect state, we can just roll back to the last versioned dashboard or infrastructure.</p>
<p>The only thing that&rsquo;s hard to make reversible is the actual data, unless you use some kind of <a href="https://www.ssp.sh/blog/git-for-data-theory/" target="_blank" rel="noopener noreffer">Git for Data</a> workflow with LakeFS, Nessie, or others, or just use open table formats with their time travel function.</p>
<p>BI-as-Code isn&rsquo;t the whole answer, but it&rsquo;s the right direction: making dashboards owned, <strong>versioned, and recoverable</strong>. Code can build the right level of <strong>abstractions</strong> for ETL, metrics queries (metrics SQL), and visualizations, where raw Python for ETL or D3 is too verbose and too brittle.</p>
<p>With agents, these abstractions come in handy once more. Agents work best with a clear interface like a CLI or API, and the abstraction helps build just that, letting them tune things themselves through MCP or direct access to the declarative configurations. This is much of what <a href="https://www.rilldata.com/blog/bi-as-code-and-the-new-era-of-genbi" target="_blank" rel="noopener noreffer">GenBI</a> is all about. The question is what comes next: can agents take over the analyst role entirely, or how do we marry the two?</p>
<h3 id="building-bi-for-agents-not-humans">Building BI for Agents, Not Humans</h3>
<p>BI-as-Code allows agents to drive BI, or as Mike <a href="https://www.linkedin.com/posts/medriscoll_observability-cybersecurity-product-analytics-share-7444397754635968512-nZsM" target="_blank" rel="noopener noreffer">said</a>: &ldquo;AI drives <strong>compression of the data stack</strong>&rdquo;, meaning observability, cybersecurity, product analytics, and BI are converging. A CEO recently asked him why he couldn&rsquo;t &ldquo;kill Tableau, Looker, DataDog, Grafana, and QuickSight&rdquo; in favor of a single system. In my opinion, it doesn&rsquo;t need to be a single tool, but it should feel like a single interface.</p>
<p>Most common today: a chat prompt that autonomously spawns ingestion, transforms data into marts, and surfaces a dashboard or web app, <strong>running end-to-end analytics</strong> without the user ever thinking about the layers underneath.</p>
<p>But building faster with AI won&rsquo;t get us there on its own. <strong><a href="https://en.wikipedia.org/wiki/Amdahl%27s_law" target="_blank" rel="noopener noreffer">Amdahl&rsquo;s Law</a> still applies</strong>, as Jeff Dean rightly said in his <a href="https://www.youtube.com/watch?v=g8BuAtM3fp4" target="_blank" rel="noopener noreffer">talk</a> with NVIDIA&rsquo;s Bill Dally:</p>
<blockquote>
<p>An AI agent can run 50x faster, but if the tools it depends on were designed for human speed — slow query APIs, brittle CLIs, unversioned metrics — the overall gain collapses to 2-3x.</p>
</blockquote>
<p>That&rsquo;s why the primitives behind BI get <em>more</em> important as agents get faster, not less. OLAP needs to be faster, metrics need a reliable API, ETL needs to be composable. The bottleneck shifts from the model to the infrastructure it runs on.</p>
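<p>To make the Amdahl&rsquo;s Law point concrete, here is a minimal sketch. The formula is the standard one; the split of wall-clock time between the agent and its tools is a hypothetical assumption chosen for illustration, not a measurement from the talk.</p>

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only part of the work gets `factor` times faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# Hypothetical splits: share of wall-clock time spent in the agent itself,
# with the remainder spent waiting on tools that stay at human speed.
for share in (0.5, 0.6, 0.7):
    overall = amdahl_speedup(share, 50)
    print(f"{share:.0%} agent time: {overall:.2f}x overall")
```

<p>With a 50x faster agent, the overall gain only lands in the 2-3x range when roughly half to two thirds of the time was agent-bound to begin with. The rest is tooling, which is exactly the part that has to get faster.</p>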
<p>And when agents do take over the analyst workflow, spawning parallel queries, discarding dead ends, surfacing the interesting slices, they&rsquo;ll expose something BI practitioners have known for years: <strong>the hard part was never the visualization</strong>.</p>
<p>It was always the semantics beneath — the governed metrics, the trusted definitions, and configs that were verified by an actual human being. Agents will just make that gap impossible to ignore.</p>
<h2 id="bi-primitives-are-infrastructure-for-ai">BI Primitives Are Infrastructure for AI</h2>
<p>Wrapping up, BI was never about dashboards. It was about making sense of a company&rsquo;s data, connecting the source data into something a human can understand, efficiently reusing existing metrics, and governing definitions.</p>
<p>The dashboard was just the visible surface. What survives the hype cycle, from the unbundling of the modern data stack to the rise of AI agents, are the primitives: <strong>metrics, semantics, ownership, trust</strong>.</p>
<p>The AI era doesn&rsquo;t kill that need. Agents hallucinate without a strong foundational semantic layer or verified human constraints. Non-deterministic chat interfaces collapse without business-wide, agreed-upon definitions. The maintenance problem doesn&rsquo;t disappear either when you generate faster. It compounds the problems and bottlenecks for senior engineers at a company.</p>
<p>BI-as-Code, versioned dashboards, and a governed interface aren&rsquo;t opposed to the AI future, but a necessary foundation that makes working with it easier, not only for AI systems but also for humans in the loop.</p>
<p>&ndash;</p>
<p>If you enjoyed this, here are some related reads that might interest you:</p>
<ul>
<li><a href="/blog/agentic-data-modeling/" rel="">Data Modeling for the Agentic Era: Semantics, Speed, and Stewardship</a></li>
<li><a href="/blog/self-service-bi-ai/" rel="">Has Self-Serve BI Finally Arrived Thanks to AI?</a></li>
<li><a href="/blog/bi-as-code-and-genbi/" rel="">BI-as-Code and the New Era of GenBI</a></li>
</ul>
<hr>
<pre class=""><em>Full article published at <a href="https://www.rilldata.com/blog/ai-reveals-why-bi-still-matters-hint-its-not-dashboards" target="_blank" rel="noopener noreferrer">Rilldata.com</a> - written as part of <a href="/services">my services</a></em></pre>
]]></description>
</item>
<item>
    <title>Specs Over Vibes: Consistent AI Results ft. Mark Freeman</title>
    <link>https://www.ssp.sh/blog/specs-over-vibes-interview-mark-freeman/</link>
    <pubDate>Wed, 08 Apr 2026 00:08:08 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/specs-over-vibes-interview-mark-freeman/</guid><enclosure url="https://www.ssp.sh/blog/specs-over-vibes-interview-mark-freeman/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>There&rsquo;s so much going on in the AI space, and how to work with AI agents is changing every day. Everyone is overwhelmed and almost numb from so many possibilities, yet you need to find a way to work with AI, not to get left behind, right?</p>
<p>You might use AI agents all day long, parallelizing them with AI orchestrators like Agent Teams, Gastown, tmux, git worktree, and AI-based IDEs, but in the end, you just coordinated an AI. You still have to learn what it created, understand it, check for hallucinations, and verify that it built the right thing. We&rsquo;ve all become senior reviewers, more exhausted than before, with less of the work that made this fun in the first place. Meanwhile, we are also more distracted than ever. No time to think, with Copilot, Grammarly, or something else constantly asking and suggesting.</p>
<p>This series interviews real practitioners to give you the best tips on how they use AI in their data work today, extracting as many patterns behind them as possible. The article is structured in four parts: <strong>(1)</strong> how Mark is using AI, <strong>(2)</strong> what he has learned working with it, <strong>(3)</strong> what he is specifically using it for, and <strong>(4)</strong> what he thinks about AI in general and the future.</p>
<h2 id="introducing-the-guest-1-mark-freeman">Introducing the Guest: #1 Mark Freeman</h2>
<p>Kicking off this series is none other than <a href="https://www.linkedin.com/in/mafreeman2/" target="_blank" rel="noopener noreffer">Mark Freeman</a>. He is currently the Head of DevRel, Employee 1 and GTM at Gable. Mark has gone through three career roles, as clinical researcher, data scientist, and data engineer, which helps him greatly in his current position to navigate the unknowns of generative AI. We&rsquo;ll go into more detail later.</p>
<p>Mark has also co-authored a book with O&rsquo;Reilly about <a href="https://www.amazon.com/Data-Contracts-Developing-Production-Grade-Pipelines/dp/109815763X" target="_blank" rel="noopener noreffer">Data Contracts</a> (with Chad Sanderson and B.E. Schmidt), and is helping build Gable with the best possible data flows and data quality for enterprises.</p>
<blockquote>
<p>[!abstract] TLDR</p>
<p>To set the stage, in this interview we talk about how to use a Spec-Driven Development workflow with Claude Code and agent teams to produce high-quality, reproducible outcomes. We cover Mark&rsquo;s use of Excalidraw diagrams and JSON schemas to define requirements upfront, how he parallelizes agents with tmux to compare outputs, why AI benefits senior engineers more than juniors, and where he sees data engineering heading.</p>
</blockquote>
<h2 id="how-marks-using-ai">How Mark&rsquo;s Using AI</h2>
<p>Let&rsquo;s start with the general setup Mark uses when working with AI, and how he uses generative AI.</p>
<h3 id="how-mark-changed-his-ai-workflows">How Mark Changed His AI Workflows</h3>
<p>I asked him: &ldquo;Since you&rsquo;re building a company in the data contract and quality space and have written a book, how has working with AI changed how you use AI at work?&rdquo;</p>
<p>Mark has been in the data space since 2018 as a clinical research analyst and a data scientist since 2019. In 2022 he shifted over to data engineering, and in 2023 joined Gable to solve the problem of applying data contracts. He was very early in NLP with the <a href="https://web.archive.org/web/20211024133146/https://humu.com/blog/gain-clarity-and-context-about-what-matters-most-for-your-teams" target="_blank" rel="noopener noreffer">major ML project</a> he worked on back in 2021.</p>
<p>He remembers the early days in 2023, when ChatGPT hallucinated and when he used generative AI for the first time. Back then it was very much a chat-window <em>co-coding companion</em> that he asked architecture questions and general questions about the code at hand. Fast forward to <strong>2024 and 2025</strong>, and he was generating more code, though not full programs and projects, rather <em>by function</em>, trying to narrow down the scope.</p>
<p>And then in late 2025, <strong>Claude Code</strong> came around, and <em>changed the game</em> with better models that could autonomously solve problems for a longer period. And at the same time, everyone provided more CLIs to empower the CLI-first workflow of Claude. Mark started building by giving it instructions, pointers to docs, schema, etc., and letting it independently build data-related work and go fully agentic.</p>
<h3 id="marks-spec-driven-workflow">Mark&rsquo;s Spec Driven Workflow</h3>
<p>Mark has figured out an approach that works very well and gives him reproducible outcomes. He focuses not on solutions but on how the tool works, relentlessly speccing and defining what he wants with the <a href="https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html" target="_blank" rel="noopener noreffer">Spec-Driven Development (SDD)</a> approach, inspired by <a href="https://substack.com/home/post/p-187866704" target="_blank" rel="noopener noreffer">Esco Obong</a> and how he used it at Airbnb. He uses the GitHub-provided <a href="https://github.com/github/spec-kit" target="_blank" rel="noopener noreffer">spec-kit</a>, which is a toolkit to help you get started with Spec-Driven Development.</p>
<p>I hadn&rsquo;t heard of it. When I checked it out, I found it super well documented, and it integrates 1:1 into Claude Code (and many other AI agents). That means you can use slash commands within Claude to define specs with the help of an existing git repo, including its docs and code. The commands include:</p>
<ul>
<li><code>/speckit.plan</code>: Execute the implementation planning workflow using the plan template to generate design artifacts.</li>
<li><code>/speckit.tasks</code>: Generate an actionable, dependency-ordered tasks.md for the feature based on available design artifacts.</li>
<li><code>/speckit.specify</code>: Create or update the feature specification from a natural language feature description.</li>
<li><code>/speckit.analyze</code>: Perform a non-destructive cross-artifact consistency and quality analysis across spec.md, plan.md, and tasks.md after task generation.</li>
<li><code>/speckit.clarify</code>: Identify underspecified areas in the current feature spec by asking up to 5 highly targeted clarification questions and encoding answers back into the spec.</li>
<li><code>/speckit.checklist</code>: Generate a custom checklist for the current feature based on user requirements.</li>
</ul>
<p>You can define these on a per-project basis, or have some of them defined as a general spec in your <code>~/.claude/</code> folder. The outcomes are Markdown files that hold dedicated specifications based on your goals, which can then be further edited and updated as you iterate.</p>
<h3 id="working-product-focused">Working Product-Focused</h3>
<p>This helps Mark focus on product scenarios and <strong>predictable outcomes</strong> instead of vibe coding every piece and describing his principles from scratch each time, he continues.</p>
<p>He goes from ideation through specs to dedicated tasks. He likes to always start with an <a href="https://excalidraw.com/" target="_blank" rel="noopener noreffer">ExcaliDraw</a> diagram that captures the flow rather than the architecture or other overviews. For data schema and interface definitions, he defines the data structure next to the relevant flow diagram. Because <a href="https://blog.mehdio.com/i/160121474/best-human-feedback-loop-with-excalidraw-and-cursor" target="_blank" rel="noopener noreffer">ExcaliDraw is JSON</a>, these can be easily integrated. The schema definitions describe accurately what&rsquo;s needed based on stakeholder discussions and his own requirements.</p>
<p>He then passes that diagram back to Claude Code and iterates on the data model and his key assumptions. Mark takes a lot of time in this process. He will spend hours, days or even weeks in this stage, updating and refining these specs, giving clear and exact information about the data schema, the tools to use, and the architectural choices that, as a senior engineer, he knows he wants and needs.</p>
<p>This is also where years of experience make the difference.</p>
<h3 id="using-typescript-for-data-schema-enforcement">Using TypeScript for Data Schema Enforcement</h3>
<p>An interesting discovery Mark made is that he started using a programming language new to him, TypeScript. This is similar to Wes McKinney&rsquo;s <a href="https://wesmckinney.com/blog/agent-ergonomics/" target="_blank" rel="noopener noreffer">From Human Ergonomics to Agent Ergonomics</a>, where he states that &ldquo;Python Was Built for Humans, Not Agents&rdquo; and explains that he uses Go and Rust for agent work, as they are better languages for agents with minimal dependencies and shorter/better types.</p>
<p>Mark ended up using lots of TypeScript, not because he was familiar with the language, but because it covers most of what his work, and that of a data engineer, requires: <strong>defining data types</strong>. Enforcing them lets him quickly verify across the data pipeline that errors are caught before pipeline runtime, saving a lot of time and raising the quality.</p>
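<p>To make the idea concrete, here is a minimal sketch of what type enforcement at a pipeline boundary could look like. The <code>OrderEvent</code> shape, its field names, and the helper functions are hypothetical and only serve as illustration; they are not Mark&rsquo;s or Gable&rsquo;s actual schema.</p>

```typescript
// Hypothetical record shape for illustration only.
interface OrderEvent {
  orderId: string;
  amountCents: number;
  currency: "USD" | "EUR";
  createdAt: string; // ISO 8601 timestamp
}

// Runtime guard for data that enters the pipeline untyped (e.g. parsed JSON).
function isOrderEvent(value: unknown): value is OrderEvent {
  if (typeof value !== "object" || value === null) return false;
  const v = value as { [key: string]: unknown };
  return (
    typeof v.orderId === "string" &&
    typeof v.amountCents === "number" &&
    Number.isInteger(v.amountCents) &&
    (v.currency === "USD" || v.currency === "EUR") &&
    typeof v.createdAt === "string" &&
    !Number.isNaN(Date.parse(v.createdAt))
  );
}

// Downstream code only accepts the typed shape, so a wrong field name or type
// is flagged by the compiler instead of failing at pipeline runtime.
function totalCents(events: OrderEvent[]): number {
  return events.reduce((sum, e) => sum + e.amountCents, 0);
}

const raw: unknown[] = [
  { orderId: "a-1", amountCents: 1250, currency: "USD", createdAt: "2026-01-05T10:00:00Z" },
  // amountCents is a string here, so the guard rejects this record.
  { orderId: "a-2", amountCents: "12.50", currency: "USD", createdAt: "2026-01-05T10:05:00Z" },
];

const valid = raw.filter(isOrderEvent);
const rejected = raw.length - valid.length;
console.log(totalCents(valid), rejected);
```

<p>The guard handles untyped input once at the edge. Everything after it works with the declared type, which is where the compiler can catch mistakes before anything runs.</p>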
<h2 id="what-mark-has-learned-working-with-ai">What Mark Has Learned Working with AI</h2>
<p>Over the years, Mark has changed his workflow. In this part, he shows how he runs AI agents with tmux and how he reviews and checks the outcome.</p>
<h3 id="agent-parallelization-and-executing-them-teams-and-tmux">Agent Parallelization and Executing Them: Teams and Tmux</h3>
<p>Once the specs are done, he uses agents to implement them, relying on a Claude Code feature called <strong><a href="https://code.claude.com/docs/en/agent-teams#orchestrate-teams-of-claude-code-sessions" target="_blank" rel="noopener noreffer">Agent Teams</a></strong> (which can be activated in Claude <code>settings.json</code> with <code>CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS</code>).</p>
<p>The cool thing about agent teams is that they let you coordinate multiple Claude Code instances working together. One session acts as the team lead, coordinating work, assigning tasks, and synthesizing results. Teammates work independently, each in its own context window, and communicate directly with each other.</p>
<p>Mark spawns multiple agents using iTerm2 and tmux, which I highly recommend for agent work (also check <a href="https://zellij.dev/" target="_blank" rel="noopener noreffer">Zellij</a> for an easier start), and the agent teams feature will automatically open the additional terminals in separate panes:</p>
<figure>
<a target="_blank" href="/blog/specs-over-vibes-interview-mark-freeman/claude-tmux-teams.png" title="/blog/specs-over-vibes-interview-mark-freeman/claude-tmux-teams.png">

</a><figcaption class="image-caption">Example from <a href="https://x.com/nummanali/status/2031477259689754734" target="_blank" rel="noopener noreffer">X</a></figcaption>
</figure>
<p>It shows Claude self-orchestrating its own team. Think of it as similar to <a href="https://github.com/steveyegge/gastown" target="_blank" rel="noopener noreffer">Gastown</a>, <a href="https://github.com/preset-io/agor" target="_blank" rel="noopener noreffer">Agor</a>, and other <a href="https://www.ssp.sh/brain/ai-orchestrators/" target="_blank" rel="noopener noreffer">AI orchestrators</a>, but integrated into Claude.</p>
<p>Mark&rsquo;s workflow with agent teams is deliberately outcome-focused rather than code-focused. Once the agents complete their run, he checks the result against the original specs and JSON schemas, not the code itself. The only thing that matters is whether the outcome does what was defined.</p>
<h3 id="is-reviewing-code-still-needed">Is Reviewing Code Still Needed?</h3>
<p>The tough question was whether Mark still reviews code, especially when Claude can generate more of it in a minute than we can ever review. Mark said: &ldquo;<em>Not locally or on unimportant projects where I&rsquo;m exploring the limits and potential traps of these powerful tools.</em>&rdquo;</p>
<p>But for production pipelines or when customers asked him specifically for his opinion, he said:</p>
<blockquote>
<p>Along with the wider industry, we are figuring out how to use AI safely at scale.</p>
</blockquote>
<p>The same holds at work: for mission-critical services, such as in a bank, you can&rsquo;t just vibe code something. It <strong>comes down to use cases</strong>, he said.</p>
<p>Besides use cases, he tried different ways of reviewing. First he tried a sophisticated process where the above agents would create PRs and he would then comment on these with improvements and changes. The agents would then read them and integrate the given feedback and continue the process. But even that workflow made him too much of a bottleneck. It wasn&rsquo;t scalable enough.</p>
<p>Mark searched for other ways to work with it.</p>
<h4 id="outcome-driven-reviews-and-starting-from-scratch-again">Outcome-Driven Reviews: And Starting from Scratch Again</h4>
<p>What he does now is assess outcomes instead. After all the rigorous time in speccing, he tests the result by running the pipeline, creating tests, or checking the code manually the old-fashioned way.</p>
<p>The key mindset shift here is that the first build is deliberately treated as throwaway. It&rsquo;s requirements exploration via building. You implement the spec once, learn what you got wrong, and expect to discard it.</p>
<p>That&rsquo;s why he tests the outcome. Once tested, he may have learned things he could only have learned by implementing or by running actual tests. He then feeds these learnings back into the specs, updates the initial requirements, and <strong>starts all over again</strong>, from scratch, letting the agent create a new outcome based on the updated specs. The cycle is: <code>spec → build → assess → improve spec/assumptions → repeat</code>.</p>
<p>This gives him a deep and exact iteration loop that is almost deterministic: he can re-run the agents with updated feedback and requirements and get the same or a similar outcome, plus the added updates. That works because of the structure <em>spec-kit</em> delivers and the dedicated way he defines his requirements, which leaves less room for the agents to hallucinate different inputs end-to-end.</p>
<p>Though hallucinations can still happen, this approach has served him very well, with a high-quality output he can trust, and a quality-focused way to <strong>approach a complex problem</strong> with the help of agents.</p>
<p>If the outcome meets the quality he expected and it does what he wants, he goes to internal stakeholders to get feedback from them. And then the same process again, updating specs, fixing requirements errors or possible wrong assumptions, and off the agents go again.</p>
<h4 id="tests-and-quality-gates">Tests and Quality Gates</h4>
<p>Tests and QA he writes manually. This is another way to make sure the outcome meets his expectations. Most important is the value, he says:</p>
<blockquote>
<p>Value first, then outcome and then worry about other things</p>
</blockquote>
<p>If it&rsquo;s not turning out to be valuable to the stakeholders, he wants to avoid spending more time. That&rsquo;s why the agent iterations and building something &ldquo;quickly&rdquo;, with rigorous specs and definitions in place, have worked well for him so far.</p>
<h3 id="senior-vs-junior-working-with-ai">Senior vs. Junior: Working with AI</h3>
<p>We move on to an interesting discussion of whether AI helps senior engineers or juniors more. Mark says (he also <a href="https://www.linkedin.com/posts/mafreeman2_the-main-reason-ai-agents-help-senior-developers-activity-7437907260837777408-dMk5?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAABkA2pgBYM4xDO0z2ChYuxFhBfu4h7jp4Lo" target="_blank" rel="noopener noreffer">wrote</a> about it) that <strong>AI helps more senior engineers</strong>, as seniors &ldquo;<em>understand the trade-offs of tech debt</em>&rdquo;.</p>
<p>He says further that in AI iterations, we move much faster, generating legacy code and architecture constructs in days and weeks, instead of years. If Mark iterates with the spec-driven design explained above, there are multiple different architectures generated, some of which might have been bad from the very beginning.</p>
<p>As a senior, he thinks that we can give the right guidance from the very beginning and exclude bad outcomes and early &ldquo;legacy code&rdquo;. No doubt, there will be code and architecture to be adapted, too, but if you <strong>lack experience</strong>, you basically have <strong>no chance of knowing</strong>.</p>
<h4 id="framework-and-architectures-are-for-the-experienced">Framework and Architectures Are for the Experienced</h4>
<p>Mark mentions that at Gable, he is building something from scratch. Let&rsquo;s say we are at iteration v4: deep technical architecture decisions come up, such as whether to choose an Apache Kafka infrastructure, define your schema in JSON or Avro, or use Parquet.</p>
<p>These decisions can only be made with experience. Sure, agents will give you a good middle ground, and with research they will potentially choose the right solution for the current problem. But how do you know what&rsquo;s the <strong>best solution for your given business problem</strong>? If you have built multiple data platforms and have seen many companies, you just know some of these things or developed an intuition for what&rsquo;s needed.</p>
<p>In combination with the agents, it&rsquo;s just a much better tool for seniors than for juniors who need to more or less blindly trust the assessments the agents made. The quality of outcome depends on frameworks and architectural choices, accumulating legacy code early if a big architectural component is chosen wrong.</p>
<p>Put another way, this experience works like a linter in an editor that knows things ahead of runtime: it can detect wrong choices right away.</p>
<h2 id="what-mark-is-using-ai-for">What Mark is Using AI for</h2>
<p>Besides the already discussed use cases of general workflow and reviewing outcomes, I asked him about how he uses AI at work, working with data contracts and the non-deterministic outcome of AI, for example.</p>
<h3 id="integrating-ai-into-data-contracts">Integrating AI into Data Contracts</h3>
<p>As an author of a book on data contracts, and working in the business, one of Mark&rsquo;s priorities is to somewhat safely use AI agents to either verify contracts or help define them, if in any way possible.</p>
<p>As data contracts are written definitions between two parties, mostly written in YAML or JSON, they are a good medium to iterate on, where agents, humans, and all stakeholders can work on specs that can be versioned. Mark says his focus is on <strong><a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents" target="_blank" rel="noopener noreffer">evals</a></strong>, specifically for assessing how well an agent completes a specific task, built around Gable&rsquo;s products or internal workflows.</p>
<p>The main goal of evals is to know more <strong>confidently</strong> whether what AI shipped is any good. It is similar to stewardship in Master Data Management (MDM), where humans in the process verified whether data quality requirements were met. With AI generation, we need a similar process at a faster pace.</p>
<p>That&rsquo;s also where he draws on his clinical background with an outcome-driven approach, comparing 200 observations from end-to-end coding agent simulations and assessing results against defined criteria. At Gable, they create a <em>Code Graph</em> that helps them get a skeleton view of the <strong>full data flow in code</strong>, without running any code. Connections, context, and business operations are expressed as code to be verified.</p>
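<p>As a rough illustration of assessing observations against defined criteria, the sketch below computes a pass rate per criterion over 200 simulated runs. The <code>Observation</code> fields, the three criteria, and the synthetic data are all assumptions made up for this example; they do not reflect Gable&rsquo;s actual evals.</p>

```typescript
// Hypothetical shape of one end-to-end agent run; field names are assumptions.
interface Observation {
  runId: number;
  schemaValid: boolean;
  testsPassed: boolean;
  rowCountDelta: number; // difference to the expected row count
}

// A criterion is a named yes/no check on a single observation.
type Criterion = { name: string; check: (o: Observation) => boolean };

const criteria: Criterion[] = [
  { name: "schema-valid", check: (o) => o.schemaValid },
  { name: "tests-passed", check: (o) => o.testsPassed },
  { name: "row-count-exact", check: (o) => o.rowCountDelta === 0 },
];

// Share of observations that satisfy each criterion (0 when there are none).
function passRates(
  observations: Observation[],
  checks: Criterion[],
): { [name: string]: number } {
  const rates: { [name: string]: number } = {};
  for (const c of checks) {
    const passed = observations.filter(c.check).length;
    rates[c.name] = observations.length === 0 ? 0 : passed / observations.length;
  }
  return rates;
}

// Synthetic, deterministic stand-in for 200 recorded agent runs.
const observations: Observation[] = Array.from({ length: 200 }, (_, i) => ({
  runId: i,
  schemaValid: i % 10 !== 0, // every 10th run breaks the schema
  testsPassed: i % 4 !== 0, // every 4th run fails its tests
  rowCountDelta: i % 50 === 0 ? 3 : 0, // every 50th run is off by 3 rows
}));

const rates = passRates(observations, criteria);
console.log(rates);
```

<p>Keeping each criterion as a small named function makes it easy to add or tighten checks between iterations while comparing the same set of runs.</p>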
<p>His hypothesis is that with agents at scale, we can gather datasets of behaviors such as logs of data pipelines, network logs, and other information such as <a href="https://objectways.com/blog/understanding-how-ai-agent-trajectories-guide-agent-evaluation/" target="_blank" rel="noopener noreffer">agent trajectories</a> and check based on them whether the data pipeline is compliant, like <a href="https://www.parloa.com/labs/research/ai-agent-testing/" target="_blank" rel="noopener noreffer">A/B testing AI Agents with a Bayesian Model</a>. This has yet to be proven, but the hypothesis is strong.</p>
<h3 id="deterministic-and-non-deterministic-work-in-data-engineering">Deterministic and Non-deterministic Work in Data Engineering</h3>
<p>When asked about his thoughts on functional data engineering where usually jobs are reproducible and restartable with new logic/source data, and how he sees the <strong>determinism</strong> with AI work (which has a different outcome every time), he said something interesting.</p>
<p>He said <strong>non-determinism is a benefit</strong>. That&rsquo;s why the setup is specs written in markdown, combined with configs and JSON that are specific, providing precision and accuracy. If anything goes wrong or not according to plan in the generation phase, he can just change the specs and <strong>achieve this determinism</strong> by spec-driven development.</p>
<p>But there are still some disadvantages to running non-deterministically. That&rsquo;s why he still does tests and comparisons manually, and checks visually whether everything works when running the pipeline.</p>
<h2 id="what-mark-thinks-about-ai">What Mark Thinks about AI</h2>
<p>In this last part, we discuss when not to use AI, how learning works with AI, and where the future leads.</p>
<h3 id="when-not-to-use-ai">When <em>not</em> to Use AI</h3>
<p>Starting with when he is <em><strong>not</strong></em> using AI, and when it&rsquo;s potentially cheaper or better to do it manually, his answer was:</p>
<blockquote>
<p>Requirements finding in an important project, again depends on use cases. For small non-personal projects, not a problem. But requirements need to be defined by stakeholders and come from a real problem</p>
</blockquote>
<p>Also, Mark mentioned key decisions for infrastructure code that needs to be <strong>stable and reliable</strong>. Or if used, he will spend much more time validating that LLM suggestions are correct.</p>
<p>For content online, he noticed that the writing always comes off differently than he would have phrased it. He might give it his insights to check or get feedback, but not the actual writing part.</p>
<h3 id="how-do-you-see-learning-with-ai">How Do You See Learning with AI?</h3>
<p>There&rsquo;s also the danger of not learning new things, and getting overwhelmed with constant stimulation, potentially getting slightly addicted. I asked Mark if he sees a problem in using agents and LLMs that would prevent us from learning new things as we are just cruising on auto-pilot.</p>
<p>Yes, he agreed. He calls it: &ldquo;<em>Claude code slot machine</em>&rdquo;, or &ldquo;<em>Lab rat</em>&rdquo;. &ldquo;<em>Getting your dopamine hit beyond usefulness</em>&rdquo; is how he would phrase it. He also thinks that this addictive behavior doesn&rsquo;t exist randomly. He thinks it is intended for us, the users, to use and spend more tokens (ergo money for them).</p>
<blockquote>
<p>[!note] Pseudo Work</p>
<p>Shipping lots of code with AI can feel like deep work, but if you&rsquo;re not learning in the process, it&rsquo;s pseudo work. <a href="https://www.ft.com/content/a8016c64-63b7-458b-a371-e0e1c54a13fc" target="_blank" rel="noopener noreffer">Problem-solving skills in adults are already declining</a>, and even studies showing short-term learning gains with AI find that <a href="https://www.nature.com/articles/s41599-025-04787-y" target="_blank" rel="noopener noreffer">beyond 8 weeks, the effect reverses as over-reliance sets in</a>.</p>
</blockquote>
<h3 id="the-future-of-cloud-vs-local-model">The Future of Cloud vs. Local Model</h3>
<p>My closing question was where things are heading, and whether self-healing data pipelines would be a thing. When some <a href="https://substack.com/home/post/p-189793289" target="_blank" rel="noopener noreffer">say</a> that &ldquo;Unironically, Rick Rubin is the future of work&rdquo; (where AI replaced a team of analysts, a strategist, a designer, a project manager, and a few weeks of work in minutes), the same goes for data analytics and data engineering.</p>
<p>Mark explains that when he was a data scientist, getting a nice histogram in Matplotlib or Seaborn took hours. Today he gets that for free, as an afterthought. He has built applications that pull leads from HubSpot, extend and aggregate data through RAG using APIs and pipeline logs, and for a board meeting just generate a static HTML page (with an export to CSV 😉). A <strong>custom-made visualization at your fingertips. That&rsquo;s the future</strong>, he says. Because below the visualization, there&rsquo;s a <strong>semantic model</strong> as the base. No one wants to open one more app, so based on well-defined semantics, AI can one-shot the visualization and integrate into existing workflows.</p>
<p>On the local model side, another future he sees (and is exploring himself) involves models running on a dedicated machine while he&rsquo;s away. He said the future is not about how powerful the models are, but <strong>how many iterations</strong> your spec has gone through. You <strong>run them until they are correct</strong>. You can also use RAG techniques to augment the model with your own notes and <a href="https://code.claude.com/docs/en/skills" target="_blank" rel="noopener noreffer">skills</a>, one local model custom-made for you:</p>
<blockquote>
<p><strong>You can&rsquo;t compete on compute</strong>, but you can use the factor of time, iterating multiple versions for a specified problem, and choosing the best one. Exactly what clinical research is doing and what he learned in his early career comparing studies.</p>
</blockquote>
<p>An interesting bleeding-edge area is running agents optimized for <strong>concurrency</strong>, chunking tasks and parallelizing them with smaller compute resources instead of one big model. <a href="https://www.linkedin.com/in/goabiaryan/" target="_blank" rel="noopener noreffer">Abi Aryan</a> is doing GPU research in exactly that field, and Mark recommends starting with this <a href="https://www.linkedin.com/posts/goabiaryan_%F0%9D%90%88%F0%9D%90%AD-%F0%9D%90%9A%F0%9D%90%A7%F0%9D%90%A7%F0%9D%90%A8%F0%9D%90%B2%F0%9D%90%AC-%F0%9D%90%A6%F0%9D%90%9E-%F0%9D%90%AD%F0%9D%90%A8-%F0%9D%90%A7%F0%9D%90%A8-%F0%9D%90%9E%F0%9D%90%A7%F0%9D%90%9D-%F0%9D%90%B0%F0%9D%90%A1%F0%9D%90%9E%F0%9D%90%A7-activity-7441123708452294656-AP00" target="_blank" rel="noopener noreffer">post</a>. While companies are paying 10x or more for cloud compute, local models with lots of iterations are increasingly feasible, and the economics are starting to make a strong case for them.</p>
<h2 id="next-interview">Next Interview</h2>
<p>I hope you enjoyed this interview with Mark. Huge thanks to Mark for taking the time to speak with me and for sharing his experience with all of us. Follow him on <a href="https://www.linkedin.com/in/mafreeman2/" target="_blank" rel="noopener noreffer">LinkedIn</a>, take his <a href="https://www.linkedin.com/learning/instructors/mark-freeman" target="_blank" rel="noopener noreffer">course on data quality</a>, and check out his <a href="https://www.amazon.com/Data-Contracts-Developing-Production-Grade-Pipelines/dp/109815763X" target="_blank" rel="noopener noreffer">book</a>, its <a href="https://github.com/data-contract-book/chapter-7-implementing-data-contracts" target="_blank" rel="noopener noreffer">repo</a>, and much <a href="https://shift-left.gable.ai/m/mark-landing" target="_blank" rel="noopener noreffer">more</a>.</p>
<p>There are three more interviews already lined up with great guests, so please share feedback, questions you might want to ask or just your experience on how to work with AI in the data space. We&rsquo;re all in this together, figuring it all out. The more we can learn from each other, what&rsquo;s important, and maybe also what&rsquo;s not, the better.</p>
<p>So stay tuned for the next interview.</p>
<hr>
<pre class=""><em>Full article published at <a href="https://motherduck.com/blog/specs-over-vibes-consistent-ai-results/" target="_blank" rel="noopener noreferrer">MotherDuck.com</a> - written as part of <a href="/services">my services</a></em></pre>
]]></description>
</item>
<item>
    <title>Building an Agent-Friendly, Local-First Analytics Stack with MotherDuck and Rill</title>
    <link>https://www.ssp.sh/blog/agentic-friendly-local-first-analytics-stack/</link>
    <pubDate>Tue, 07 Apr 2026 08:41:06 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/agentic-friendly-local-first-analytics-stack/</guid><enclosure url="https://www.ssp.sh/blog/agentic-friendly-local-first-analytics-stack/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>Imagine going from a 100-million-row dataset to an interactive analytics app with just a few prompts. What used to take hours or days can now be done in minutes by combining local-first databases and BI tools with an agentic coding workflow.</p>
<p>When Rill bet on YAML dashboards and CLI-first workflows in 2022, they weren&rsquo;t thinking about AI agents. Neither was MotherDuck when they built serverless DuckDB around the thesis that most data fits on a laptop. But it turns out that what is developer-friendly is also agent-friendly: both need readable code, fast engines, and deterministic semantics.</p>
<p>Times are shifting rapidly toward CLI-first development. You know that&rsquo;s true when even email and calendar get their own Google CLIs. So why not have CLIs for your business metrics too?</p>
<p>This is what Rill and MotherDuck provide, including excellent developer workflows with a local-and-CLI-first approach, focusing on a developer-friendly interface and empowering users. Both work great on local laptops but can easily scale to the cloud, backed by a serverless data warehouse.</p>
<p>The convergence of embedded analytics engines (DuckDB/MotherDuck), declarative BI-as-code (Rill), and AI agent protocols (MCP) is creating a new architecture for business intelligence, one where dashboards become code, code becomes agent-readable, and analysts shift from clicking to prompting. And with 75% of cloud data warehouse queries scanning less than 1 GB<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>, this opportunity is great for agentic BI. In this article, we look at how we build agentic-friendly and local-first analytic stacks with MotherDuck, Rill, and agents.</p>
<blockquote>
<p>[!note] End to end examples later in the article</p>
<p>Later in the article we go through three different examples of how this can work, including GitHub repos and code examples, if that is something that&rsquo;s of interest to you.</p>
</blockquote>
<h2 id="why-these-two-tools">Why These Two Tools?</h2>
<p>Let&rsquo;s start with why we use MotherDuck and Rill for <strong>agentic-first data tasks</strong>. As <a href="https://www.linkedin.com/in/cg1507/" target="_blank" rel="noopener noreffer">Ghanshyam Chodavadiya</a> from <a href="https://swym.ai/" target="_blank" rel="noopener noreffer">SWYM</a> says:</p>
<blockquote>
<p>[!quote] Quote on why SWYM uses Rill with MotherDuck for their AI-native media decision platform:</p>
<p>[..] Rill lets us <strong>encode business context</strong> directly into our BI layer. Combined with MotherDuck and the Rill MCP client, it gives us <strong>flexible data control</strong> while powering automatically generated client dashboards and <strong>AI-driven insights</strong>.</p>
</blockquote>
<p>Both MotherDuck and Rill use a sophisticated architecture that focuses on developer workflows and scales from local development with declarative configuration to cloud (e.g. with <code>rill deploy</code>, or <code>md:</code> instead of <code>md</code>) or even embeds into your data CI/CD or agents pipeline, very easily. All of these reasons make them suitable for modern data requirements, where we need to iterate quickly but still have a strong foundation.</p>
<h3 id="local-first-approach-with-duckdb">Local-first Approach with DuckDB</h3>
<p>Both tools start from a local-first approach with DuckDB as the foundation.</p>
<p>For example, the <a href="https://www.inkandswitch.com/essay/local-first/" target="_blank" rel="noopener noreffer">Local-First principle</a>, explored by Ink &amp; Switch and its community, compares the strengths of local workflows. It was meant for files, but also applies to data workloads. Even more so with AI agents, which can read the context from these projects and easily extend them using the strong CLIs available on the command line.</p>
<p>Or take the <a href="https://motherduck.com/blog/small-data-manifesto/" target="_blank" rel="noopener noreffer">Small Data Manifesto</a> by MotherDuck, which says &ldquo;<strong>Think small, develop locally, ship joyfully</strong>&rdquo;. If you enjoy some of these principles, this data stack with DuckDB/MotherDuck as the warehouse or <a href="https://www.rilldata.com/blog/scaling-beyond-postgres-how-to-choose-a-real-time-analytical-database" target="_blank" rel="noopener noreffer">real-time analytics</a> storage and Rill as an interactive, fast, and beautiful BI tool will suit you well.</p>
<p>MotherDuck works seamlessly through the DuckDB CLI, whether it is to connect to their serverless database in the cloud (connect with <code>duckdb 'md:'</code>) or to open a fully fledged notebook environment locally with <code>duckdb --ui</code> (<a href="https://duckdb.org/docs/stable/core_extensions/ui" target="_blank" rel="noopener noreffer">try it</a>).</p>
<p>With Rill&rsquo;s YAML-based dashboard and metrics layer, and a powerful CLI, you can transform any of your data into a blazingly fast dashboard locally (run <code>rill start</code>), or from anywhere with the data on MotherDuck. Let&rsquo;s explore both in more detail, and show how users use it, and provide an example for you to get your hands on.</p>
<h2 id="what-is-motherduck">What Is MotherDuck?</h2>
<p>Before we go into the hands-on examples, let&rsquo;s answer what MotherDuck and Rill are, how they differ from DuckDB, and what they bring to the table.</p>
<p>In essence, MotherDuck is a DuckDB-powered cloud data warehouse that scales to terabytes with ease. Just as Turso hosts SQLite, MotherDuck hosts DuckDB in the cloud, serverless for you. MotherDuck has a <a href="https://youtu.be/xxCn7uhdDzw?si=RcBpqRAzZq0jiVHD&amp;t=215" target="_blank" rel="noopener noreffer">great relationship</a> with DuckDB Labs, the company behind DuckDB, and with the <a href="https://duckdb.org/foundation/" target="_blank" rel="noopener noreffer">DuckDB Foundation</a>.</p>
<p>MotherDuck integrates well with DuckDB, but you can also run DuckDB without it: manage your own server, open some ports, and make it scale automatically when more queries come in. You would then need to build the orchestration that scales out, handles out-of-memory (OOM) errors, manages servers, and so on. MotherDuck provides all of that by simply pointing your local database at the cloud with <code>ATTACH 'md:'</code>, instead of reading directly from a local DuckDB database (<code>duckdb path/to/file/db.duckdb</code>) or Parquet files (<code>FROM 'nyc.parquet'</code> or <code>FROM read_parquet('test.not-ending-with-parquet')</code>) that only you have access to.</p>
<p>The simplest approach is to upload your data to MotherDuck <strong>once, and then have access to the data from anywhere</strong> (see example later in the article).</p>
<blockquote>
<p>[!example] How to Upload Data to MotherDuck from Local Storage</p>
<p>There are several ways to synchronize a local DuckDB database to MotherDuck: <code>COPY FROM DATABASE</code>, <code>CREATE OR REPLACE DATABASE ... FROM '&lt;path&gt;'</code>, loading from Parquet files, or using Python scripts. You can find more details at <a href="https://github.com/sspaeti/sync-duckdb-to-motherduck" target="_blank" rel="noopener noreffer">sync-duckdb-to-motherduck</a>.</p>
</blockquote>
<p>Besides simply replacing local DuckDB with a data warehouse, MotherDuck has implemented a specific architecture called <a href="https://motherduck.com/docs/key-tasks/running-hybrid-queries/" target="_blank" rel="noopener noreffer">dual execution</a>. It&rsquo;s built on top of their novel 1.5-tier architecture, powered by <a href="https://duckdb.org/docs/stable/clients/wasm/overview" target="_blank" rel="noopener noreffer">WebAssembly (Wasm)</a>. Unlike the traditional 3-tier architecture, where every request travels between client and server, the 1.5-tier architecture can answer requests directly in the client (browser), reducing latency by avoiding server requests and network round trips.</p>
<figure>
<a target="_blank" href="/blog/agentic-friendly-local-first-analytics-stack/motherduck-architecture.png" title="/blog/agentic-friendly-local-first-analytics-stack/motherduck-architecture.png">

</a><figcaption class="image-caption">Image from <a href="https://motherduck.com/product/app-developers" target="_blank" rel="noopener noreffer">MotherDuck: For App Developers</a></figcaption>
</figure>
<p>Traditional applications are built on a 3-Tier Architecture, which requires several intermediary operations to run between the end user interface, server, and underlying database. MotherDuck’s <a href="https://motherduck.com/product/app-developers/#architecture" target="_blank" rel="noopener noreffer">1.5-tier architecture</a> has the same DuckDB engine running inside the user’s web browser and in the cloud.</p>
<p>Developers can move data closer to the user to create analytics experiences that run <a href="https://motherduck.com/blog/introducing-instant-sql/" target="_blank" rel="noopener noreffer">instantly</a>, while still scaling with MotherDuck as the backend. See their CIDR paper, <a href="https://www.cidrdb.org/cidr2024/papers/p46-atwal.pdf" target="_blank" rel="noopener noreffer">DuckDB in the cloud and in the client</a>, for how this works in detail.</p>
<h3 id="what-does-the-dual-execution-do">What Does the Dual Execution Do?</h3>
<p>Since the initial paper, dual execution has evolved and makes MotherDuck more than just &ldquo;DuckDB in the cloud&rdquo;. When you <code>ATTACH 'md:'</code> locally or on the web, you get a two-node distributed system that automatically routes query stages to wherever they run best.</p>
<p>An example using dbt: in your dbt <code>profiles.yml</code>, you can define <code>dev</code> with DuckDB and <code>prod</code> with MotherDuck like this:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">your-project</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">target</span><span class="p">:</span><span class="w"> </span><span class="l">prod</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">outputs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">dev</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">duckdb</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">schema</span><span class="p">:</span><span class="w"> </span><span class="l">project_dev</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;path/locally.duckdb&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">threads</span><span class="p">:</span><span class="w"> </span><span class="m">1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c"># ...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">prod</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">duckdb</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">schema</span><span class="p">:</span><span class="w"> </span><span class="l">project</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;md:prod_project&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">threads</span><span class="p">:</span><span class="w"> </span><span class="m">1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c"># ...</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Smaller data gets processed locally with millisecond response times and, if needed, the query can extend into the cloud using cross-environment joins that transfer only the necessary intermediate results. As a user, you won&rsquo;t notice a difference from using plain DuckDB.</p>
<p>This uses your laptop&rsquo;s power with DuckDB as a first-class compute node. As MotherDuck CEO Jordan Tigani put it: &ldquo;Laptops these days are extremely powerful and you can get answers in a handful of milliseconds, whereas if you had to ask a cloud service, the initial request wouldn&rsquo;t have even gotten there.&rdquo;</p>
<p>In a way, MotherDuck is a lightweight alternative to Spark for single-node or moderately sized analytical workloads: it does not support distributed, multi-node processing on massive datasets, but it&rsquo;s far easier to set up, needs no cluster management, and scales to terabytes. For tasks that don&rsquo;t need the massive scale Spark provides, you get an out-of-the-box data warehouse without the setup cost or the operational burden.</p>
<blockquote>
<p>[!example] One more advantage is multi-user collaboration</p>
<p>DuckDB is single-writer, and MotherDuck is what unlocks the &ldquo;multiplayer&rdquo; angle with its integrated notebooks (which are also available locally with <code>duckdb --ui</code>, but not shareable on the web).</p>
</blockquote>
<h2 id="why-rill">Why Rill?</h2>
<p>So what does Rill bring to the table, and why does it work so well with MotherDuck?</p>
<p>If MotherDuck gives you a data warehouse that feels like a local database, Rill <strong>gives you a BI tool that feels like a code editor</strong>, getting you up and running with a single binary. The name Rill comes from an old English word meaning &ldquo;stream&rdquo;, and the tool ships with strong templates that scaffold BI requirements in seconds. Both MotherDuck and Rill are built on the conviction that focusing on developer experience helps data teams implement great data solutions that not only work, but are fun.</p>
<p>Rill&rsquo;s core idea is simple: define your entire BI stack (data sources, SQL models, metrics such as <code>total_revenue: sum(amount)</code>, and dashboards) as YAML and SQL files in a Git repository. Install it with one command (<code>curl https://rill.sh | sh</code>), run <code>rill start</code>, and you have a local development environment backed by an embedded DuckDB instance delivering sub-second queries. Push to Git, deploy to Rill Cloud, and your dashboards are live. The same declarative files, the same SQL, the same metrics, just a different runtime.</p>
<p><div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/t7Igf3JTflc?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>
<br>
<em>A quick video showcasing Rill&rsquo;s BI-as-Code power and how easily Claude Code can collaborate with your BI tools.</em></p>
<p>Moreover, this &ldquo;BI-as-Code&rdquo; approach turns out to be exactly what makes Rill a natural companion for <a href="https://www.rilldata.com/blog/data-modeling-for-the-agentic-era-semantics-speed-and-stewardship" target="_blank" rel="noopener noreffer">agentic workflows</a>. Because all artifacts for data and BI are defined declaratively and locally, any agent can use them for context and build autonomously on top, while the user can still verify that everything is correct by quickly running the Rill CLI locally or in a CI/CD pipeline.</p>
<p>Rill embeds DuckDB under the hood. Connecting it to MotherDuck requires nothing more than a YAML connector config pointing at <code>md:my_database</code> with a token property such as <code>token: &quot;{{ .env.CONNECTOR_MOTHERDUCK_TOKEN }}&quot;</code>.</p>
<p>The SQL models are identical, meaning no syntax changes, no migration, no new query dialect to learn. The only thing that changes is where the data lives, but that has no effect on the user experience.</p>
<p>Rill was built around the idea of:</p>
<blockquote>
<p>That instead of hiding business logic in a proprietary GUI that only humans can click through, you make it readable code that anyone, or anything, can openly read and reason about.</p>
</blockquote>
<p>With its YAML dashboards and CLI-first workflow, Rill is well positioned for working with AI agents. That wasn&rsquo;t planned or foreseen, but it turns out that context is king and that tools designed for developer simplicity offer exactly what agents need: <strong>readable definitions and fast deterministic engines</strong>.</p>
<p>The result is a stack where your metrics are a source of truth you can version, audit, and feed directly into an agent&rsquo;s context window, and where switching from a local DuckDB file to a serverless cloud warehouse is a one-line change.</p>
<h3 id="metrics-and-sql">Metrics and SQL</h3>
<p>We probably all agree that metrics are the key to codifying business knowledge. Defining them as code gives us the benefits of a software design approach (versioned, automatable, testable, etc.) and lets an agent automate metrics, while we can still define complex metrics ourselves if needed. The result is an <strong>agentic-friendly environment</strong> where agents get concise context and collaborate with a domain expert as the human in the loop.</p>
<p>Rill provides an integrated <a href="https://www.rilldata.com/blog/why-you-need-a-sql-based-metrics-layer" target="_blank" rel="noopener noreffer">Metrics Layer</a>, and it is just a YAML file too. That means you can integrate it into notebooks or data apps easily, and within Rill you can build multiple dashboards, conversational analytics, and canvases on top of a <strong>unified metrics layer</strong>.</p>
<p>Instead of implementing complex metrics multiple times in different dashboards, we can just reference the metrics layer. Besides the metrics, we need fast responses, even more so when we let agents work autonomously. We can&rsquo;t wait 5-10 minutes for the answer to a <strong>simple question</strong> while agents finish their research.</p>
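<p>To make the idea of defining a metric once and referencing it everywhere concrete, here is a minimal Python sketch. It is not how Rill implements its metrics layer: the <code>METRICS</code> dictionary, the <code>orders</code> table, the <code>order_count</code> measure, and the <code>build_query</code> helper are hypothetical names chosen for illustration. Only the <code>total_revenue: sum(amount)</code> expression is taken from the example earlier in this article.</p>

```python
# Hypothetical metrics layer: each measure is defined once, by name.
METRICS = {
    "total_revenue": "sum(amount)",
    "order_count": "count(*)",
}


def build_query(table, dimension, measures):
    """Compile a GROUP BY query from measure names instead of repeating SQL."""
    unknown = [name for name in measures if name not in METRICS]
    if unknown:
        raise KeyError(f"undefined measures: {unknown}")
    select_list = [dimension] + [f"{METRICS[name]} AS {name}" for name in measures]
    return (
        f"SELECT {', '.join(select_list)} "
        f"FROM {table} GROUP BY {dimension}"
    )


if __name__ == "__main__":
    # Every dashboard (or agent) asks for the measure by name.
    print(build_query("orders", "country", ["total_revenue"]))
```

<p>If the definition of <code>total_revenue</code> changes, it changes in one place, and an agent that asks for an undefined measure gets an explicit error instead of silently inventing its own SQL.</p>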
<blockquote>
<p>[!info] Metrics SQL: A language built on top of SQL</p>
<p>Additionally, Rill offers a query language that builds on the strengths of SQL, called <a href="https://www.rilldata.com/blog/data-modeling-for-the-agentic-era-semantics-speed-and-stewardship#metrics-sql-a-sql-based-semantic-layer" target="_blank" rel="noopener noreffer">Metrics SQL</a>. It&rsquo;s a SQL dialect designed for querying data from <a href="https://docs.rilldata.com/developers/build/metrics-view/what-are-metrics-views" target="_blank" rel="noopener noreffer">Metrics Views</a>.</p>
<p>I wrote more about it in <a href="https://www.rilldata.com/blog/data-modeling-for-the-agentic-era-semantics-speed-and-stewardship#metrics-sql-a-sql-based-semantic-layer" target="_blank" rel="noopener noreffer">SQL-Based Semantic Layer</a>; in short, it simplifies your SQL by extending the language. You can also learn more about the philosophy from Mike, the CEO of Rill, in an <a href="https://www.youtube.com/watch?v=tEIQGgS4Zus" target="_blank" rel="noopener noreffer">insightful podcast</a> with Joe Reis, where they talk about the future of dashboards, agents, and navigating the new era of BI and analytics.</p>
</blockquote>
<h3 id="conversational-bi-rill-turns-dashboards-into-code-and-code-into-agent-readable-context">Conversational BI: Rill Turns Dashboards into Code (and Code into Agent-Readable Context)</h3>
<p>When looking at <a href="https://www.rilldata.com/blog/has-self-serve-bi-finally-arrived-thanks-to-ai" target="_blank" rel="noopener noreffer">Conversational BI</a> and its benefits, we can say that &ldquo;Conversations can generate code, and code generates insights&rdquo;. Code is the best abstraction, and with agents, we can easily make it available to non-programmers.</p>
<p><strong>Code is the abstraction layer</strong> in most cases. But why? Because a hard-coded interface language or API only lets you do what its designers anticipated. With code (usually Python, or SQL in this case too), we can do much more: we can use all the functions of the language rather than only what the API implements. Code is also easier to maintain and automate.</p>
<p>RudderStack <a href="https://www.rudderstack.com/blog/ai-data-infrastructure-as-code/" target="_blank" rel="noopener noreffer">reinforced</a> this narrative from the infrastructure side:</p>
<blockquote>
<p>Most of today’s commercial data tools are designed for humans, not for automation. Their primary interfaces are web dashboards, which are convenient for analysts, but opaque to code.</p>
</blockquote>
<p>This means if we want agent tools to analyze our code base, we need to let them access our code, or in the case of Rill, declaratively defined dashboards, metrics in the metrics layer, and data sources.</p>
<p>We can do even more when we use the chat interface to interact, making it usable for humans again by using <strong>natural language as the primary interface</strong>. That&rsquo;s where Rill offers an extensive integration with agent workflows, generating a dashboard based on existing sources, models, and defined metrics in the metrics layer:</p>
<figure><a target="_blank" href="/blog/agentic-friendly-local-first-analytics-stack/nyc-trips-ai.webp" title="">

</a><figcaption class="image-caption">Image of how easy it is to generate a dashboard based on an existing model. See also more at <a href="https://www.rilldata.com/blog/bi-as-code-and-the-new-era-of-genbi" target="_blank" rel="noopener noreffer">BI-as-Code and the New Era of GenBI</a></figcaption>
</figure>
<p>These features are also integrated as <strong>Conversational BI</strong>, letting you explore your business numbers through natural language and chat. It feels similar to the agentic suggestions you get for code in Cursor, but it refers to pre-defined metrics when you ask specific questions:</p>
<figure><a target="_blank" href="/blog/agentic-friendly-local-first-analytics-stack/conversational-bi.webp" title="">

</a><figcaption class="image-caption">Showcase of Conversational BI in Rill</figcaption>
</figure>
<p>Here&rsquo;s a full video of what you can do with it.</p>
<p>The chat interface provides charts that can be further explored or integrated into your dashboards. What I like most is that you can click on items in the responses (e.g. 1.) and they appear as a pivot table (2.), where you can dig into more details by adding dimensions and metrics:</p>
<figure><a target="_blank" href="/blog/agentic-friendly-local-first-analytics-stack/rill-conversational-charts.webp" title="">

</a><figcaption class="image-caption">Showcase of Conversational BI in Rill</figcaption>
</figure>
<p>Showcase of how interactive dashboards are created on the fly, to explore and open in a pivot table directly inside Rill. The links are clickable; see a short video of how that looks in action:<br>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/3RXwd-1o66Q?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>
</p>
<blockquote>
<p>[!note] You can also use the Claude integration with MCP</p>
<p>Rill&rsquo;s MCP can also be used with Claude Desktop, see <a href="https://youtu.be/ZmgVkKImxs8?si=ECh72nAtg1LcF6uy" target="_blank" rel="noopener noreffer">Showcase of Rill MCP with Claude Desktop - YouTube</a>, or with Cursor if that is your preferred AI-based IDE, see <a href="https://youtu.be/Th5Krj14DCI?si=fE04Q7F_1pbCd_AM" target="_blank" rel="noopener noreffer">BI-as-Code and the New Era of GenBI Demo - YouTube</a>.</p>
</blockquote>
<p>Let&rsquo;s now look at different analytics use cases and implementations with these handy features combined with MotherDuck as the backend.</p>
<h2 id="motherduck--rill-in-action-three-examples">MotherDuck + Rill in Action: Three Examples</h2>
<p>We&rsquo;ll start with how to connect Rill to MotherDuck, then walk through two open-source examples you can try yourself, and finish with a real-world customer showcase.</p>
<h3 id="connecting-rill-to-motherduck">Connecting Rill to MotherDuck</h3>
<p>Since Rill <a href="https://docs.rilldata.com/developers/build/connectors/olap/motherduck" target="_blank" rel="noopener noreffer">already embeds</a> DuckDB, connecting to MotherDuck requires only a five-line YAML connector config with a token and an <code>md:</code> path. Add a <code>motherduck.yaml</code> to your <code>connectors/</code> folder:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">connector</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">driver</span><span class="p">:</span><span class="w"> </span><span class="l">duckdb</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">mode</span><span class="p">:</span><span class="w"> </span><span class="l">readwrite</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">token</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;{{ .env.CONNECTOR_MOTHERDUCK_TOKEN }}&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;md:my_database&#34;</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Compare that to a local DuckDB connector:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">connector</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">driver</span><span class="p">:</span><span class="w"> </span><span class="l">duckdb</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">dsn</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;my_database.duckdb&#34;</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>The main difference is <code>token</code> + <code>path: &quot;md:...&quot;</code> instead of <code>dsn</code>, plus the explicit <code>mode</code>. Set the token as an environment variable (see <a href="https://docs.rilldata.com/developers/build/connectors/olap/motherduck" target="_blank" rel="noopener noreffer">Rill docs</a> for details), and your SQL models, metrics, and dashboards work identically, whether the data lives on your laptop or in MotherDuck&rsquo;s cloud.</p>
<p>In 2025, Rill <a href="https://www.rilldata.com/blog/rill-in-review-top-features-that-shaped-2025" target="_blank" rel="noopener noreffer">significantly strengthened its native connectivity</a> to enable zero-copy, blazingly fast analytics without moving data, making this connection even more seamless.</p>
<h3 id="stack-overflow-developer-survey-zero-pipeline-analytics">Stack Overflow Developer Survey: Zero-Pipeline Analytics</h3>
<p>The simplest way to experience MotherDuck + Rill is with data you already have. Every free MotherDuck account ships with <code>sample_data.stackoverflow_survey.survey_results</code><sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>  — 600k+ professional developer responses from 2019-2024. No ETL needed.</p>
<p>The <a href="https://github.com/sspaeti/motherduck-rill" target="_blank" rel="noopener noreffer">motherduck-rill</a> project builds a complete analytics stack on this data: 4 SQL models (staging, technology usage, developer profiles, database analysis), 3 metrics views with 17+ measures, and 3 canvas dashboards — all as pure SQL + YAML in a Git repo. No Python, no orchestrator, no data pipeline.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">git clone https://github.com/sspaeti/motherduck-rill.git <span class="o">&amp;&amp;</span> <span class="nb">cd</span> motherduck-rill
</span></span><span class="line"><span class="cl">cp .env.example .env
</span></span><span class="line"><span class="cl"><span class="c1"># Add your MotherDuck token to .env</span>
</span></span><span class="line"><span class="cl">rill start
</span></span><span class="line"><span class="cl"><span class="c1"># Open http://localhost:9009</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>That&rsquo;s it. A few commands and you&rsquo;re exploring which databases are most desired in the US, which languages pay the highest salaries, or how AI adoption shifted across years, all backed by MotherDuck&rsquo;s serverless cloud.</p>
<p>Answering the question of &ldquo;Which databases are most desired in the US according to Stack Overflow&rdquo;:<br>
<a target="_blank" href="/blog/agentic-friendly-local-first-analytics-stack/stackoverflow-dashboard.webp" title="">

</a></p>
<p>The SQL models use standard DuckDB syntax throughout. For example, the <code>database_analysis</code> model unnests semicolon-separated survey responses into one row per database per relationship type (used, admired, desired), then the metrics view aggregates them with <code>COUNT(DISTINCT ResponseId)</code>. The same SQL runs locally via embedded DuckDB or via MotherDuck — the connector config is the only difference. This is what BI-as-Code looks like in practice.</p>
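<p>The unnest-then-count-distinct logic described above can be sketched in plain Python. This is only an illustration of the technique, not the project&rsquo;s SQL: the three sample rows are made up, and the <code>used</code> and <code>desired</code> keys are hypothetical stand-ins for the survey columns. Only <code>ResponseId</code> and the semicolon-separated format come from the description above.</p>

```python
from collections import defaultdict

# Made-up sample rows; the real survey table has many more columns.
responses = [
    {"ResponseId": 1, "used": "PostgreSQL;SQLite", "desired": "DuckDB"},
    {"ResponseId": 2, "used": "PostgreSQL", "desired": "DuckDB;PostgreSQL"},
    {"ResponseId": 3, "used": "", "desired": "DuckDB"},
]


def unnest(rows, relationships=("used", "desired")):
    """Yield one (ResponseId, relationship, database) tuple per list entry."""
    for row in rows:
        for relationship in relationships:
            for database in row[relationship].split(";"):
                if database:  # skip empty answers
                    yield row["ResponseId"], relationship, database


def count_distinct_respondents(rows):
    """Mimic COUNT(DISTINCT ResponseId) per relationship and database."""
    seen = defaultdict(set)
    for response_id, relationship, database in unnest(rows):
        seen[(relationship, database)].add(response_id)
    return {key: len(ids) for key, ids in seen.items()}


if __name__ == "__main__":
    counts = count_distinct_respondents(responses)
    print(counts[("desired", "DuckDB")])
```

<p>Collecting respondent IDs in a set before counting is what keeps a respondent from being counted twice for the same database and relationship type.</p>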
<p>The example above could be created from a couple of simple prompts, since Rill&rsquo;s definitions are all local and the data on MotherDuck is accessible through the DuckDB CLI.</p>
<h3 id="multi-cloud-cost-analyzer-production-grade-connector-switching">Multi-Cloud Cost Analyzer: Production-Grade Connector Switching</h3>
<p>For a more production-like setup, I use the <a href="https://github.com/ssp-data/cloud-cost-analyzer" target="_blank" rel="noopener noreffer">cloud-cost-analyzer</a>, which I built in a previous edition to visualize costs from different hyperscalers with ClickHouse, and to which I have now added MotherDuck.</p>
<p>This project shows how MotherDuck fits into a real data pipeline alongside local DuckDB and ClickHouse. The <a href="https://dlthub.com/" target="_blank" rel="noopener noreffer">dlt</a> pipelines and Rill dashboards stay the same; I just added a different destination to dlt. Zero pipeline code changes were needed.</p>
<p>The key insight: MotherDuck uses DuckDB SQL syntax, so the SQL models share all functions with local DuckDB. Only the <code>FROM</code> clause differs: <code>read_parquet('...')</code> locally vs <code>schema.table</code> on MotherDuck. The Rill SQL models use a 3-way conditional to handle these small differences between connectors:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="err">{{</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">eq</span><span class="w"> </span><span class="p">.</span><span class="n">env</span><span class="p">.</span><span class="n">RILL_CONNECTOR</span><span class="w"> </span><span class="s2">&#34;motherduck&#34;</span><span class="w"> </span><span class="err">}}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">CAST</span><span class="p">(</span><span class="n">SPLIT_PART</span><span class="p">(</span><span class="n">identity_time_interval</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;T&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="nb">DATE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="nb">date</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">aws_costs</span><span class="p">.</span><span class="n">cur_export_test_00001</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="err">{{</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">eq</span><span class="w"> </span><span class="p">.</span><span class="n">env</span><span class="p">.</span><span class="n">RILL_CONNECTOR</span><span class="w"> </span><span class="s2">&#34;clickhouse&#34;</span><span class="w"> </span><span class="err">}}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">toDate</span><span class="p">(</span><span class="n">splitByChar</span><span class="p">(</span><span class="s1">&#39;T&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">identity_time_interval</span><span class="p">)[</span><span class="mi">1</span><span class="p">])</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="nb">date</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">aws_costs___cur_export_test_00001</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="err">{{</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="err">}}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">CAST</span><span class="p">(</span><span class="n">SPLIT_PART</span><span class="p">(</span><span class="n">identity_time_interval</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;T&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="nb">DATE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="nb">date</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">read_parquet</span><span class="p">(</span><span class="s1">&#39;data/aws_costs/cur_export_test_00001/*.parquet&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="err">{{</span><span class="w"> </span><span class="k">end</span><span class="w"> </span><span class="err">}}</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Notice how MotherDuck and local DuckDB share the exact same SQL functions (<code>CAST</code>, <code>SPLIT_PART</code>) — only the <code>FROM</code> source changes. ClickHouse needs its own dialect (<code>toDate</code>, <code>splitByChar</code>). This is the practical advantage of MotherDuck being DuckDB in the cloud: your SQL models stay the same.</p>
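<p>To make the dialect split concrete, here is a minimal sketch in Python of the same branching logic. The helper <code>build_model_sql</code> and the connector keys are hypothetical and not part of Rill or the repository. It also assumes that the first branch of the template above is the MotherDuck one.</p>

```python
# Hypothetical helper mirroring the templated model above; names are illustrative only.
DUCKDB_DATE_EXPR = "CAST(SPLIT_PART(identity_time_interval, 'T', 1) AS DATE)"
CLICKHOUSE_DATE_EXPR = "toDate(splitByChar('T', identity_time_interval)[1])"

# FROM sources copied from the template; the mapping to connector names is an assumption.
SOURCES = {
    "motherduck": "aws_costs.cur_export_test_00001",
    "clickhouse": "aws_costs___cur_export_test_00001",
    "duckdb": "read_parquet('data/aws_costs/cur_export_test_00001/*.parquet')",
}


def build_model_sql(connector: str) -> str:
    """Return the model query for a connector, falling back to local DuckDB."""
    source = SOURCES.get(connector, SOURCES["duckdb"])
    # Only ClickHouse needs its own functions; MotherDuck and DuckDB share one dialect.
    date_expr = CLICKHOUSE_DATE_EXPR if connector == "clickhouse" else DUCKDB_DATE_EXPR
    return f"SELECT {date_expr} AS date, *\nFROM {source}"
```

<p>In this sketch, the MotherDuck and local DuckDB queries differ only in their <code>FROM</code> line, which is the point of the paragraph above.</p>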
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">git clone git@github.com:ssp-data/cloud-cost-analyzer.git
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> cloud-cost-analyzer
</span></span><span class="line"><span class="cl">uv sync
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 1. Add your MotherDuck token to .dlt/secrets.toml</span>
</span></span><span class="line"><span class="cl"><span class="c1">#    [destination.motherduck.credentials]</span>
</span></span><span class="line"><span class="cl"><span class="c1">#    password = &#34;eyJ...&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 2. Add token to viz_rill/.env</span>
</span></span><span class="line"><span class="cl"><span class="c1">#    CONNECTOR_MOTHERDUCK_TOKEN=&#34;eyJ...&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 3. Load data into MotherDuck (same pipelines, different destination)</span>
</span></span><span class="line"><span class="cl">make run-etl-motherduck
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 4. Start Rill dashboards backed by MotherDuck</span>
</span></span><span class="line"><span class="cl">make serve-motherduck
</span></span></code></pre></td></tr></table>
</div>
</div><p>Under the hood, <code>DLT_DESTINATION=motherduck</code> tells dlt to write to MotherDuck instead of local parquet files. The same data is then also visible in the <a href="https://app.motherduck.com/" target="_blank" rel="noopener noreffer">MotherDuck web UI</a> for ad-hoc querying alongside the Rill dashboards.</p>
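<p>A minimal sketch of how such an environment switch could look in Python. The function name <code>resolve_destination</code> and the <code>filesystem</code> default are assumptions for illustration; the actual repository may wire this differently.</p>

```python
import os


def resolve_destination(env=None) -> str:
    """Pick the dlt destination name from DLT_DESTINATION.

    Falls back to "filesystem" (assumed here to be the local parquet setup).
    """
    env = os.environ if env is None else env
    return env.get("DLT_DESTINATION", "filesystem")


# With dlt installed, the resolved name could be passed on like this (not executed here):
#   import dlt
#   pipeline = dlt.pipeline(
#       pipeline_name="cloud_costs",          # hypothetical name
#       destination=resolve_destination(),
#       dataset_name="aws_costs",
#   )
```

<p>The pipeline code itself stays untouched; only the resolved destination name changes between a local and a MotherDuck run.</p>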
<figure><a target="_blank" href="/blog/agentic-friendly-local-first-analytics-stack/motherduck-ui.webp" title="">

</a><figcaption class="image-caption">Querying the same data in MotherDuck&rsquo;s Notebook UI.</figcaption>
</figure>
<p>This pattern — <code>make serve-motherduck</code> vs <code>make serve</code> vs <code>make serve-clickhouse</code> — shows what switching from local to cloud looks like in a CLI-first stack.</p>
<h3 id="driotech-agentic-analytics-in-production">Driotech: Agentic Analytics in Production</h3>
<p>Beyond my own demos, <a href="https://www.youtube.com/watch?v=i7dHS0XxW8U" target="_blank" rel="noopener noreffer">Salomon from Driotech</a> showcased how they use Rill for agentic analytics with client data using MotherDuck in combination with Airbyte, dlt, BigQuery and dbt.</p>
<p>In his webinar on empowering businesses with <strong>agentic analytics</strong>, he walked through a B2B sales use case that highlights exactly the principles we discussed above.</p>
<p><div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/i7dHS0XxW8U?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>
<br>
<em>Video walkthrough with the example of Airbyte/dlt + BigQuery/MotherDuck/dbt + Rill.</em></p>
<p>His key takeaways align perfectly with the MotherDuck + Rill thesis:</p>
<ol>
<li><strong>Metrics as the foundation</strong>: Before any AI agent can work reliably, you need clearly defined KPIs and a single source of truth. &ldquo;If you feed any AI agent with a mess, you&rsquo;re going to end up with an even bigger mess.&rdquo; → This is exactly what Rill&rsquo;s metrics layer provides: versioned, agreed-upon definitions in YAML.</li>
<li><strong>Governance as guardrails</strong>: The agent doesn&rsquo;t hallucinate on business concepts because the semantic model constrains what it can query. When asking &ldquo;what are our shipping costs?&rdquo; → The agent looks up the exact metric definition, avoiding guessing from raw data.</li>
<li><strong>From reactive to proactive</strong>: Beyond just answering questions, the demo showed creating alerts (&ldquo;if customer orders drop to zero in 14 days, email the account manager&rdquo;) and scheduled reports → Pushing insights to stakeholders automatically.</li>
<li><strong>Code-defined dashboards</strong>: Salomon explicitly called out that Rill&rsquo;s code-based approach means &ldquo;AI agents are able to build the dashboards for us&rdquo; → Because the language of AI is code, and Rill&rsquo;s dashboards are just that.</li>
</ol>
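<p>The guardrail and alerting ideas from the takeaways above can be sketched in a few lines of Python. Everything here is hypothetical: the metric names, the expressions, and the helpers <code>resolve_metric</code> and <code>needs_alert</code> are illustrations, not Rill&rsquo;s API or Driotech&rsquo;s implementation.</p>

```python
from datetime import date, timedelta

# Hypothetical metric registry standing in for a semantic layer defined in YAML.
METRICS = {
    "shipping_costs": "SUM(shipping_cost)",
    "total_orders": "COUNT(*)",
}


def resolve_metric(name: str) -> str:
    """Return the agreed-upon expression, or fail loudly instead of guessing."""
    key = name.strip().lower().replace(" ", "_")
    if key not in METRICS:
        raise KeyError(f"Unknown metric {name!r}; defined metrics: {sorted(METRICS)}")
    return METRICS[key]


def needs_alert(last_order: date, today: date, window_days: int = 14) -> bool:
    """True when no order was placed within the last `window_days` days."""
    return (today - last_order) >= timedelta(days=window_days)
```

<p>The design choice worth noting is that an unknown metric raises an error rather than falling back to raw tables, which is what keeps the agent from inventing business concepts.</p>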
<p>The webinar reinforces a central point of this article: conversational BI already works in Rill. When you have a solid semantic foundation (metrics in Rill), a scalable backend (MotherDuck), and a code-first workflow, the agentic layer becomes practical, trustworthy, and something you can deploy to real users today.</p>
<div class="details admonition note open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-note"></i>More Interesting Resources<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content"><ul>
<li>Upload a local DuckDB to MotherDuck in two lines: <code>ATTACH 'md:'; CREATE DATABASE my_db FROM 'local.duckdb';</code></li>
<li><a href="https://www.youtube.com/watch?v=tEIQGgS4Zus" target="_blank" rel="noopener noreffer">Dashboards vs. Agents podcast with Mike Driscoll and Joe Reis</a></li>
<li><a href="https://motherduck.com/blog/small-data-manifesto/" target="_blank" rel="noopener noreffer">The Small Data Manifesto</a> by MotherDuck</li>
<li><a href="https://youtu.be/10d8HxS4y_g?si=wYZKVTs5IMxXrxci" target="_blank" rel="noopener noreffer">Local-First Software mini-documentary</a> by CultRepo (previously Honeypot)</li>
</ul>
</div>
        </div>
    </div>
<h2 id="whats-next-dashboards-no-more">What&rsquo;s Next, Dashboards no More?</h2>
<p>Seeing these examples, and where we are heading with agentic BI, one might ask: <strong>do we even still need dashboards?</strong> Can&rsquo;t we just ask the chatbot?</p>
<p>I agree with Mike&rsquo;s <a href="https://www.youtube.com/watch?v=tEIQGgS4Zus" target="_blank" rel="noopener noreffer">argument</a>: &ldquo;No, well-crafted, and especially operational dashboards, they will never go away&rdquo;. A visualization can convey far more <strong>condensed information</strong> in a couple of seconds than a chat ever will, or than a chat could without many back-and-forth messages.</p>
<p>There&rsquo;s also a difference between &ldquo;a&rdquo; dashboard and &ldquo;the&rdquo; dashboards. More than half of dashboards are exploratory in nature, built to quickly explore the business or data. Then there are the sales dashboards that someone spent weeks or months, sometimes years at large companies, perfecting to ensure the numbers are correct. Key business decisions are derived from them, and even a sales rep&rsquo;s compensation, based on how much they sold, is usually tracked in a dashboard.</p>
<p>One more thing that chatbots can&rsquo;t replace is so-called <strong>&ldquo;drilling down&rdquo;</strong> and <strong>using data as a REPL</strong> with <a href="https://docs.rilldata.com/guide/dashboards/explore/pivot" target="_blank" rel="noopener noreffer">pivot tables</a>, which is why <a href="https://www.rilldata.com/blog/why-pivot-tables-never-die" target="_blank" rel="noopener noreffer">they never die</a>. You can drill down to the lowest level of detail and back up to the aggregated data within seconds, simply by dragging and dropping dimensions and measures around.</p>
<blockquote>
<p>[!tip] People don&rsquo;t know what they are looking for<br>
Benn Stancil <a href="https://benn.substack.com/p/which-way-from-here" target="_blank" rel="noopener noreffer">argued</a>, &ldquo;the challenge with data exploration is not that people don&rsquo;t have the ability to manipulate data; it&rsquo;s that they don&rsquo;t know what they&rsquo;re looking for.&rdquo;</p>
</blockquote>
<h3 id="natural-language-interface-convenient-but-inaccurate">Natural Language Interface: Convenient, but Inaccurate?</h3>
<p>With agents, we can use natural language as an interface to input domain knowledge. The agent does the technical translation and implements it in a code-first, declarative approach, where the context is defined far more clearly and distinctly than in natural language, which carries lots of nuance and ambiguity.</p>
<p>With that approach, the agent puts the request into deterministic YAML, which can then be reviewed, tested, and automated against. So we move from human to agents to context to iteration and, finally, visualization:</p>
<p>
<br>
Similar to what we discussed in <a href="https://www.rilldata.com/blog/bi-as-code-and-the-new-era-of-genbi" target="_blank" rel="noopener noreffer">GenBI</a>, iterating much faster than with the traditional, non-generative way:<br>
<figure><a target="_blank" href="/blog/agentic-friendly-local-first-analytics-stack/genbi-workflow-prompt-generate-ship.webp" title="">

</a><figcaption class="image-caption">BI-as-Code with agents that: 1. Prompt 2. Generate 3. Ship</figcaption>
</figure></p>
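<p>Because the agent&rsquo;s output is plain, structured text, it can be checked before it ships. The following Python sketch validates a measure definition after it has been parsed from YAML into a dictionary. The required keys and the helper <code>validate_measure</code> are assumptions for illustration, not Rill&rsquo;s actual schema or validation logic.</p>

```python
# Hypothetical minimal schema for one agent-generated measure (already parsed from YAML).
REQUIRED_KEYS = {"name", "expression", "description"}


def validate_measure(measure: dict) -> list[str]:
    """Return a list of problems; an empty list means the definition can go to review."""
    problems = []
    missing = REQUIRED_KEYS - measure.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    # Reject multi-statement input so a measure stays a single SQL expression.
    if ";" in measure.get("expression", ""):
        problems.append("expression must be a single SQL expression")
    return problems
```

<p>A check like this can run in CI on every generated file, so the review step focuses on whether the metric is right rather than whether the file is well-formed.</p>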
<h3 id="self-serve-with-bi-as-code-providing-the-context">Self-Serve with BI-as-Code providing the <em>Context</em></h3>
<p>So what&rsquo;s next? Are we finally arriving at self-service BI? (the never-ending promise🙂).</p>
<p>Agents with natural language solve a big problem that self-serve has always strived to solve: giving less technical domain users an edge to serve themselves. They might even go further and fix the data by prompting the data pipeline to correct a timestamp, or update a dbt model with the right table source.</p>
<p>With domain experts doing more developer work with agents, we can combine domain knowledge with coding abilities of agents, bridged by natural language with <strong>BI-as-Code providing the semantic context</strong> that includes models, metrics and even dashboards in plain YAML.</p>
<p>Context is king for the near future. Everything that can be locally defined, such as Rill&rsquo;s metrics layer and dashboards, will be so much faster and better built with agents.</p>
<p>But BI is not the only domain noticing this power. <a href="https://www.linkedin.com/in/kurtbuhler/" target="_blank" rel="noopener noreffer">Kurt Buhler</a> of Tabular Editor <a href="https://tabulareditor.com/blog/ai-agents-with-command-line-tools-to-manage-semantic-models" target="_blank" rel="noopener noreffer">wrote</a>: &ldquo;CLI tools provide an alternative way to interact with software in a terminal by writing and executing commands. This command-line interface is very suitable for agents.&rdquo;</p>
<h3 id="sql-yaml-and-why-language-choices-matter">SQL, YAML: And why Language Choices Matter</h3>
<p>With Rill and MotherDuck we also choose <strong>SQL as our primary language</strong> and interface and <strong>YAML as the structured format</strong> for storing definitions. The language choice matters because the training data for widely adopted languages like SQL is larger, so LLMs are better at generating BI-as-code in SQL than in DAX, LookML, or some obscure language.</p>
<p>Wes McKinney even <a href="https://wesmckinney.com/blog/agent-ergonomics/" target="_blank" rel="noopener noreffer">argues</a> that AI agents are enabling him to build software in languages like Go and Swift, despite his lack of prior experience. He says that &ldquo;human ergonomics in programming languages matters much less now,&rdquo; as agents prioritize fast compile-test cycles and frictionless distribution, favoring languages like Go and Rust over Python for new systems. If you&rsquo;re interested in more, check out the <a href="https://wesmckinney.com/transcripts/2026-02-10-rill-data-podcast" target="_blank" rel="noopener noreffer">conversation with Wes and Mike</a>.</p>
<h3 id="limitations">Limitations</h3>
<p>One limitation for the future of data might be the <strong>imprecision of natural language</strong> and of how we communicate. For example: &ldquo;give me the analytics for this week&rdquo;. Did you mean &ldquo;the last seven days up to today&rdquo;? Or &ldquo;the full week, Monday to Sunday&rdquo;? Starting &ldquo;from midnight, or from this time of day&rdquo;? So many unknowns and possible misinterpretations.</p>
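<p>To illustrate how quickly &ldquo;this week&rdquo; diverges, here is a small Python sketch with three plausible readings. The function names are made up for this example, and weeks are assumed to start on Monday.</p>

```python
from datetime import date, timedelta


def rolling_week(today: date) -> tuple[date, date]:
    """The last seven days, including today."""
    return today - timedelta(days=6), today


def calendar_week(today: date) -> tuple[date, date]:
    """The full Monday-to-Sunday week that contains today."""
    monday = today - timedelta(days=today.weekday())
    return monday, monday + timedelta(days=6)


def week_to_date(today: date) -> tuple[date, date]:
    """From Monday of the current week up to today."""
    monday = today - timedelta(days=today.weekday())
    return monday, today
```

<p>For a Thursday, the three functions return three different date ranges for the same question, which is exactly the ambiguity an agent has to resolve, ideally through an explicit definition in the metrics layer rather than a guess.</p>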
<p>The other is that the future of data needs to be <strong>deterministic and reproducible</strong>, for example to backfill faulty data, but AI agents are non-deterministic by nature. That can be challenging.</p>
<h2 id="text-based-local-first-the-architecture-agents-need">Text-Based, Local-First: The Architecture Agents Need</h2>
<p>The connectors between source and destination are getting more flexible, more fluid, self-healing. Knowledge workers might soon be able to not only figure out the problem, but also act on it directly: &ldquo;send an email to&hellip;&rdquo; to fix the problem.</p>
<p>As the examples in this article show, we&rsquo;ve come a long way with <a href="https://www.rilldata.com/blog/has-self-serve-bi-finally-arrived-thanks-to-ai" target="_blank" rel="noopener noreffer">Self-Serve BI</a>, and we might already be there. Keep in mind the <strong><a href="https://www.rilldata.com/blog/data-modeling-for-the-agentic-era-semantics-speed-and-stewardship" target="_blank" rel="noopener noreffer">three pillars: semantics, stewardship, speed</a></strong> as you work in the <a href="https://www.rilldata.com/blog/data-modeling-for-the-agentic-era-semantics-speed-and-stewardship" target="_blank" rel="noopener noreffer">Agentic Era</a>.</p>
<p>The MotherDuck + Rill story is ultimately about the data industry discovering that the tools best suited for AI agents are the same tools that respect simplicity, transparency, and developer ergonomics.</p>
<p>The &ldquo;small data&rdquo; thesis didn&rsquo;t anticipate the AI agent revolution, but it created the conditions for it: when your data fits on a laptop and your dashboards are YAML files, an AI agent can read, reason about, and act on your entire analytics stack.</p>
<p>The irony is that going back to local-first, text-based, SQL-defined analytics turns out to be the most forward-looking architecture. And dashboards become agents when they&rsquo;re written as code.</p>
<p>&ndash;</p>
<p>If you want to go deeper, check out these related articles on conversational BI, BI-as-Code, and how AI can enable self-serve in the world of BI:</p>
<ul>
<li><a href="/blog/bi-as-code-and-genbi/" rel="">BI-as-Code and the New Era of GenBI</a></li>
<li><a href="/blog/agentic-data-modeling/" rel="">Data Modeling for the Agentic Era: Semantics, Speed, and Stewardship</a></li>
<li><a href="/blog/self-service-bi-ai/" rel="">Has Self-Serve BI Finally Arrived Thanks to AI?</a></li>
</ul>
<hr>
<pre class=""><em>Full article published at <a href="https://www.rilldata.com/blog/building-an-agent-friendly-local-first-analytics-stack-with-motherduck-and-rill" target="_blank" rel="noopener noreferrer">Rilldata.com</a> - written as part of <a href="/services">my services</a></em></pre>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>See <a href="https://assets.amazon.science/7d/d6/b0e0ff5749ceb42ca6a8437038bc/why-tpc-is-not-enough-an-analysis-of-the-amazon-redshift-fleet.pdf" target="_blank" rel="noopener noreffer">Redshift Files</a>&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>If you don&rsquo;t see the sample_data database, your account may predate the sample data being added&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
</item>
<item>
    <title>Why I Still Blog — and Why the Future of Blogging Is Connected</title>
    <link>https://www.ssp.sh/blog/why-i-still-blog/</link>
    <pubDate>Fri, 06 Mar 2026 20:00:17 &#43;0100</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/why-i-still-blog/</guid><enclosure url="https://www.ssp.sh/blog/why-i-still-blog/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>I&rsquo;ve been online for twenty years, and blogging for ten of them. This is the story of a decade of blogging online and the lessons I learned along the way. It goes beyond blogging and includes <a href="https://www.ssp.sh/blog/obsidian-note-taking-workflow/" target="_blank" rel="noopener noreffer">note-taking (workflow)</a>, how to write well, the medium in which writing works best, and what makes writing last long-term, such as open formats and methods like vim motions to navigate and edit like a surgeon.</p>
<p>My prediction, and hope, is that the [[Future of Blogging]] is more connected. Not one-dimensional, like a single sheet of paper, but more like a maze that you can enter to explore and learn new things.</p>
<p>This is how I built up <a href="/brain" rel="">my Second Brain</a>, and you can see the interactive graph at the end of this blog, connecting all notes and blogs that are related.</p>
<p>This article is based on a recent interview for &ldquo;<a href="https://open.substack.com/pub/writethatblog/p/simon-spati-on-technical-blogging" target="_blank" rel="noopener noreffer">Write that Blog</a>&rdquo;. It triggered me to finally write this piece after collecting hundreds of notes related to writing online and blogging in my second brain.</p>
<h2 id="why-i-started-blogging-learning-oss-tools">Why I Started Blogging: Learning OSS Tools</h2>
<p>A quick note on how I got started. Mainly it was out of curiosity. As a business intelligence specialist with a Microsoft licence, I was curious about open-source tools with abilities similar to [[SSAS]] and [[SSRS]], which we used at work. Even more so, I was drawn to the programmatic-first approach to automating things, instead of clicking my way through the UI of older GUI-first tools.</p>
<p>I had some time when I lived in Copenhagen, Denmark, which I used to explore and document what I learned. I already had a domain (sspaeti.com) and web development experience from a site with weekly party pictures that I ran for many years. It was no longer active once Facebook and other portals emerged, so I decided to pivot it to a <strong>personal blog</strong>, a format that was very popular back then.</p>
<p>So I started with my WordPress blog and uploaded <a href="/blog/ssas-cubes-dynamic-generation-of-partition" rel="">some scripts</a> and learnings around Microsoft and the related automation that I had picked up and was re-using often. Then I did a deep dive and a <a href="/blog/data-warehouse-automation-dwa/" rel="">series on data warehouse automation tools</a>, which got very good feedback after the initial posts didn&rsquo;t go anywhere.</p>
<p>I found myself enjoying the process of distilling knowledge into a compact format, so others, and mainly myself, could learn new topics. The [[Feedback Loop]] was another amazing feeling that I hadn&rsquo;t known beforehand, following the principle [[The more you share the more you get]]: people were giving me suggestions, new ideas, and sometimes criticism. All of it led me to even more open-source tools and interesting approaches.</p>
<h3 id="what-started-as-a-hobby-turned-into-a-full-time-job-and-business">What Started as a Hobby Turned into a Full-time Job and Business</h3>
<p>Writing became one of my favorite hobbies, and it gave me lots of fulfillment: not the short-term dopamine hit, but the long-term [[Deep Happiness]] of learning, being appreciated by readers, and turning the notes I had taken over a long time into something more usable that people could share. [[Learn in Public|Learning in Public]], as some called it later on.</p>
<p>I reserved Friday nights in my favorite library in Copenhagen, bought my favorite coffee at <a href="https://espressohouse.com/en" target="_blank" rel="noopener noreffer">Espresso House</a>, usually with a nice cookie or something sweet, and then I was off for 2-4 hours. Sometimes nothing really good came out; it was hard. Other times I was just trying new tools like [[Dagster]], [[Delta Lake]] etc., and at others I was in a [[Deep Work|deep flow]] of writing, almost like a trance.</p>


<div class="bluesky-embed-wrapper" style="display: flex; justify-content: center; margin: 1.5em 0;">
    <blockquote class="bluesky-embed" data-bluesky-uri="at://did:plc:edglm4muiyzty2snc55ysuqx/app.bsky.feed.post/3la3zwbcabo2e" data-bluesky-cid="bafyreihrz6eb6bnrgruie6ul6xixggfju646jx2afgsvztaqhpl4zbagnm"><p>Back where it all started 📚 #writing #blogging</p>&mdash; <a href="https://bsky.app/profile/did:plc:edglm4muiyzty2snc55ysuqx?ref_src=embed">Simon Späti 🏔️ (@ssp.sh)</a> <a href="https://bsky.app/profile/did:plc:edglm4muiyzty2snc55ysuqx/post/3la3zwbcabo2e?ref_src=embed">2021-10-25T09:43:10Z</a></blockquote><script async src="https://embed.bsky.app/static/embed.js" charset="utf-8"></script>
  </div>
  <script>
  (function() {
    function updateBlueskyTheme() {
      var isDark = document.body.getAttribute('theme') === 'dark';
      var mode = isDark ? 'dark' : 'light';
      
      document.querySelectorAll('.bluesky-embed').forEach(function(el) {
        el.setAttribute('data-bluesky-embed-color-mode', mode);
      });
      
      document.querySelectorAll('.bluesky-embed-wrapper iframe').forEach(function(iframe) {
        var src = iframe.src;
        if (src) {
          var url = new URL(src);
          if (url.searchParams.get('colorMode') !== mode) {
            url.searchParams.set('colorMode', mode);
            iframe.src = url.toString();
          }
        }
      });
    }
    
    updateBlueskyTheme();
    
    new MutationObserver(function(mutations) {
      mutations.forEach(function(m) {
        if (m.attributeName === 'theme') updateBlueskyTheme();
      });
    }).observe(document.body, { attributes: true });
  })();
  </script>
<p>The breakthrough came much later, three years in, when I wrote about a new, upcoming topic and the transition I saw from data warehousing, in a post called <a href="https://www.ssp.sh/blog/data-engineering-the-future-of-data-warehousing/" target="_blank" rel="noopener noreffer">Data Engineering, the future of Data Warehousing?</a>. This was my first viral post, and popular figures like Dan Linstedt commented on it. It was surreal back then: why would these people read <em>my</em> article?</p>
<p>But it gave me the motivation to continue. Sure, I love writing and sharing in public, but I&rsquo;m not sure I would have continued until today if nobody had been reading it.</p>
<blockquote>
<p>[!note] A short journey of how my domain and website evolved</p>
<p>I started my first blog in <strong>2015</strong>, but I was already online and registered a domain in <strong>2004</strong>. I bought the domain sspaeti.com, where my first endeavor was web development with HTML, CSS and PHP. The classic Apache years. Fun fact: I still deploy to an Apache server to this day, but it&rsquo;s only static HTML now :)</p>
<p>From <strong>2005-2014</strong> I ran a local forum and party guide (this was before FB :) and then in 2015 I published my first data-related post. <strong>2016-2018</strong>: Regular blogging on Business Intelligence and data topics. In <strong>2019</strong> I started to focus more on open-source data engineering.<br>
In <strong>2021</strong> I moved from <a href="/blog/why-i-moved-away-from-wordpress/" rel="">WordPress to GoHugo</a>, and in <strong>2022</strong> I added the second brain to my website. This meant all my notes and blogs were powered by Markdown, which led me to share much more, as publishing no longer required any conversion or extra work. What I wrote, I could just publish as is on <a href="https://www.ssp.sh/brain" target="_blank" rel="noopener noreffer">my second brain</a>. To this day, I have ~9000 private notes and ~1000 public notes, plus 81 blog posts and some chapters of an early book I&rsquo;m writing in Markdown too :).</p>
<p><strong>2023</strong> I changed the domain to ssp.sh, as it&rsquo;s shorter :)</p>
</blockquote>
<h3 id="how-did-i-manage-to-continue-to-this-day">How Did I Manage to Continue to This Day?</h3>
<p>[[Writing is hard]], as anyone who does it will tell you. So why do I torture myself with it to this day? I even made it my full-time work: I&rsquo;m currently self-employed and work as a <a href="/services" rel="">full time author</a>.</p>
<p>The answer is not straightforward, but to tell the truth, I still love it to this day. Writing is my canvas as an artist, where I can let out my thoughts, be creative, and bring something complex into simple terms. Into something that anyone might want to read.</p>
<h2 id="how-has-blogging-changed-over-the-last-10-years">How Has Blogging Changed over the Last 10 Years?</h2>
<p>From my start, when I created a personal WordPress blog, to today, there have been several evolutions, but overall not much has changed in terms of personal blogs.</p>
<p>Personal blogs are still the same, except that the technology changed a couple of times: from Flash websites to hand-writing PHP/HTML/MySQL, to WordPress, to Medium and [[Static Site Generators (SSG)]], to Substack today. The main change is social media. Before, personal blogs had more authority and everyone was linking to other blogs. Today, a couple of social media tech giants hold a monopoly, and you almost need to share there to be discovered.</p>
<p>When I started, I was already using Twitter and LinkedIn too, but &ldquo;the game&rdquo; of distribution has changed. Still, the core purpose of personal blogs remains the same.</p>
<p>Today with AI we are even in a new era, with all the &ldquo;AI Slop&rdquo; generated and shared all over the place. I believe, and see it as my work as a professional writer, that the <a href="https://craft.ssp.sh/" target="_blank" rel="noopener noreffer">craft</a> of writing, and [[Writing Manually]], gets even more important.</p>
<p>Writing is communication, and we can&rsquo;t communicate through a filter. Yet that is what many are doing at the moment: the author converts bullets into prose, and the reader summarizes the prose back into bullet points, watering down the actual points and wording of the original author. To the point where most people, me included, [[I&rsquo;d rather Read the Prompt|would rather read the prompt]].</p>


<div class="bluesky-embed-wrapper" style="display: flex; justify-content: center; margin: 1.5em 0;">
    <blockquote class="bluesky-embed" data-bluesky-uri="at://did:plc:edglm4muiyzty2snc55ysuqx/app.bsky.feed.post/3mfqu2gurwk25" data-bluesky-cid="bafyreibjkjjo5c3b74t7jnwthblfkeculdjqfgb54ne2yycjvre7op3jai"><p lang="en">Related. marketoonist.com/2023/03/ai-w...</p>&mdash; <a href="https://bsky.app/profile/did:plc:edglm4muiyzty2snc55ysuqx?ref_src=embed">Simon Späti 🏔️ (@ssp.sh)</a> <a href="https://bsky.app/profile/did:plc:edglm4muiyzty2snc55ysuqx/post/3mfqu2gurwk25?ref_src=embed">2026-02-26T09:11:17.336Z</a></blockquote><script async src="https://embed.bsky.app/static/embed.js" charset="utf-8"></script>
  </div>
<h2 id="blogs-vs-second-brain-notes">Blogs vs. Second Brain Notes</h2>
<p>One approach that I like to push, and that many are practicing locally with [[Obsidian]], is connected note taking. I share the Obsidian notes that are worth sharing on my <a href="/brain" rel="">public second brain</a>. My process is described at [[Public Second Brain with Quartz]]: I add <code>#publish</code> to a note and it appears on my site, no conversion needed. The code and utilities are shared on <a href="https://github.com/sspaeti/second-brain-public" target="_blank" rel="noopener noreffer">GitHub</a>, too.</p>
<p>This brings back connected personal notes, also internally on your website, using synergies between your blog and second brain. The way I currently think about [[Sharing as Second Brain Note vs a Blog Post]]:</p>
<blockquote>
<p>The second brain helps me to share whatever is in my mind, and the blog helps me to refine. <strong>Notes compound</strong> and are always evolving. Blog posts <strong>capture a moment in time</strong>.</p>
</blockquote>
<p>There&rsquo;s also the difference between long-term, always updated, [[compounding]] notes and the one-time, distilled blog article. They work so well together. As you might notice, most of the links in this article lead to long-term notes with much more information, notes that I have been collecting and refining over the years in my second brain.</p>
<p>This way, I can bring all notes into one storyline that reflects how I currently think and share it as a blog post like this one, while continually updating the related notes linked here as my long-term strategy for blogging, with its different [[Type of Notes]].</p>
<h3 id="connects-knowledge-helping-learning-the-same-way-as-our-brain-does">Connects Knowledge: Helping Us Learn the Same Way Our Brain Does</h3>
<p>I&rsquo;m thinking of Designing Data-Intensive Applications by Martin Kleppmann, where he added maps to his book that correlate similar terms:<br>


<div class="bluesky-embed-wrapper" style="display: flex; justify-content: center; margin: 1.5em 0;">
    <blockquote class="bluesky-embed" data-bluesky-uri="at://did:plc:edglm4muiyzty2snc55ysuqx/app.bsky.feed.post/3mfruvqq2rc2o" data-bluesky-cid="bafyreibvvckb3lxqwtsbvaxow2h2cmmlpbodv3nyuxxcmfetn2siaw7wuq"><p lang="en">Here are some of the maps. See how Kafka is close to Kinesis. Really like them.</p>&mdash; <a href="https://bsky.app/profile/did:plc:edglm4muiyzty2snc55ysuqx?ref_src=embed">Simon Späti 🏔️ (@ssp.sh)</a> <a href="https://bsky.app/profile/did:plc:edglm4muiyzty2snc55ysuqx/post/3mfruvqq2rc2o?ref_src=embed">2026-02-26T18:59:13.373Z</a></blockquote><script async src="https://embed.bsky.app/static/embed.js" charset="utf-8"></script>
  </div>
  <script>
  (function() {
    function updateBlueskyTheme() {
      var isDark = document.body.getAttribute('theme') === 'dark';
      var mode = isDark ? 'dark' : 'light';
      
      document.querySelectorAll('.bluesky-embed').forEach(function(el) {
        el.setAttribute('data-bluesky-embed-color-mode', mode);
      });
      
      document.querySelectorAll('.bluesky-embed-wrapper iframe').forEach(function(iframe) {
        var src = iframe.src;
        if (src) {
          var url = new URL(src);
          if (url.searchParams.get('colorMode') !== mode) {
            url.searchParams.set('colorMode', mode);
            iframe.src = url.toString();
          }
        }
      });
    }
    
    updateBlueskyTheme();
    
    new MutationObserver(function(mutations) {
      mutations.forEach(function(m) {
        if (m.attributeName === 'theme') updateBlueskyTheme();
      });
    }).observe(document.body, { attributes: true });
  })();
  </script></p>
<p>I see the connected, interactive graph on my second brain, and on my book (just <a href="https://bsky.app/profile/ssp.sh/post/3mfrlc74i7s2k" target="_blank" rel="noopener noreffer">recently added</a>) the same way. It helps learning.</p>
<p>It&rsquo;s proven that we learn much better when we can associate something new with a term or concept we already know. A new idea that sits orphaned in our brain, without a connection or synapse to another thought (or note, in our case), is hard to remember and learn from.</p>
<p>In a [[Second Brain]] and [[Digital Garden]] approach, you connect every note to at least one existing term. I also like to add its <code>origin</code> so I always know where it came from. More on this in <a href="/blog/obsidian-note-taking-workflow/" rel="">My Obsidian Note-Taking Workflow</a> if you want to go deeper.</p>
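<p>To make the idea of backlinks and orphaned notes concrete, here is a minimal Python sketch. It assumes notes are plain Markdown strings keyed by title and that links use the <code>[[Title]]</code> or <code>[[Title|Alias]]</code> wikilink syntax seen throughout this article. The sample notes and the names <code>build_backlinks</code> and <code>find_orphans</code> are hypothetical, chosen for illustration; this is not how Obsidian or Quartz implement it.</p>

```python
import re
from collections import defaultdict

# Matches [[Target]] and [[Target|Alias]]; only the target part is captured.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def build_backlinks(notes: dict[str, str]) -> dict[str, set[str]]:
    """Map each linked title to the set of notes that link to it."""
    backlinks = defaultdict(set)
    for title, body in notes.items():
        for target in WIKILINK.findall(body):
            backlinks[target.strip()].add(title)
    return dict(backlinks)


def find_orphans(notes: dict[str, str], backlinks: dict[str, set[str]]) -> list[str]:
    """Return notes that have neither outgoing nor incoming links."""
    return sorted(
        title
        for title, body in notes.items()
        if not WIKILINK.search(body) and title not in backlinks
    )


# Hypothetical sample vault, loosely based on the note titles mentioned here.
notes = {
    "Functional Data Engineering": (
        "Relates to [[Clarity]] and "
        "[[No Less Code vs Code|Code Is Still the Best Abstraction]]."
    ),
    "Clarity": "origin: [[Functional Data Engineering]]",
    "No Less Code vs Code": "Code as an abstraction.",
    "Loose Thought": "Not linked to anything yet.",
}

backlinks = build_backlinks(notes)
orphans = find_orphans(notes, backlinks)
print(backlinks["Clarity"])
print(orphans)
```

<p>In this sample, the only orphan is the note that neither links out nor is linked to, which is exactly the kind of note that is hard to find again later.</p>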
<p>For example, the note below about [[Functional Data Engineering]] (← click here to see the graph and backlinks in action) shows how, besides the written text, you can glance at connected notes through the interactive graph or through backlinks.<br>
![[img_index.en_1772811747298.webp]]</p>
<p>We can visually see things that are otherwise almost impossible to grasp. A map of a city conveys an information density that no explanation over the phone or in written text can match. It&rsquo;s the same with the graph.</p>
<p>And the best part: it&rsquo;s additive, so you don&rsquo;t need to look at it at all. But it is most helpful when you want to learn, or don&rsquo;t know the space that well yet. You see a term or connection you know, your brain immediately registers that these belong together, and you will probably remember it forever, or at least much longer.</p>
<p>E.g. in the above example, we might see that functional data engineering is linked to clarity, and to [[No Less Code vs Code|Code Is Still the Best Abstraction]], which might be non-obvious, but really helpful to know.</p>
<p>Again, linked notes are the best way to organize knowledge, especially optimized for <strong>learning</strong>. Knowledge doesn&rsquo;t grow linearly. It expands as a network over different seasons. I write more about that phenomenon and continue to update at [[Future of Blogging]].</p>
<h3 id="the-process---and-the-difference-between-long-term-and-short-term">The Process - And the Difference between Long-Term and Short-Term</h3>
<p>My process is essentially:</p>
<ol>
<li>An idea occurs while reading a book, listening to a podcast, talking to someone, or elsewhere</li>
<li>I write it down in my private second brain</li>
<li>I connect it to existing notes, refine the note, and add new thoughts and notes</li>
<li>Eventually I add <code>#publish</code> and publish it on my public second brain</li>
<li>Eventually I distill that note with many others into a blog post about a specific, related topic</li>
<li>I continue updating and refining the note</li>
<li>Eventually I write another blog post that relates to it, using the improved note</li>
<li>I use the note for my book topics</li>
<li>I refine the note again</li>
</ol>
<p>I think you get the gist. The initial one-liner, the note that came into existence from a real insight, is actually the most important piece of the whole process, in my opinion.</p>
<p>That is not to say the blog articles are unimportant; they both need each other. When writing a blog post, I distill and connect multiple notes at a given point in time into a frozen article. During that process, the notes also get updated. And when sharing the blog post, I get a lot of feedback (a [[Feedback Loop]]). I can&rsquo;t add that feedback to the blog post, as it&rsquo;s a snapshot in time, so I add it to the existing note instead.</p>
<p>So <strong>feedback is actively processed</strong> and massaged into my private or public second brain, improving my overall approach. I think this process of connected notes, going from private note to public note to blog post and back to the continuously updated note, is even more powerful than Niklas Luhmann&rsquo;s [[Zettelkasten]], which revolutionized the [[Smart Note Taking]] approach based on zettels: small, unique ideas that he connected to others.</p>
<blockquote>
<p>[!info] No Research needed with this process</p>
<p>Related to this, through this process [[Why I don&rsquo;t Research|I don&rsquo;t need to research]] topics in a classical sense. My life and insights come in steadily and get massaged and integrated like a slow-flowing river, all organically.</p>
</blockquote>
<h3 id="the-form-of-writing-long-form-or-short">The Form of Writing: Long Form or Short</h3>
<p>My writing usually tends to be very long-form. Because I take lots of notes that I try to connect and write up in an interesting way, I tend to get very long. This piece is already long too, and I still have 3000+ collected notes to go.</p>
<p>Also, long-form writing <strong>evokes a deeper relationship and trust that is hard to build</strong> with a couple of words. That&rsquo;s also why a [[Reading Books for a Happy Life|book]] can connect you to the author like no other medium can.</p>
<p>It&rsquo;s also a question of playing the <strong>long-term game</strong>: writing something that stays relevant for many years to come, versus capturing a quick trend and hastily putting out a blog post. These are totally different forms of content and strategies. The latter is also common on social media, where the aim is to grab attention and go viral for a short time.</p>
<p>But what I always ask myself is: what&rsquo;s the gain from it? Ultimately, likes and followers are just a [[Vanity Metric]] and, to me at least, don&rsquo;t count as much as a real human reading these words. Someone who doesn&rsquo;t leave a like or comment, but with whom I made a connection, or on whom I had an impact, in another part of the world I don&rsquo;t know (yet? I&rsquo;m always happy to get introduction emails from my readers! :)). Or just inspiring you, making you think about something related, or helping you learn something new.</p>
<p>That&rsquo;s at least my main goal. There&rsquo;s no hidden goal or message behind my writing. Obviously, if I write for my clients, it&rsquo;s a bit different, as I want them to succeed, whereas when I write for myself, I just want to let out my thoughts. But what I learned over the years is that [[Writing from The heart]], being genuine, also helps for work-related topics. At the end of the day, it&rsquo;s still a human being reading it, so the same principle applies as if it were just a random blog post.</p>
<h2 id="my-writing-process">My Writing Process</h2>
<p>I&rsquo;d like now to switch gears a bit as we went through the differentiation of compounding, refined notes and in-time blog posts, and talk about my writing process that I have mastered or improved over the years.</p>
<p>This is my unique writing approach and even more so, note-taking approach. This is not how you have to do it, and probably won&rsquo;t work for you. As I learned on the <a href="https://www.youtube.com/watch?v=KU5FUqbqMK0&amp;list=PLFxhXLgGkVzKCn23_g8qM19DMDgco8eNJ" target="_blank" rel="noopener noreffer">How I Write (Podcast)</a> by David Perell, each author has a totally different approach. And this here is mine.</p>
<p>But here I want to share a little bit more about my strategies, my approach to writing, and my tips and tricks I have learned and noted down over the years.</p>
<h3 id="i-spend-many-hours-weeks-and-months-on-single-blog-posts">I Spend Many Hours, Weeks and Months on Single Blog Posts</h3>
<p>But even before that, a quick preface: I spend many hours, weeks, sometimes months on a single blog post. <a href="/blog/why-are-we-here-on-earth/" rel="">Why Are We Here on Earth?</a>, for example, I wrote over the course of two years. If you include my note-taking, it is sometimes even longer, because some of my notes are ten years old or older.</p>
<p>The longer I work on a piece, the more learning, and sometimes struggle, I can put into it, which also helps it not to be outdated within a few months. Something that I&rsquo;ve been grappling with for months or years most probably won&rsquo;t be gone tomorrow.</p>
<p>This is one reason why I don&rsquo;t like to write too much about AI at the current pace: everything I write, and would spend lots of hours on, might be outdated the moment I publish.</p>
<p>That&rsquo;s why my approach is to just collect and refine my thoughts in a second brain note, in this case [[Will AI replace Humans]] and related notes, which already made it to the front page of Hacker News. I&rsquo;m sure at some point I will take that note and all its related notes and distill them into a single blog post. But the time hasn&rsquo;t come yet, as so much is changing.</p>
<p>It&rsquo;s not that I don&rsquo;t do it at all. Sometimes I will write something quick, something maybe less long-term, but usually that is just less fun for me to write, and maybe less challenging? That said, some topics and articles that I have slept on for a long time just poured out of me in one go: no sophisticated linking in my second brain or other approach, just a blank page and writing it down. Usually these are topics that I have read extensively about, that I&rsquo;m discussing with people, or that are just dear to my heart. My subconscious keeps working on them until it tells me it&rsquo;s ready, and then I must not miss the opportunity and just write it down.</p>
<p>This piece and topic are a little like that. It&rsquo;s so dear to my heart, and something I have wanted to write for so long, yet I never did. Now the &ldquo;write that blog&rdquo; interview triggered so many questions that I just went on and wrote up to this point in one flow. No breaks, just free flow and combining different notes I have in my Obsidian vault.</p>













  
<figure><a target="_blank" href="/blog/why-i-still-blog/img_index.en_1772814021534.webp" title="">

</a><figcaption class="image-caption">What my Second Brain looks like while writing this very article</figcaption>
</figure>
<p>This is what my vault and process look like right now, with:</p>
<ol>
<li>The current note</li>
<li>The long, long outline (you can&rsquo;t even see half of it)</li>
<li>Related notes through smart connections</li>
<li>The initial &ldquo;write that blog&rdquo; interview I answered</li>
<li>Another connected note, the one I wrote about just above</li>
<li>More related notes</li>
<li>Word counts on the lower right, and the Vim Motions I write the article in</li>
</ol>
<blockquote>
<p>[!example] Don&rsquo;t focus too much on the numbers, but on writing<br>
With social media, you could focus too much on a [[Vanity Metric]] such as how many likes you get. I try not to care too much about it, though it&rsquo;s still needed. I wrote more in <a href="/blog/well-being-algorithms/" rel="">Well Being in Times of Algorithms</a>, my personal essay towards a better World Wide Web, about how well-being is connected to social media.</p>
</blockquote>
<h3 id="ultimate-goal-good-storytelling">Ultimate Goal: Good Storytelling</h3>
<p>If I had to summarize my writing process, or its goal, it has crystallized for me lately that the goal is great storytelling. I want to write about a topic with an intro that catches attention, a great body, and an ending that has a hook and ties everything together.</p>
<p>Storytelling as in the movies, with a main act, a second act, the villain, etc., holds true for writing too. The difference is that you can&rsquo;t use fancy show effects; you are left with simple words.</p>
<blockquote>
<p>[!abstract] What is Good Storytelling?</p>
<p>This obviously is very personal, and differs from person to person. To me, most of it boils down to the art of leaving things out, which I am getting much better at over time. And I think that is really what storytelling is all about.</p>
</blockquote>
<p>That&rsquo;s why it&rsquo;s very important to use what you have as a writer. When writing a blog post like this, one of the most important tools, and the one I like to use most, is the length of a paragraph: making paragraphs look good and breaking them at the right moment when re-reading. End on a high, then start the next one so that it connects but brings a new insight. The paragraphs should be of different lengths, they should be interesting, and they should vary throughout the piece.</p>
<p>Mix it up with images, with some quotes, or with what I like a lot: [[Admonition (Call-outs)|callouts]]. The reason I like callouts is that I can add an additional story, a side note that does not distract from the main storyline but serves readers who like some behind-the-scenes or more information. Plus, they look beautiful in my eyes, adding different colors to the blog post, as each of my callout types has a different color. It makes the post more interesting aesthetically, and in my opinion that makes people want to read more. The aesthetic can help big time.</p>
<h3 id="leave-with-a-spark-making-it-interesting">Leave with a Spark✨. Making it Interesting</h3>
<p>Besides the ultimate goal of having a good storyline, having a common thread, a nice reading flow and outline is key to keep you, the reader, engaged. I like to jump a bit around.</p>
<p>That means cutting some topics short, moving on to something else, maybe coming back, maybe not, leaving the reader hanging, which makes it more interesting. The <strong>key of good writing</strong> is leaving out what needs to be left out. It&rsquo;s an art form, because I could ramble forever on this topic, as I&rsquo;m super passionate about it (as you might have noticed), but I always need to keep in mind not to bore you, and to give you new insights.</p>
<p>That&rsquo;s why I&rsquo;m switching now to making it interesting. Besides switching from topic to topic, I also like to go very deep into a topic, then zoom out to a very high level, only to go very deep again in the next sentence.</p>
<p>Zooming in and out helps the reader keep the overview while also learning something new. I usually don&rsquo;t spend too much time at the middle &ldquo;zoom level&rdquo;. That level is boring to me, as it&rsquo;s too vague: not detailed and concrete enough, and without enough overview to guide.</p>
<p>Switching all the time might feel a little <strong>like a rollercoaster</strong>, but rollercoasters are fun, and that is how I envision my articles. For me, writing and its process boil down to:</p>
<blockquote>
<p>[[Writing from The heart]] is the best. True, honest, and genuine human-to-human communication.</p>
</blockquote>
<h3 id="writing-from-curiosity">Writing from Curiosity</h3>
<p>My best writing comes from my own curiosity. I want to answer a question for myself, even better if I don&rsquo;t know the answer beforehand. Not knowing where I&rsquo;m heading to.</p>
<p>I usually set a title, and then go with the flow, see where it leads me—these are the best writings of mine. If I have to write about a certain topic, or outline, it couldn&rsquo;t be more boring, and that&rsquo;s usually reflected in my writing too.</p>
<p>The exception is when I know the space very well and can write a thought leadership piece, bringing together 20 years into one blog post. The challenge is again in nailing the storytelling to make twenty years coherent from start to end.</p>
<h3 id="my-writing-style">My Writing Style</h3>
<p>Finding my writing style and writing voice was very hard for me. But I think it is key to becoming a writer, especially a professional writer.</p>
<p>It takes time. What helped me for sure was reading many books, finding my favorite authors, and identifying their writing styles. This is how I found that I liked <a href="https://sive.rs/" target="_blank" rel="noopener noreffer">Derek Sivers</a>&rsquo; books. Initially, I didn&rsquo;t know why, until I found out more about his writing style and his personality in interviews, read more from him, and analyzed his work.</p>
<p>I found that I like his minimalistic style: scrapping each unneeded word, getting straight to the point, and providing value while inspiring with takes that you haven&rsquo;t already heard a hundred times. He writes <strong>genuinely</strong>. He also [[Journaling|journals]] 3-4 hours almost daily, thinking and brainstorming a lot in his head and second brain.</p>
<p>That&rsquo;s also what led me to journal and write in my second brain, experimenting like a physicist with different formulas and ideas to see what comes out. That&rsquo;s my <strong>second brain idea creation</strong>. I have two phases: idea creation and finishing. Again, the second brain is where I start with a one-liner, a note from a friend, a list of interesting things, or links to existing notes.</p>
<p>Later when distilling into a blog post, or sharing a public second brain note, I will tackle the deeper meaning, the connection with other ideas and areas of my life or things I&rsquo;m currently learning or have learned a long time ago.</p>
<p>I try not to force it. How many times have I tried to force it, only to go to bed early, wake up the next day, and have it just flow out of my fingers?</p>
<h3 id="writing-is-rewriting">Writing is Rewriting</h3>
<p>Most of it is also just <strong>rewriting</strong>. When you write something 3-4 times, when you sleep on it and your subconscious has worked on it while you walk, it always gets better.</p>
<p>It&rsquo;s a <strong>personal editing process</strong>. It&rsquo;s also part of a writing style. Jason Fried and Haruki Murakami, the author of <a href="https://www.goodreads.com/book/show/143361343-novelist-as-a-vocation" target="_blank" rel="noopener noreffer">Novelist as a Vocation</a> (an amazing book for writers), are constantly rewriting. Sometimes based on reader feedback, sometimes based on a [[Gut Feeling]].</p>
<blockquote>
<p>[!example] A short story from the book Novelist as a Vocation<br>
Haruki Murakami writes in his book that once he lost a manuscript of a chapter. He was devastated, but had no other choice but to rewrite it.</p>
<p>Years later he found the manuscript again and was afraid it would be better than the version he had handed in for his already published book. But the fear was unfounded: it was so much worse, he writes.</p>
</blockquote>
<h4 id="finding-your-voice-but-how">Finding Your Voice, but How?</h4>
<p>So, how do you find your voice?</p>
<p>After all the different ways I have written, and all I have read from other authors, I can say there is no one way. I can&rsquo;t tell you what yours will be, other than that you should start writing and trust the process.</p>
<p>For me, I defined my writing voice as personal and written in the first person. Something I have experienced, I can easily explain or write about. But making up a fake story, something they make you do in school, is something I was never good at.</p>
<p>I try to be <strong>authentic, friendly and succinct,</strong> with the goal of adding value to the readers.<br>
I try to give clues and tools that are concrete and extremely specific, but also to leave things out, because I&rsquo;m not an academic. I can only make suggestions based on what I learned; I cannot solve all the problems I write about. I also try not to overly copy others&rsquo; ideas, but to make them my own by connecting them with my own unique life experience, including hardship, daily struggle, and just life.</p>
<p>Key is also that you share in public. Keep writing until you find your voice. From there on, it will be much easier.</p>
<h3 id="how-i-found-my-writing-voice-through-the-english-language">How I Found My Writing Voice: Through the English Language</h3>
<p>I lived abroad for almost three years, learning Danish, but even more so English. And what happened there is something I would never have predicted.</p>
<p>The more English I learned (I wasn&rsquo;t fluent before), the more books I started to read. I noticed that there were so many more books that hadn&rsquo;t been available to me before, when I only read in German.</p>
<p>Also, I found out that I really liked the English language and its simpler grammar compared to High German, as we like to call it in Switzerland. I found that I can express myself much better and more precisely, as English has so many words for almost the same meaning that you can pick the one that describes exactly what you want to say. In German, by contrast, I felt I always needed to write a full novel to explain a simple thing very specifically. This might be good for fiction, but not for my technical writing, or for what I write here.</p>
<p>All of a sudden I was reading books in all my free time, listening to <a href="https://tim.blog/podcast/" target="_blank" rel="noopener noreffer">Tim Ferriss</a> and all the guests on his podcast, and learning every day. That was also the time when I went to the library in Copenhagen and started writing in English, a second language I was just about to learn properly (I had English in school before and could converse and make small talk). I wrote more about my journey in <a href="/blog/finding-my-pathless-path/" rel="">Finding My Pathless Path</a> if you are curious to know more.</p>
<h4 id="simple-english">Simple English</h4>
<p>As you know by now, and might have guessed from my grammatical errors here and there, English isn&rsquo;t my mother tongue.</p>
<p>For a long time I saw that as a disadvantage, but lately I figured that it might even be a strength of my writing. My somewhat limited English vocabulary and language skills lead to my articles being written in much <strong>simpler English</strong>.</p>
<p>And one thing I learned over the years: the more simply you can explain complex topics and make them approachable, the easier it is for the reader to follow along. It also makes the writer more approachable, less &ldquo;snobbish&rdquo; maybe?</p>
<p>And this is an advantage. My writing is much more approachable this way. It adds a natural constraint that potentially makes my writing process easier. I am not even aware of it while writing, but it shapes the way I write.</p>
<h2 id="the-writers-toolkit-and-the-tools-i-use">The Writer&rsquo;s Toolkit: And the Tools I Use</h2>
<p>Let&rsquo;s come to the last bigger part of this already long article, the tools and methods I use.</p>
<p>My main tool is writing in an open format, that is just [[Markdown]] and then using [[Obsidian]] as the editor to connect these simple notes together in a meaningful way.</p>













  
<figure><a target="_blank" href="/blog/why-i-still-blog/img_Todays%20Daily%20Graphs%20-%20Obsidian%20Graph_1771602123200.webp" title="">

</a><figcaption class="image-caption">My latest Obsidian graph with <code>9057</code> notes. Follow along with more on this <a href="https://x.com/sspaeti/status/2024872752100913586" target="_blank" rel="noopener noreffer">Tweet</a></figcaption>
</figure>
<h3 id="the-different-modes-of-my-writings-with-vim-motions">The Different Modes of My Writings with &lsquo;Vim Motions&rsquo;</h3>
<p>Apart from that, the way I write is with something called [[Vim Language (and Motions)|Vim Motions]]. I have written extensively about them, and you might think they don&rsquo;t matter much.</p>
<p>I hope, if you write online, or program for a living, that you learned touch typing at some point in your life. If you have, you&rsquo;d agree that it tremendously helped you with everything working on the computer, right? Not needing to see each key before you press.</p>
<p>Vim motions go a step further, essentially making each key on your keyboard a tool. In the default mode when opening vim, each key press performs a function. E.g. <code>g</code> is for jumping around: <code>gg</code> jumps to the top of the document and <code>ctrl + o</code> jumps back to where you were before. <code>G</code> jumps to the end of the document. <code>$</code> jumps to the end of a line. And so on; I could go on forever. But if you want to actually write something, you need to switch to &ldquo;insert mode&rdquo; with <code>i</code> (there&rsquo;s also <code>a</code> for append, or <code>o</code> to open a new line below). So you have different modes. As I wrote in [[Four Modes of Writing]], I have four modes with vim motions at all times:</p>
<ol>
<li>NORMAL mode: jumping around, reading, learning</li>
<li>INSERT mode: writing, thinking, making connections</li>
<li>VISUAL mode: copying, highlighting, formatting, designing</li>
<li>COMMAND mode: automating, fixing, macros.</li>
</ol>
<p>These vim motions, which are different from [[vim]] or [[Neovim]], the editors, are also available in Obsidian and almost any editor you know. Even Gmail has shortcuts like <code>j</code> to go down or <code>k</code> to go up, two common ways of navigating with vim motions. Even more, vim has a language, the <strong>vim language</strong>. This is super helpful, as you don&rsquo;t need to memorize thousands of commands by heart but can combine a few. Almost like Street Fighter, where you can do a combo.</p>
<p>With vim motions, which I write much more about in <a href="/blog/why-using-neovim-data-engineer-and-writer-2023/" rel="">Why Vim Is More than Just an Editor</a>, you can edit with the precision of a surgeon. If interested, also check out my video on [[Vim Motions for Writers]], where I made a timelapse of how that looks:</p>
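<p>The combo idea can be sketched in a few lines of Python. This is a heavily simplified model that assumes only an optional count, one operator, and one motion; real Vim has many more operators, motions, text objects, and special cases. The function name <code>describe</code> and the wording of the descriptions are made up for illustration.</p>

```python
import re

# A tiny subset of Vim's grammar: [count] operator motion.
OPERATORS = {"d": "delete", "c": "change", "y": "yank (copy)"}
MOTIONS = {
    "w": "to the start of the next word",
    "b": "back to the previous word start",
    "$": "to the end of the line",
    "gg": "to the top of the document",
    "G": "to the end of the document",
}
# Assumption of this sketch: only word motions accept a repeat count.
REPEATABLE = {"w", "b"}

COMMAND = re.compile(r"([1-9]\d*)?([dcy])(gg|[wb$G])")


def describe(command: str) -> str:
    """Translate a combo such as '3dw' into plain English."""
    match = COMMAND.fullmatch(command)
    if match is None:
        raise ValueError(f"unsupported command: {command!r}")
    count, operator, motion = match.groups()
    phrase = f"{OPERATORS[operator]} {MOTIONS[motion]}"
    if count is None:
        return phrase
    if motion not in REPEATABLE:
        raise ValueError(f"count not modeled for motion {motion!r}")
    return f"{phrase}, {count} times"


for combo in ("dw", "3dw", "y$", "dgg", "cG"):
    print(f"{combo:>4}  {describe(combo)}")
```

<p>Three operators and five motions already give fifteen combinations, which is why the vocabulary you need to memorize stays small while the number of things you can express keeps growing.</p>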
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/6kaOcYg0io8?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<blockquote>
<p>[!tip] A Trick you can use:  <a href="/brain/writing-within-the-app-vs.-a-note-app" rel="">Writing within the App vs. a Note app</a></p>
<p>Sometimes it&rsquo;s hard to write in your notes app or offline, but if you send an email to a friend, or write in a LinkedIn post or in your blog editor, the pressure is on. You know it&rsquo;s going to go live, or it&rsquo;s for a certain friend. This can help you unblock [[Writing is Hard|writer&rsquo;s block]] or produce better quality.</p>
</blockquote>
<h2 id="the-medium">The Medium</h2>
<p>There are different mediums out there to publish on, to write on, and to take notes with. When should you use which?</p>
<h3 id="the-medium-to-publish">The Medium to Publish</h3>
<p>Which medium, which platform do you use? Nowadays you have many options: Substack, Medium, Ghost, or other [[Open Subscription Platforms]].</p>
<p>I would always recommend having your own domain. If you like web design and tinkering a bit, even more now with [[Vibe Code Agents|AI Agents Tools]], you should start with a [[Static Site Generators (SSG)]].</p>
<p>These allow you to use Markdown as the format, own your content, keep your backlinks when switching platforms, and build domain ranking over time, leading to more authority when searching for a job or running your own business or side projects. Or also just for a hobby where you share learnings and topics of interest to you.</p>
<blockquote>
<p>[!example] Medium in Taking Notes<br>
I wrote much more on [[Digital vs Paper]], where I explain that I use both, and also share examples of what the process from physical paper notes to digital notes can look like.</p>
</blockquote>
<h3 id="tools-laptops-distraction-free-typewriter-unitaskers">Tools: Laptops, (distraction-free) Typewriter, Unitaskers!</h3>
<p>An important part is also to make it fun! I do that with different devices. Recently I even used the old typewriter of my grandfather. It showed me the power of [[Uni-taskers]].</p>
<p>The typewriter can only write, not like a laptop where you can surf the internet, or play games or get distracted by social media. Just typing ahead. So refreshing.</p>
<p>That&rsquo;s where I went down the rabbit hole of the [[Distract-Free Typewriter|Distraction-Free Typewriter]] and bought myself a small <a href="https://github.com/unkyulee/micro-journal" target="_blank" rel="noopener noreffer">Micro Journal</a>, a digital device solely for typing. Obviously it can connect to the internet and could do much more, as I installed Linux on it, but the resources are so limited that even [[Neovim]], which I remodeled into a <a href="https://wp.ssp.sh" target="_blank" rel="noopener noreffer">Wordprocessor</a>, struggles to open. So there&rsquo;s no danger of doing anything else.</p>
<p>Also, because of the limited screen real estate, it lets you really focus on the writing, and less on editing an article. So it&rsquo;s really a joy to use to exercise my [[Creative Writing]] vein. Just for the joy of writing.</p>













  
<figure><a target="_blank" href="/blog/why-i-still-blog/img_My%20Typewriter%20%28Hermes%202000%29_1763202607476.webp" title="">

</a><figcaption class="image-caption">Left is the Hermes 2000 from my Grandfather, and my distraction-free typewriter (Micro Journal Rev. 2), and one of my three unique keyboards I love (<a href="brain/kinesis-advantages-2-lubing-and-dampering" rel="">Kinesis Advantage 2</a> in this case)</figcaption>
</figure>
<h3 id="the-future-proof-format-markdown">The Future-Proof Format: Markdown</h3>
<p>As mentioned already, [[Markdown]] is my format of choice, especially after I was trapped in Microsoft OneNote and its proprietary format and couldn&rsquo;t get my own notes out of it. It was key to have something that will stand the test of time, and nothing does that better than [[Plain Text Files]] with Markdown.</p>
<p>I gave a full talk about this topic, <a href="https://www.youtube.com/watch?v=BOJFHMtyqNs" target="_blank" rel="noopener noreffer">Knowledge Management in the Digital Age: From Zettelkasten to Startup Owner</a>. Check it out if you want to know more about why Markdown, its advantages over [[Rich Text]], and how you can build a note-taking setup that works with Obsidian, and even lay the foundation for a solo business like mine.</p>
<p>Markdown has many more advantages. It has also proven to work well with AI agents. And I can share my public second brain in Markdown with [[Quartz - Publish Obsidian Vault|Quartz]], with no conversion and no manual copying of notes between rich-text editors and the website.</p>
<p>Markdown lets you simply write, with minimal formatting sugar. The formatting lives as part of the text, so copy-pasting doesn&rsquo;t lose the links, bold, or italics that we put there for a reason.</p>
<p>Markdown is declarative, meaning you can automate things: the same text can be presented by different engines. For example, I use [[HackMD]] for collaboration (Google Docs for Markdown), and I use Markdown to publish on my website. It&rsquo;s the same file and the same format, with no conversion needed.</p>
<p>Compare this to your typical Google Docs, WordPress, Webflow, or other [[Open Subscription Platforms]] such as Substack and Medium. These tend to enforce constraints: you always need to copy your text back and forth, creating copies and potentially losing important formatting.</p>
<p>The other big advantage is that Markdown files are just [[Plain Text Files|Plaintext Files]], meaning we own the files. No big tech or other company can deny us access or take them away, they <strong>work offline</strong> when we don&rsquo;t have internet, and they are <strong>super fast</strong>, as they&rsquo;re just tiny files stored locally with no round trips to a server.</p>
<blockquote>
<p>[!info] My Note-Taking Path from forgetting everything to Obsidian with Vim and Quartz<br>
My path so far with note-taking:</p>
<ol>
<li>Forgetting everything</li>
<li>Taking scattered and very detailed notes on multiple devices, apps, and paper</li>
<li>Improving during my studies with OneNote, where notes related to work or study go into separate notebooks (no personal notes yet).</li>
<li>Starting to create a personal notebook for travels, personal research, etc.</li>
<li>There is still a lot of confusion about:<br>
1. where to store my notes<br>
2. changing the folder structure<br>
3. finding older notes is complex and rarely happens</li>
<li>Switching to <strong>[[Obsidian]]</strong> with a new open format and a different spirit and capabilities.</li>
<li>Starting my <strong>[[Second Brain]]</strong><br>
1. Constantly updating my long-time wealth of personal knowledge by adding notes about my health, journals, cooking, books I read, and everything related to my life.<br>
2. I started to connect notes and refine my system so that I can confidently find them later in life, the moment I need them.</li>
<li>Start using [[Vim]] and, more importantly, its <strong>[[Vim Language (and Motions)|motions]]</strong> for fast and effortless note-taking.</li>
<li>Sharing them publicly with [[Quartz - Publish Obsidian Vault|Quartz]].</li>
</ol>
<p>Find the full breakdown on <a href="/blog/obsidian-note-taking-workflow/" rel="">My Obsidian Note-Taking Workflow</a>.</p>
</blockquote>
<h2 id="wrapping-up">Wrapping up</h2>
<p>I wanted to write much more about the <strong>art of writing</strong>: how to [[Writing|write]], [[How to Write Well]], and the technique behind it. But as this article is already very long, I will save that for another distilled blog post in the future. You can already follow the links above, where I wrote a lot, just not in this distilled blog format.</p>
<p>Maybe one day I will write a book about it, I have so much more to say and tell 🙂. What do you think, would you read it? 🙃</p>
<p>Now I want to leave you with some <strong>unexpected impacts</strong> that writing had on me, and how to get started with blogging:</p>
<ul>
<li>My articles and website focus on data engineering, but my most successful posts (in views and virality) are topics about [[Obsidian]], [[Vim Motions for Writers|vim]], and philosophy</li>
<li>That was a surprise, but now I get it: these were topics close to my heart, something I poured many years into and distilled into a single piece of writing</li>
<li>Career impact: I got higher salaries because I was considered known &ldquo;world-wide&rdquo; through my blog</li>
<li>People know my writing, and they tend to like you when you give your writing away for free</li>
<li>When I started my own company, I basically didn&rsquo;t have to sell — being online for so long, people know my writing, my principles, and even my life through my [[Second Brain]]</li>
<li>The total set of articles is what matters: when people come back to you, that&rsquo;s the [[compounding]] effect</li>
</ul>
<blockquote>
<p>[!question]  The Elephant in the Room: AI<br>
You might ask, but what about AI content, isn&rsquo;t that a reason to not start to write? I&rsquo;d say no for the same reason writing was good before AI and before books, and before time.<br>
It&rsquo;s good for yourself, to calm down, let out all your thoughts. Distill, learn and also remember things that are very important to you.</p>
<p>[[Writing Manually]], as I like to call it, is also my joy of writing - if you enjoy it, I enjoy it. It&rsquo;s hard, it&rsquo;s a challenge, it&rsquo;s not easy. I get great satisfaction. Like chess, computers are much better, but we still play chess. And also for the love of words and communication, <strong>writing is communication</strong>. And to prevent more &ldquo;AI Slop&rdquo; from being created.</p>
<p>Read much more on [[Will AI replace Humans]], where I share my latest on that topic.</p>
</blockquote>
<p>I have so much more to say, and a lot of it is in my second brain, so feel free to browse and explore my <a href="/brain" rel="">second brain</a>. Use the backlinks and the graph to explore more. I even added a <a href="https://explore.ssp.sh" target="_blank" rel="noopener noreffer">semantic search</a>, so you can find hidden connections in my public brain, on topics that might interest you, or on this very topic you just read.</p>
<p>A related book recommendation that helped me a ton (I share more on <a href="/books" rel="">Book Recommendations and Notes</a>) is <a href="https://www.goodreads.com/book/show/60135094-the-pathless-path" target="_blank" rel="noopener noreffer">The Pathless Path</a> by Paul Millerd. I also wrote about <a href="https://pathless.ssp.sh/" target="_blank" rel="noopener noreffer">finding mine</a>. It&rsquo;s like Tim Ferriss&rsquo;s 4-Hour Workweek, updated for today, and it inspired me to take the step of going full-time as a <a href="https://www.ssp.sh/services" target="_blank" rel="noopener noreffer">freelance technical writer</a> and making writing my job.</p>
<p>What if you want to get started yourself? Read my interview on &ldquo;Write that Blog&rdquo; where I share more suggestions about starting your own blog. But generally, [[writing is hard]], just get started.</p>
<p>Have a note taking app (one only, based on an open format, available offline too), and write things down. Save important moments in your life, blessings of people telling you, insights from books you read, etc.</p>
<p>Follow the mantra of [[Learn in Public|Learning in Public]]. Use the feedback loop, let&rsquo;s share and learn together. And know that over time, all your notes and personal knowledge will compound, like money does if you invest it cleverly.</p>
<blockquote>
<p>[!note] Maybe the easiest way to get started: just an email converted to a blog</p>
<p><a href="https://www.hey.com/world/" target="_blank" rel="noopener noreffer">Hey World</a> has this feature integrated into their email — you can write an email as you would normally, just a different recipient, and if sent, it will be online as a normal blog post. A very easy way to get started.</p>
</blockquote>
<p>I&rsquo;m leaving some links to <strong>tips and tricks for unblocking</strong> and finding a good rhythm.</p>
<ul>
<li>[[Coffee Break Rhythm]]: Using location pressure as a productivity forcing function. For me, in summer, it&rsquo;s moving every 1.5-2 hours from coffee shop to coffee shop, creating pressure to finish up before leaving. It tricks the brain out of thinking &ldquo;Oh, I still have all day&rdquo; and then procrastinating. That&rsquo;s the opposite of [[Productive Procrastination]], where we let go on purpose to get insights we wouldn&rsquo;t have gotten otherwise.</li>
<li>Use the doubts, the signs that you can&rsquo;t write today. Take a walk instead; best would be to walk every day. I call it [[Productive Procrastination]]. Many see procrastination as a bad thing, but it&rsquo;s unavoidable, and I, like others, believe it&rsquo;s our body, our gut, telling us something. Tim Urban from <a href="https://waitbutwhy.com/" target="_blank" rel="noopener noreffer">Wait But Why</a>, for example, is also big on procrastination and says the same: he sometimes hates procrastinating, but that&rsquo;s how his brain works, and it&rsquo;s where he gets insights he wouldn&rsquo;t have gotten otherwise.</li>
<li>[[Ultradian Rhythm]]: Know that the first 10 minutes of a 90-minute deep work block are always going to be hard. That&rsquo;s okay. Remind yourself of this when starting is hard.</li>
<li>Embrace the <a href="https://www.ssp.sh/blog/owning-things-attention/" target="_blank" rel="noopener noreffer">New Luxury of Boredom</a>: I feel we are at a turning point. We, the people, want to own things, want distraction-free experiences, and above all, want tools that benefit us, not the pockets of large companies. There are more and more stories of people <a href="https://www.youtube.com/watch?v=c3oXoF9XW_Q&amp;ref=ssp.sh" target="_blank" rel="noopener noreffer">using old iPods</a> for music and buying their music, typing on a typewriter solely for writing (like I do with my <a href="https://ssp.sh/brain/distract-free-typewriter/" target="_blank" rel="noopener noreffer">distraction-free typewriter</a>), <a href="https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0?ref=ssp.sh" target="_blank" rel="noopener noreffer">leaving the cloud</a>, or just using <a href="https://ssp.sh/brain/local-first/" target="_blank" rel="noopener noreffer">local first</a> products like Obsidian or DuckDB: devices and tools that are <a href="https://ssp.sh/brain/uni-taskers/" target="_blank" rel="noopener noreffer">uni-taskers</a>, doing one thing well.</li>
</ul>
]]></description>
</item>
<item>
    <title>Git for Data Applied: Comparing Git-like Tools That Separate Metadata from Data</title>
    <link>https://www.ssp.sh/blog/git-for-data-tools/</link>
    <pubDate>Wed, 04 Mar 2026 00:08:08 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/git-for-data-tools/</guid><enclosure url="https://www.ssp.sh/blog/git-for-data-tools/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>This post continues from <a href="/blog/git-for-data-theory" rel="">Part 1</a>, where we learned what git for data is, how the architecture and use cases work, and how you can achieve git-like functionality with different approaches. The key is to avoid moving data as much as possible: keep state that can be referenced and rolled back to, while saving cost by not duplicating all data every time you create a new branch.</p>
<p>Now it&rsquo;s time to see what Git-like tools for data are out there, and how they actually work in practice. Part 2 dives into the tools and implementations. We&rsquo;ll examine LakeFS, Dolt, Nessie, MotherDuck, Bauplan, and more, exploring how they work under the hood. Each tool takes a different approach to the same fundamental challenge: enabling Git-like workflows without copying petabytes of data.</p>
<p>The key insight from Part 1 was that all these tools separate metadata from data, using techniques like copy-on-write and pointer manipulation. But the devil is in the details. Some tools version entire data lakes, others focus on databases. Some support full merge workflows, others prioritize instant forking. Understanding these trade-offs will help you choose the right solution for your stack.</p>
<p>There will be gaps, and implementations are changing fast, so take it with a grain of salt. But this should give you a good overview of what&rsquo;s out there, and help you invest more time in the ones that fit your use case best.</p>
<p>Let&rsquo;s get into it.</p>
<h2 id="git-like-tools-overview">Git-like Tools: Overview</h2>
<p>There are many tools out there. Some have been in use for years, others are rather new. We compare them and see what each of them has to offer.</p>
<h3 id="comparison-overview">Comparison Overview</h3>
<p>The overview below serves as a summary. We will go into more detail, with each tool getting one short chapter with a showcase of features and application use cases.</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Storage Type</th>
          <th>Primary Use Case</th>
          <th>Branching</th>
          <th>Cloning</th>
          <th>Merging</th>
          <th>Snapshot/Time Travel</th>
          <th>Rollback</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/treeverse/lakeFS" target="_blank" rel="noopener noreffer"><strong>LakeFS</strong></a></td>
          <td>Data Lake</td>
          <td>Version control for data lakes</td>
          <td>Full</td>
          <td>Via branching (zero-copy)</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><a href="https://github.com/dolthub/dolt" target="_blank" rel="noopener noreffer"><strong>Dolt</strong></a></td>
          <td>Database (SQL)</td>
          <td>Versioned SQL database</td>
          <td>Full</td>
          <td>Yes (copy-on-write)</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><a href="https://github.com/projectnessie/nessie" target="_blank" rel="noopener noreffer"><strong>Nessie</strong></a></td>
          <td>Data Lake</td>
          <td>Catalog-level versioning</td>
          <td>Full</td>
          <td>Yes (zero-copy)</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><a href="https://www.bauplanlabs.com" target="_blank" rel="noopener noreffer"><strong>Bauplan</strong></a></td>
          <td>Data Lake</td>
          <td>Versioned pipelines</td>
          <td>Data-level</td>
          <td>Yes (zero-copy)</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><a href="https://motherduck.com" target="_blank" rel="noopener noreffer"><strong>MotherDuck</strong></a></td>
          <td>Data Warehouse</td>
          <td>Serverless data warehouse</td>
          <td>No branching</td>
          <td>Zero-copy clones (differential storage)</td>
          <td>No</td>
          <td>Configurable (named snapshots indefinitely)</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><a href="https://github.com/duckdb/ducklake" target="_blank" rel="noopener noreffer"><strong>DuckLake</strong></a></td>
          <td>Data Lake</td>
          <td>SQL-native lakehouse</td>
          <td>No</td>
          <td>Via snapshots (zero-copy)</td>
          <td>No</td>
          <td>Yes (unlimited snapshots)</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><a href="https://github.com/neondatabase/neon" target="_blank" rel="noopener noreffer"><strong>Neon</strong></a></td>
          <td>Database (SQL)</td>
          <td>Branching SQL database</td>
          <td>Full</td>
          <td>Yes (copy-on-write)</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<p><em>It&rsquo;s by no means complete, but it shows the most dominant players.</em></p>
<p>Further analysis of the OSS ecosystem of git for data tools and their GitHub activity tells us how healthy the repos are, as of February 2026:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th style="text-align: right">Stars</th>
          <th style="text-align: right">Forks</th>
          <th style="text-align: right">Open Issues</th>
          <th style="text-align: right">Contributors</th>
          <th>Language</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/neondatabase/neon" target="_blank" rel="noopener noreffer">Neon</a></td>
          <td style="text-align: right">21,006</td>
          <td style="text-align: right">890</td>
          <td style="text-align: right">3,040</td>
          <td style="text-align: right">159</td>
          <td>Rust</td>
      </tr>
      <tr>
          <td><a href="https://github.com/dolthub/dolt" target="_blank" rel="noopener noreffer">Dolt</a></td>
          <td style="text-align: right">19,692</td>
          <td style="text-align: right">615</td>
          <td style="text-align: right">490</td>
          <td style="text-align: right">125</td>
          <td>Go</td>
      </tr>
      <tr>
          <td><a href="https://github.com/treeverse/lakeFS" target="_blank" rel="noopener noreffer">lakeFS</a></td>
          <td style="text-align: right">5,130</td>
          <td style="text-align: right">427</td>
          <td style="text-align: right">438</td>
          <td style="text-align: right">114</td>
          <td>Go</td>
      </tr>
      <tr>
          <td><a href="https://github.com/duckdb/ducklake" target="_blank" rel="noopener noreffer">DuckLake</a></td>
          <td style="text-align: right">2,438</td>
          <td style="text-align: right">140</td>
          <td style="text-align: right">79</td>
          <td style="text-align: right">35</td>
          <td>C++</td>
      </tr>
      <tr>
          <td><a href="https://github.com/projectnessie/nessie" target="_blank" rel="noopener noreffer">Nessie</a></td>
          <td style="text-align: right">1,406</td>
          <td style="text-align: right">171</td>
          <td style="text-align: right">156</td>
          <td style="text-align: right">159</td>
          <td>Java</td>
      </tr>
  </tbody>
</table>
<p>And community responsiveness based on <a href="https://ossinsight.io" target="_blank" rel="noopener noreffer">ossinsight.io</a> for the latest available month. Click the links below for deeper insight into each repository:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th style="text-align: right">PR Merge Time (p50)</th>
          <th style="text-align: right">Issue First Response (p50)</th>
          <th style="text-align: right">Total Commits</th>
          <th style="text-align: right">Total PR Creators</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://ossinsight.io/analyze/neondatabase/neon" target="_blank" rel="noopener noreffer">Neon</a></td>
          <td style="text-align: right">-</td>
          <td style="text-align: right">-</td>
          <td style="text-align: right">71,756</td>
          <td style="text-align: right">100</td>
      </tr>
      <tr>
          <td><a href="https://ossinsight.io/analyze/dolthub/dolt" target="_blank" rel="noopener noreffer">Dolt</a></td>
          <td style="text-align: right">~0.5 hours</td>
          <td style="text-align: right">~40 hours</td>
          <td style="text-align: right">31,807</td>
          <td style="text-align: right">99</td>
      </tr>
      <tr>
          <td><a href="https://ossinsight.io/analyze/treeverse/lakeFS" target="_blank" rel="noopener noreffer">lakeFS</a></td>
          <td style="text-align: right">~6 hours</td>
          <td style="text-align: right">~23 hours</td>
          <td style="text-align: right">24,956</td>
          <td style="text-align: right">178</td>
      </tr>
      <tr>
          <td><a href="https://ossinsight.io/analyze/duckdb/ducklake" target="_blank" rel="noopener noreffer">DuckLake</a></td>
          <td style="text-align: right">~45 hours</td>
          <td style="text-align: right">~55 hours</td>
          <td style="text-align: right">351</td>
          <td style="text-align: right">27</td>
      </tr>
      <tr>
          <td><a href="https://ossinsight.io/analyze/projectnessie/nessie#overview" target="_blank" rel="noopener noreffer">Nessie</a></td>
          <td style="text-align: right">~750 hours</td>
          <td style="text-align: right">&lt;1 hour (bot-triaged)</td>
          <td style="text-align: right">13,464</td>
          <td style="text-align: right">77</td>
      </tr>
  </tbody>
</table>
<p><em>Note: All data from GitHub API, Feb 2026. See also <a href="https://www.star-history.com/#treeverse/lakeFS&amp;dolthub/dolt&amp;projectnessie/nessie&amp;duckdb/ducklake&amp;tigrisdata/tigris&amp;neondatabase/neon&amp;type=date&amp;legend=top-left" target="_blank" rel="noopener noreffer">GitHub Star History</a>.</em></p>
<p>Dolt stands out with the fastest PR merge times (~30 min median). lakeFS leads in total PR creators (178), reflecting a broad contributor base. Nessie&rsquo;s near-instant issue response reflects automated triage.</p>
<blockquote>
<p>[!note] How Do They Work?</p>
<p>While Git versions code through file snapshots and diffs, data tools must handle actual data, if possible, without copying entire datasets. Each tool solves this challenge differently, but they share a common approach: <strong>separating metadata from data</strong>.</p>
<p>Instead of duplicating data, they track pointers and references, enabling instant branching/cloning and zero-copy operations.</p>
<p>What usually happens without tools like this: <a href="https://www.youtube.com/watch?v=z-ATZTUgaAo" target="_blank" rel="noopener noreffer">Testing in Production</a></p>
<p>Find more insight about the architecture and behind the scenes in Part 1, <a href="/blog/git-for-data-theory" rel="">Branch, Test, Deploy: A Git-Inspired Approach for Data</a>.</p>
</blockquote>
<h2 id="git-like-tools-break-down">Git-like Tools: Break down</h2>
<p>Let&rsquo;s get started with the tools and see their features and how they work, grouped into three categories: data lake based, transactional and relational databases, and analytical databases.</p>
<h3 id="data-lake-versioning-object-storage">Data Lake Versioning (Object Storage)</h3>
<p>Data lake versioning tools sit between the compute engine and the object storage (S3, GCS, Azure Blob), leaving you free to query with whatever engine you prefer: Trino, Spark, DuckDB, etc.</p>
<h4 id="lakefs">LakeFS</h4>
<p>LakeFS is one of the first tools to bring git-like versioning to object-storage-based data lakes, hence &ldquo;lake&rdquo; as part of the name. Its core approach is a metadata layer on top of the object store, with immutable data and logical-to-physical address mapping.</p>
<p>It separates the data under <code>data/</code>, stored at random physical addresses, from its metadata under <code>_lakefs/</code>, which includes range files, meta-range files, and commit information.</p>
<p>When you upload <code>allstar_games_stats.csv</code> to branch <code>main</code>, lakeFS generates a random physical address like <code>s3://bucket/data/gp0n1l7d77pn0cke6jjg/cg6p50nd77pn0cke6jk0</code>. This ensures immutability: files are never overwritten.</p>
<p>LakeFS operates as an S3-compatible gateway, intercepting read/write operations and managing versioning transparently. Applications interact with it like normal object storage while getting full Git semantics underneath.</p>
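<p>To make the logical-to-physical mapping more tangible, here is a minimal toy sketch in Python. It is not lakeFS code: the class <code>ToyVersionedStore</code> and its methods are made up for illustration, and a plain dictionary stands in for the range and meta-range files. It only shows the idea that writes always land on a fresh random address and that a branch is a set of pointers.</p>

```python
import uuid


class ToyVersionedStore:
    """Toy model (hypothetical, not lakeFS): immutable objects plus per-branch pointer maps."""

    def __init__(self):
        self.objects = {}             # physical address -> bytes, never overwritten
        self.branches = {"main": {}}  # branch -> {logical path: physical address}

    def upload(self, branch, path, content):
        # Every write gets a fresh random address, so existing objects stay untouched.
        address = f"data/{uuid.uuid4().hex}"
        self.objects[address] = content
        self.branches[branch][path] = address
        return address

    def create_branch(self, name, source):
        # Branching copies pointers only; no object is duplicated.
        self.branches[name] = dict(self.branches[source])

    def read(self, branch, path):
        return self.objects[self.branches[branch][path]]


store = ToyVersionedStore()
store.upload("main", "allstar_games_stats.csv", b"v1")
store.create_branch("denmark-lakes", source="main")
store.upload("denmark-lakes", "allstar_games_stats.csv", b"v2")

print(store.read("main", "allstar_games_stats.csv"))           # b'v1'
print(store.read("denmark-lakes", "allstar_games_stats.csv"))  # b'v2'
print(len(store.objects))                                      # 2
```

<p>Writing to the branch leaves <code>main</code> untouched, and the store holds two objects in total: the original and the one changed version, not a full copy per branch.</p>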
<p>The system implements a layered architecture:</p>
<ol>
<li><strong>Graveler</strong>: Core versioning engine managing branches, commits, and merges</li>
<li><strong>Storage Adapter</strong>: Interfaces with S3/GCS/Azure</li>
<li><strong>Hooks</strong>: Pre-merge and post-commit validation</li>
</ol>













  
<figure><a target="_blank" href="/blog/git-for-data-tools/lakefs-architecture.webp" title="">

</a><figcaption class="image-caption">LakeFS <a href="https://docs.lakefs.io/latest/understand/architecture/" target="_blank" rel="noopener noreffer">Architecture</a> overview</figcaption>
</figure>
<p>Creating a branch from the CLI is as simple as this:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">lakectl branch create lakefs://quickstart/denmark-lakes --source lakefs://quickstart/main
</span></span></code></pre></td></tr></table>
</div>
</div><p>The UI supports creating branches and pull requests, much like GitHub but for data.<br>













  
<figure><a target="_blank" href="/blog/git-for-data-tools/lakefs-pr.webp" title="">

</a><figcaption class="image-caption">LakeFS interface, here with an example of a <a href="https://docs.lakefs.io/latest/howto/pull-requests/" target="_blank" rel="noopener noreffer">pull request</a></figcaption>
</figure></p>
<p>Check out their <a href="https://github.com/treeverse/lakeFS" target="_blank" rel="noopener noreffer">GitHub repo</a>, <a href="https://docs.lakefs.io/" target="_blank" rel="noopener noreffer">documentation</a>, or a practical example of <a href="https://lakefs.io/blog/write-audit-publish-with-lakefs/" target="_blank" rel="noopener noreffer">Implementing a Write-Audit-Publish (WAP) Pattern</a> for much more information.</p>
<h4 id="nessie">Nessie</h4>
<p><a href="https://github.com/projectnessie/nessie" target="_blank" rel="noopener noreffer">Nessie</a> came out of Dremio and is another early adopter that has been doing this for a long time. Its core approach is a transactional catalog with Git-like versioning for Apache Iceberg and Delta Lake tables.</p>
<p>Rather than versioning data files, Nessie versions the <strong>catalog metadata</strong>, the registry of tables and their locations.</p>
<p>This separation enables <strong>zero-copy branching</strong> where branches share table metadata pointers, <strong>multi-table transactions</strong> with atomic commits across multiple tables, and <strong>Git semantics</strong> such as branch, tag, merge, and cherry-pick operations.</p>
<p>Nessie leverages the immutability of modern table formats such as Iceberg:</p>
<ol>
<li><strong>Iceberg snapshots are immutable</strong>: Each table change creates new metadata.</li>
<li><strong>Nessie tracks which snapshot</strong> each branch points to.</li>
<li><strong>Branching copies pointers</strong>, not data or metadata files.</li>
<li><strong>Merging updates pointers</strong> to replay changes from source to target.</li>
</ol>
<p>Example workflow:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Create branch</span>
</span></span><span class="line"><span class="cl"><span class="n">catalog</span><span class="o">.</span><span class="n">create_branch</span><span class="p">(</span><span class="s1">&#39;experiment&#39;</span><span class="p">,</span> <span class="s1">&#39;main&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Modify table on experiment branch</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&#34;INSERT INTO catalog.experiment.orders VALUES (...)&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># This creates new Iceberg snapshot, Nessie updates experiment pointer</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Main branch unchanged - still points to original snapshot</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&#34;SELECT * FROM catalog.main.orders&#34;</span><span class="p">)</span>  <span class="c1"># Original data</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Nessie runs as a REST service with pluggable components: metadata storage in backends such as PostgreSQL, DynamoDB, or RocksDB; data lake integration with any Iceberg-compatible engine (Spark, Trino, Dremio); and version control through a Git-like commit graph with branches and tags.</p>
<p>Nessie doesn&rsquo;t touch your data files. It&rsquo;s a lightweight coordination layer that brings Git semantics to your lakehouse by versioning the catalog. This makes it complementary to tools like lakeFS (which versions data) and ideal for multi-table transactional workflows. Read more on <a href="https://github.com/projectnessie/nessie" target="_blank" rel="noopener noreffer">GitHub</a>.</p>
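<p>The idea of versioning the catalog instead of the data can be sketched in a few lines of Python. This is a toy model under my own assumptions, not the Nessie API: <code>ToyCatalog</code>, <code>commit</code>, and <code>snapshot_of</code> are hypothetical names, and the snapshot ids are plain strings. It shows that a branch is just a pointer to a commit, and that one commit can move several tables at once.</p>

```python
class ToyCatalog:
    """Toy model of catalog-level versioning (hypothetical, not the Nessie API)."""

    def __init__(self):
        # Each commit is an immutable mapping: table name -> snapshot id.
        self.commits = [{}]
        self.branches = {"main": 0}  # branch name -> index of its head commit

    def create_branch(self, name, source):
        # A new branch is just another pointer to the same commit.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        # All tables in `changes` become visible together (multi-table commit).
        head = self.commits[self.branches[branch]]
        self.commits.append({**head, **changes})
        self.branches[branch] = len(self.commits) - 1

    def snapshot_of(self, branch, table):
        return self.commits[self.branches[branch]][table]


catalog = ToyCatalog()
catalog.commit("main", {"orders": "snap-1", "customers": "snap-1"})
catalog.create_branch("experiment", source="main")
catalog.commit("experiment", {"orders": "snap-2", "customers": "snap-2"})

print(catalog.snapshot_of("main", "orders"))        # snap-1
print(catalog.snapshot_of("experiment", "orders"))  # snap-2
```

<p>No table data appears anywhere in this sketch, which is the point: only the mapping from table names to snapshots is versioned.</p>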
<h4 id="bauplan">Bauplan</h4>
<p>Similar to LakeFS, Bauplan is built for the data lake. It calls itself the programmable data lake: a code-native platform for versioned pipelines, built on Apache Iceberg and initially optimized for ML. It&rsquo;s a rather new, Python-first serverless lakehouse, and it&rsquo;s not open source.</p>
<p>Bauplan treats your data lake as a Git repository where:</p>
<ul>
<li><strong>Data branches</strong> are first-class citizens, not just pipeline configs.</li>
<li>Every pipeline execution is a commit with full lineage.</li>
<li>All tables use Apache Iceberg format (Delta Lake compatible).</li>
</ul>













  
<figure><a target="_blank" href="/blog/git-for-data-tools/bauplan2.webp" title="">

</a><figcaption class="image-caption">Architectural overview from <a href="https://www.bauplanlabs.com/" target="_blank" rel="noopener noreffer">Bauplan Website</a></figcaption>
</figure>
<p>Creating an isolated branch with new snapshots of Iceberg tables from the Python client is as simple as this:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">client</span><span class="o">.</span><span class="n">create_branch</span><span class="p">(</span><span class="s1">&#39;experiment&#39;</span><span class="p">)</span>  <span class="c1"># Instant, zero data copying</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>It supports merging, with the merge logic verified using <a href="https://alloytools.org/" target="_blank" rel="noopener noreffer">Alloy</a> model checking:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">client</span><span class="o">.</span><span class="n">merge_branch</span><span class="p">(</span><span class="n">source</span><span class="o">=</span><span class="s1">&#39;experiment&#39;</span><span class="p">,</span> <span class="n">target</span><span class="o">=</span><span class="s1">&#39;main&#39;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>A merge integrates a commit&rsquo;s changes into another branch. Bauplan uses Alloy, a lightweight model checker, to stress-test the core logic behind merging, as well as branching and commits.</p>
<p>The merge operation tries to detect conflicts at the table level, performs three-way merges for compatible changes, and creates merge commits preserving lineage. Find more info on <a href="https://www.bauplanlabs.com/post/git-for-data-formal-semantics-of-branching-merging-and-rollbacks-part-1" target="_blank" rel="noopener noreffer">Git-for-Data Semantics: Safe Branching &amp; Merging at Scale</a> or their implementation of the <a href="https://www.bauplanlabs.com/post/write-audit-publish-ship-data-safely-move-faster" target="_blank" rel="noopener noreffer">WAP pattern</a>.</p>
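<p>To make the table-level three-way merge more concrete, here is a minimal sketch in plain Python. It is not Bauplan&rsquo;s API: the <code>three_way_merge</code> function, the <code>State</code> type, and the snapshot IDs are all hypothetical, and I assume a branch can be modeled as a mapping from table name to snapshot ID.</p>

```python
from typing import Optional

# A branch state maps a table name to the ID of that table's current snapshot.
# All names here are hypothetical; this is not Bauplan's API.
State = dict[str, Optional[str]]


def three_way_merge(base: State, source: State, target: State) -> tuple[State, list[str]]:
    """Merge `source` into `target` relative to their common ancestor `base`.

    Returns the merged state and the names of conflicting tables.
    Conflicting tables are left out of the merged state.
    """
    merged: State = {}
    conflicts: list[str] = []
    for table in sorted(set(base) | set(source) | set(target)):
        b, s, t = base.get(table), source.get(table), target.get(table)
        if s == t:        # both sides agree, or neither touched the table
            result = s
        elif s == b:      # only the target branch changed the table
            result = t
        elif t == b:      # only the source branch changed the table
            result = s
        else:             # both changed it differently: table-level conflict
            conflicts.append(table)
            continue
        if result is not None:  # None means the table was dropped
            merged[table] = result
    return merged, conflicts


base = {"orders": "snap-1", "users": "snap-1"}
source = {"orders": "snap-2", "users": "snap-1", "events": "snap-9"}
target = {"orders": "snap-1", "users": "snap-3"}

merged, conflicts = three_way_merge(base, source, target)
print(merged)
print(conflicts)
```

<p>In this toy run each branch touched different tables, so the merge takes <code>orders</code> and <code>events</code> from the source, keeps <code>users</code> from the target, and reports no conflicts. Had both branches written a new snapshot of the same table, that table would land in the conflict list instead.</p>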
<p>Bauplan brings Git&rsquo;s full semantic model with branch, merge, commit, and revert to lakehouse data while maintaining compatibility with standard Iceberg tables accessible from MotherDuck, Snowflake, Databricks, or Trino.</p>
<blockquote>
<p>[!tip] Software Modeling with Alloy</p>
<p>I hadn&rsquo;t heard of Alloy before. It&rsquo;s a tool for modeling software rather than data, used for a wide range of applications from finding holes in security mechanisms to designing telephone switching networks. And now for Git for data with Bauplan.</p>
</blockquote>
<blockquote>
<p>[!note] New Whitepaper Out</p>
<p>After this article was written, Bauplan released a new whitepaper on <a href="https://arxiv.org/pdf/2602.02335" target="_blank" rel="noopener noreffer">Building a Correct-by-Design Lakehouse</a> that examines pipeline boundaries with Git-like data versioning for review and reproducibility, and transactional runs that guarantee pipeline-level atomicity.</p>
</blockquote>
<h3 id="transactional-and-oltp-databases">Transactional and OLTP Databases</h3>
<p>These are row-oriented, ACID-compliant databases where Git-like versioning applies mostly to application data such as user records, orders, and schemas.</p>
<p>Supabase, Neon and Dolt are interesting because these are not data lakes, not based on object storage, and not analytical databases, but relational databases.</p>
<h4 id="supabase">Supabase</h4>
<p><a href="https://supabase.com/docs" target="_blank" rel="noopener noreffer">Supabase</a>&rsquo;s core approach is full instance branching. Each branch is a completely isolated Postgres database with the entire Supabase stack (Auth, Storage, Realtime, Edge Functions).</p>
<p>Supabase branches create <strong>separate environments</strong> that spin off from your main project, allowing you to test changes like new configurations, database schemas, or features without affecting production.</p>
<p>It works by creating a Git branch and opening a pull request. Supabase automatically launches a Preview Branch and runs migrations from the repository&rsquo;s migrations directory. Each branch gets a dedicated Postgres instance with a unique connection string and APIs, isolating them from production and other branches.</p>
<p>Creating a branch via GitHub integration:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Automatic with GitHub integration enabled</span>
</span></span><span class="line"><span class="cl">git checkout -b feature/new-reports
</span></span><span class="line"><span class="cl">git push origin feature/new-reports
</span></span><span class="line"><span class="cl"><span class="c1"># Supabase automatically creates preview branch when PR is opened</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Or via the CLI:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">supabase branches create feature-branch --project-ref your-project
</span></span></code></pre></td></tr></table>
</div>
</div><p>While the PR is open, migrations in the repository&rsquo;s migrations folder run incrementally on each commit, allowing you to verify schema changes on existing seed data. When you merge the PR, those migrations automatically apply to production.</p>
<p>As each branch is a new Postgres instance created from scratch, the approach is conceptually simple, but branches start empty: production data isn&rsquo;t copied, so each branch must be seeded with test data. Each branch also incurs its own compute and storage costs. Read more on <a href="https://supabase.com/docs/guides/deployment/branching" target="_blank" rel="noopener noreffer">Branching Supabase Docs</a>.</p>
<p>Ideal for full-stack development where you need the entire backend stack (database + auth + storage + functions) to test features end-to-end.</p>
<h4 id="neon">Neon</h4>
<p><a href="https://neon.com/docs/" target="_blank" rel="noopener noreffer">Neon</a> is a serverless Postgres platform (now part of Databricks) whose core approach is <strong>copy-on-write storage-level branching</strong>. Unlike Supabase, which spins up a full new instance, Neon <a href="https://neon.com/docs/introduction/branching" target="_blank" rel="noopener noreffer">branches</a> at the storage layer, so branches are created instantly regardless of database size and include the actual data.</p>
<p>Each branch is a new timeline in Neon&rsquo;s custom storage engine. No data is physically copied. The branch simply starts from a pointer to the parent&rsquo;s state at a specific LSN (log sequence number). Pages only diverge when writes happen, so you&rsquo;re billed only for the delta.</p>
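<p>The copy-on-write idea is easier to see in a toy model. The sketch below is my own simplification, not Neon&rsquo;s storage engine: the <code>Branch</code> class and its methods are hypothetical, and I assume a branch can be reduced to a parent pointer, the parent&rsquo;s LSN at fork time, and the pages written locally since then.</p>

```python
class Branch:
    """Toy copy-on-write branch: stores only the pages it has written itself."""

    def __init__(self, parent=None, parent_lsn=0):
        self.parent = parent          # branch we forked from (None for the root)
        self.parent_lsn = parent_lsn  # parent's LSN at the moment of the fork
        self.lsn = parent_lsn         # our own log position
        self.writes = {}              # page_id -> list of (lsn, value) versions

    def write(self, page_id, value):
        self.lsn += 1
        self.writes.setdefault(page_id, []).append((self.lsn, value))

    def read(self, page_id, at_lsn=None):
        at_lsn = self.lsn if at_lsn is None else at_lsn
        # Newest local version that is not newer than the requested LSN.
        for lsn, value in reversed(self.writes.get(page_id, [])):
            if at_lsn >= lsn:
                return value
        # Not written locally: fall through to the parent as of the fork point.
        if self.parent is not None:
            return self.parent.read(page_id, min(at_lsn, self.parent_lsn))
        return None

    def fork(self):
        # Branching copies no pages; it only records a pointer and an LSN.
        return Branch(parent=self, parent_lsn=self.lsn)


main = Branch()
main.write("p1", "a")
main.write("p2", "b")

dev = main.fork()               # instant, regardless of how many pages exist
dev.write("p1", "a-dev")        # only this page diverges
main.write("p2", "b-main")      # later writes on main are invisible to dev

print(dev.read("p1"), dev.read("p2"), main.read("p1"), main.read("p2"))
print(len(dev.writes))          # the branch holds only its own delta
```

<p>The branch sees its own write to <code>p1</code>, still sees the pre-fork value of <code>p2</code>, and stores a single page of its own, which is the delta you would be billed for in this model.</p>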
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Create a branch from the CLI</span>
</span></span><span class="line"><span class="cl">neon branches create --name feature/user-auth
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Branch from a specific point in time</span>
</span></span><span class="line"><span class="cl">neon branches create --name recovery --parent 2025-01-15T10:00:00Z
</span></span></code></pre></td></tr></table>
</div>
</div><p>Neon also supports <strong><a href="https://neon.com/docs/ai/ai-database-versioning" target="_blank" rel="noopener noreffer">snapshots</a></strong> (named, immutable point-in-time saves, like git tags) and <strong>rollback</strong> via <code>finalize_restore: true</code>, which restores a snapshot onto the active branch in place while preserving the connection string, so no reconfiguration is needed. For safe experimentation, <code>finalize_restore: false</code> creates a temporary preview branch instead.</p>
<p>The key limitation: <strong>Neon has no merge support</strong>. Branches diverge but can&rsquo;t be reconciled automatically. Changes are applied back to production using standard migration tools.</p>
<p>Ideal for database-focused workflows where you want instant, full-data branches with production-like data out of the box, and don&rsquo;t need the full backend stack.</p>
<h4 id="dolt-git--mysql">Dolt: Git + MySQL</h4>
<p><a href="https://github.com/dolthub/dolt" target="_blank" rel="noopener noreffer">Dolt</a> is a SQL database that you can fork, clone, branch, merge, push, and pull just like a Git repository. It&rsquo;s a MySQL-compatible database and is fully open-source. Dolt&rsquo;s core approach is a SQL database where every row is versioned, combining Git&rsquo;s commit graph with MySQL&rsquo;s query interface.</p>
<p>Dolt stores data in a <strong>content-addressed graph</strong> using <a href="https://docs.dolthub.com/architecture/storage-engine/prolly-tree" target="_blank" rel="noopener noreffer">Prolly Trees</a>, a novel data structure that enables cell-level version history, efficient structural sharing between versions, and fast diffs and merges.</p>
<p>Every database operation can be committed with:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">employees</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Alice&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">50000</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">CALL</span><span class="w"> </span><span class="n">DOLT_COMMIT</span><span class="p">(</span><span class="s1">&#39;-am&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Add Alice to payroll&#39;</span><span class="p">);</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>The commit creates a snapshot of the entire database state at that moment, stored in the commit graph just like Git. Unlike traditional databases, you can <strong>diff any two versions</strong>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- See what changed between commits
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">DOLT_DIFF</span><span class="p">(</span><span class="s1">&#39;main&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;feature-branch&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;employees&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- Show cell-level changes
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">DOLT_COMMIT_DIFF_employees</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">from_commit</span><span class="o">=</span><span class="s1">&#39;abc123&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">to_commit</span><span class="o">=</span><span class="s1">&#39;def456&#39;</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>This enables <strong>cell-level audit trails</strong> with diffs showing which rows were added/deleted/modified, which cells changed with their before/after values, and who made the change via commit metadata.</p>
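<p>To illustrate what a cell-level diff reports, here is a small sketch in plain Python. It does not use Dolt at all: <code>cell_diff</code> is a hypothetical helper, and I assume each table version is a list of dicts with an <code>id</code> primary key.</p>

```python
def cell_diff(old_rows, new_rows, key="id"):
    """Compare two versions of a table given as lists of dicts keyed by `key`.

    Returns tuples of (change_type, primary_key, before, after).
    """
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    changes = []
    for pk in sorted(set(old) | set(new)):
        if pk not in old:
            changes.append(("added", pk, None, new[pk]))
        elif pk not in new:
            changes.append(("removed", pk, old[pk], None))
        else:
            # Same row on both sides: report each changed cell separately.
            for column in sorted(set(old[pk]) | set(new[pk])):
                before, after = old[pk].get(column), new[pk].get(column)
                if before != after:
                    changes.append(("modified", pk, {column: before}, {column: after}))
    return changes


v1 = [
    {"id": 1, "name": "Alice", "salary": 50000},
    {"id": 2, "name": "Bob", "salary": 42000},
]
v2 = [
    {"id": 1, "name": "Alice", "salary": 55000},
    {"id": 3, "name": "Carol", "salary": 61000},
]

changes = cell_diff(v1, v2)
for change in changes:
    print(change)
```

<p>The output lists one modified cell (Alice&rsquo;s salary with its before and after value), one removed row, and one added row. A real system would attach the commit author and message to each entry, which is where the audit trail comes from.</p>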
<p>Dolt implements Git commands almost literally. You can run <code>dolt</code> with any of these commands: <code>branch feature-123</code>, <code>checkout feature-123</code>, <code>add .</code>, <code>commit -m &quot;Add new customers&quot;</code>, <code>push origin feature-123</code>, <code>checkout main</code>, <code>merge feature-123</code>.</p>
<p>You can even push/pull to DoltHub (like GitHub for databases) or run Dolt as a MySQL replica for existing applications.</p>
<p>Dolt uses <strong>copy-on-write with structural sharing</strong> where unchanged rows are shared between branches via pointers, and modified rows create new leaf nodes in the Prolly Tree.</p>
<p>This means cloning isn&rsquo;t &ldquo;free&rdquo; like with lakeFS, but it provides true database semantics with ACID transactions.</p>
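<p>Structural sharing through content addressing can be sketched in a few lines. This is a deliberate simplification and not a real Prolly Tree: I use fixed-size chunks and a flat list of hashes instead of a tree, and <code>ChunkStore</code> and <code>commit</code> are hypothetical names.</p>

```python
import hashlib
import json


class ChunkStore:
    """Content-addressed store: a chunk's key is the hash of its content."""

    def __init__(self):
        self.chunks = {}  # content hash -> serialized chunk

    def put(self, rows):
        data = json.dumps(rows, sort_keys=True).encode()
        digest = hashlib.sha256(data).hexdigest()
        self.chunks[digest] = data  # identical content is stored only once
        return digest


def commit(store, rows, chunk_size=2):
    """Split rows into fixed-size chunks and return the list of chunk hashes."""
    return [store.put(rows[i:i + chunk_size]) for i in range(0, len(rows), chunk_size)]


rows_v1 = [
    [1, "Alice", 50000], [2, "Bob", 42000],
    [3, "Carol", 61000], [4, "Dan", 38000],
    [5, "Eve", 70000], [6, "Finn", 45000],
]
rows_v2 = [list(row) for row in rows_v1]
rows_v2[3][2] = 40000  # change a single cell in the second chunk

store = ChunkStore()
version_1 = commit(store, rows_v1)
version_2 = commit(store, rows_v2)

shared = set(version_1).intersection(version_2)
print(len(shared), len(store.chunks))
```

<p>Both versions consist of three chunks, but the store holds only four in total: the two untouched chunks are shared and only the modified one is written again. Diffing two versions then comes down to comparing hashes and looking only at the chunks that differ.</p>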
<p>There&rsquo;s much more. Read more on their <a href="https://github.com/dolthub/dolt" target="_blank" rel="noopener noreffer">GitHub</a>.</p>
<blockquote>
<p>[!note] Worth noting</p>
<p><a href="https://docs.doltgres.com" target="_blank" rel="noopener noreffer">DoltgreSQL</a>, the Postgres-compatible version of Dolt, reached Beta in 2025 and is available on Hosted Dolt. If your stack is Postgres-based, DoltgreSQL brings the same Git-like versioning semantics without requiring a MySQL migration.</p>
</blockquote>
<h3 id="analytical-databases--warehouses">Analytical Databases &amp; Warehouses</h3>
<p>These are OLAP-style databases optimized for read-heavy analytical queries.</p>
<h4 id="motherduck">MotherDuck</h4>
<p>MotherDuck, as a cloud data warehouse, implements versioning differently from dedicated Git-for-data tools, prioritizing operational convenience over full version control semantics. With the addition of <strong><a href="https://motherduck.com/docs/concepts/snapshots/" target="_blank" rel="noopener noreffer">named snapshots</a></strong>, it gets even closer to Git-like semantics.</p>
<p>It offers two types of snapshots. <strong>Automatic snapshots</strong> are created continuously in the background (roughly every minute when no writes are active) and provide point-in-time recovery without any manual intervention. Their retention is governed by <code>SNAPSHOT_RETENTION_DAYS</code>, which defaults to 7 days and is configurable up to 90 days on the Business plan.</p>
<p><strong>Named snapshots</strong> are the ones you create explicitly with <code>CREATE SNAPSHOT</code>. They are not subject to garbage collection: they persist indefinitely, even if the source database is deleted. Think of them as <strong>Git tags for your database</strong>, a permanent bookmark of a known-good state you can always return to.</p>
<p>The git analogy maps well:</p>
<ol>
<li><strong><code>CREATE SNAPSHOT</code></strong> → <code>git tag</code>: bookmark a known-good state</li>
<li><strong><code>CREATE DATABASE ... FROM</code></strong> → <code>git checkout -b</code>: isolated environment from a snapshot</li>
<li><strong><code>ALTER DATABASE SET SNAPSHOT TO</code></strong> → <code>git reset --hard</code>: roll back to a previous state</li>
<li><strong><code>UNDROP DATABASE</code></strong> → recovering a deleted branch</li>
</ol>
<p>Combined with <a href="https://motherduck.com/docs/sql-reference/motherduck-sql-reference/create-database/" target="_blank" rel="noopener noreffer">zero-copy cloning</a> and <a href="https://motherduck.com/docs/key-tasks/sharing-data/sharing-overview/" target="_blank" rel="noopener noreffer">database sharing</a>, this enables practical git-like workflows. While MotherDuck doesn&rsquo;t support Git-style merging, <a href="https://motherduck.com/docs/sql-reference/motherduck-sql-reference/copy-database-overwrite/" target="_blank" rel="noopener noreffer"><code>COPY FROM DATABASE (OVERWRITE)</code></a> acts as a replace, somewhat like a merge without conflict resolution. Together with named snapshots, this gives you a practical branch-modify-promote workflow:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- 1. Snapshot production before changes (persists indefinitely)
</span></span></span><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">SNAPSHOT</span><span class="w"> </span><span class="s1">&#39;pre_release_v2&#39;</span><span class="w"> </span><span class="k">OF</span><span class="w"> </span><span class="n">production</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- 2. Clone from that named snapshot to an isolated dev database (instant, zero-copy)
</span></span></span><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">DATABASE</span><span class="w"> </span><span class="n">dev_branch</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">production</span><span class="w"> </span><span class="p">(</span><span class="n">SNAPSHOT_NAME</span><span class="w"> </span><span class="s1">&#39;pre_release_v2&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- Or clone from a point in time: (SNAPSHOT_TIME &#39;2026-01-28 08:00:00&#39;)
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- 3. Make and validate changes on dev_branch
</span></span></span><span class="line"><span class="cl"><span class="c1">-- ... run transforms, test queries ...
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- 4. Promote: overwrite production with dev_branch (instant, metadata-only)
</span></span></span><span class="line"><span class="cl"><span class="k">COPY</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">DATABASE</span><span class="w"> </span><span class="n">dev_branch</span><span class="w"> </span><span class="p">(</span><span class="n">OVERWRITE</span><span class="p">)</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">production</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- 5. If something goes wrong, restore from snapshot
</span></span></span><span class="line"><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">DATABASE</span><span class="w"> </span><span class="n">production</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">SNAPSHOT</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="p">(</span><span class="n">SNAPSHOT_NAME</span><span class="w"> </span><span class="s1">&#39;pre_release_v2&#39;</span><span class="p">);</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>This operates purely at the metadata layer and is nearly instantaneous. It&rsquo;s not a true merge (it&rsquo;s a full replacement, not a diff-based reconciliation), but for many data workflows where you want to validate changes in isolation before promoting them, it covers the key use case.</p>
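<p>What &ldquo;full replacement, not reconciliation&rdquo; means in practice is easiest to see with a toy model. The sketch below is plain Python, not MotherDuck code: databases are modeled as dicts of table pointers, and <code>promote_overwrite</code> is a hypothetical stand-in for the promote step.</p>

```python
def promote_overwrite(production, dev):
    """Full replacement: production becomes an exact copy of dev."""
    return dict(dev)


snapshot = {"orders": "v1", "customers": "v1"}  # state when the snapshot was taken

dev = dict(snapshot)               # clone modeled as a copy of the table pointers
dev["orders"] = "v2-dev"           # change validated on the dev branch

production = dict(snapshot)
production["customers"] = "v1-hotfix"  # meanwhile, production changed too

production = promote_overwrite(production, dev)
print(production)                  # the hotfix on customers is gone

production = dict(snapshot)        # rollback: point back at the named snapshot
print(production)
```

<p>The promote step carries over the validated <code>orders</code> change but silently drops the hotfix made on production after the branch was cut, because nothing is compared or merged. That is fine when production is only ever changed through this workflow, and worth keeping in mind when it isn&rsquo;t.</p>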
<blockquote>
<p>[!example] Deep Dive</p>
<p>If you want to know more about using named snapshots and rolling back to a specific point in time, the blog post <a href="https://motherduck.com/blog/point-in-time-restore/" target="_blank" rel="noopener noreffer">More Control, Less Hassle: Self-Serve Recovery with Point-in-Time Restore</a> goes into more detail.</p>
</blockquote>
<h4 id="ducklake">DuckLake</h4>
<p><a href="https://ducklake.select/" target="_blank" rel="noopener noreffer">DuckLake</a> is the open lakehouse format that uses a SQL database as its metadata catalog instead of JSON/Avro manifest files. DuckLake is relatively new (with 1.0 around the corner and its first release in May 2025), so you could also use more mature open table formats like <a href="https://github.com/apache/iceberg" target="_blank" rel="noopener noreffer">Apache Iceberg</a>, <a href="https://github.com/delta-io/delta" target="_blank" rel="noopener noreffer">Delta Lake</a> or <a href="https://github.com/apache/hudi" target="_blank" rel="noopener noreffer">Apache Hudi</a>.</p>
<p>But DuckLake is relevant for git-like workflows because:</p>
<ol>
<li><strong>Snapshots are Git commits</strong>: Every DuckLake change creates a snapshot with author, commit message, and changeset tracking. This is the closest to actual Git semantics in the data lake world.</li>
<li><strong>SQL-native metadata</strong>: Uses DuckDB/PostgreSQL/MySQL as catalog, so metadata operations are standard SQL transactions. No manifest file scanning or compaction storms like Iceberg.</li>
<li><strong>Millions of snapshots</strong>: Snapshots are just a few rows in the catalog DB. No need to proactively prune snapshots (a major operational burden with Iceberg).</li>
<li><strong>Time travel + change feed</strong>: Query any table at any version, track insertions/deletions between versions.</li>
</ol>
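<p>To show why snapshots are cheap when the catalog is a SQL database, here is a minimal sketch using Python&rsquo;s built-in <code>sqlite3</code>. The two tables and the <code>commit</code> and <code>files_at</code> helpers are hypothetical and much simpler than DuckLake&rsquo;s actual catalog schema; the point is only that a snapshot is a row and time travel is a query.</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE snapshots (snapshot_id INTEGER PRIMARY KEY, author TEXT, message TEXT);
    CREATE TABLE data_files (
        path TEXT,
        begin_snapshot INTEGER,  -- snapshot that added the file
        end_snapshot INTEGER     -- snapshot that removed it (NULL = still live)
    );
""")


def commit(author, message, add=(), remove=()):
    """Record one snapshot and its file changes in a single transaction."""
    with con:  # commits on success, rolls back on error
        cur = con.execute(
            "INSERT INTO snapshots (author, message) VALUES (?, ?)", (author, message)
        )
        sid = cur.lastrowid
        con.executemany(
            "INSERT INTO data_files VALUES (?, ?, NULL)", [(p, sid) for p in add]
        )
        con.executemany(
            "UPDATE data_files SET end_snapshot = ? WHERE path = ? AND end_snapshot IS NULL",
            [(sid, p) for p in remove],
        )
    return sid


def files_at(sid):
    """Time travel: list the files visible as of snapshot `sid`."""
    rows = con.execute(
        "SELECT path FROM data_files WHERE ? >= begin_snapshot "
        "AND (end_snapshot IS NULL OR end_snapshot > ?) ORDER BY path",
        (sid, sid),
    )
    return [path for (path,) in rows]


s1 = commit("simon", "initial load", add=["a.parquet", "b.parquet"])
s2 = commit("simon", "compact", add=["c.parquet"], remove=["a.parquet"])

print(files_at(s1))
print(files_at(s2))
```

<p>Each commit costs a handful of row inserts and updates inside one transaction, and reading an older version is just a filter on snapshot IDs. Nothing is rewritten or scanned on object storage to answer &ldquo;what did the table look like at snapshot 1?&rdquo;.</p>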
<p><strong>With MotherDuck</strong> (fully managed):</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- Fully managed DuckLake on MotherDuck
</span></span></span><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">DATABASE</span><span class="w"> </span><span class="n">my_lake</span><span class="w"> </span><span class="p">(</span><span class="k">TYPE</span><span class="w"> </span><span class="n">DUCKLAKE</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">-- Or bring your own S3 bucket
</span></span></span><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">DATABASE</span><span class="w"> </span><span class="n">my_lake</span><span class="w"> </span><span class="p">(</span><span class="k">TYPE</span><span class="w"> </span><span class="n">DUCKLAKE</span><span class="p">,</span><span class="w"> </span><span class="n">DATA_PATH</span><span class="w"> </span><span class="s1">&#39;s3://my-bucket/lake/&#39;</span><span class="p">);</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><blockquote>
<p>[!example] DuckLake Example</p>
<p>See valuable examples and DuckLake workflows in <a href="https://github.com/matsonj/ducklake-workshop" target="_blank" rel="noopener noreffer">DuckLake workshop</a>.</p>
</blockquote>
<h2 id="related-data-engineering-git-like-workflows">Related Data Engineering Git-like Workflows</h2>
<p>Data storage is the most important part and, because we need to deal with state, also the hardest. But it&rsquo;s not the full picture. That&rsquo;s what DataOps is for.</p>
<p>Data pipelines and their code also need to be deployed on a clone or branch, so how do we do this? One example is orchestration.</p>
<h3 id="orchestration-dagster-branch-deployments">Orchestration: Dagster Branch Deployments</h3>
<p>If we look at the full picture of the data engineering lifecycle, we need more than just storing data in a git-like manner. To support the full lifecycle, ideally everything runs in a git-like style so we can roll back or switch branches. It&rsquo;s great to see that orchestrators like Dagster and others include this functionality.</p>
<p>This means branching applies not only to the data but also to data pipelines, and runs can be triggered automatically. Dagster does this with its cloud solution, integrating GitHub workflows with PRs and actions.</p>
<p>Dagster&rsquo;s core approach is lightweight staging environments created automatically with every pull request that branch both code <em>and</em> data. <strong><a href="https://docs.dagster.io/deployment/dagster-plus/deploying-code/branch-deployments" target="_blank" rel="noopener noreffer">Branch deployments</a></strong> deploy your branch on Dagster+ as a separate deployment. This only works if your underlying technology supports cloning. For example, if you use one of the tools above that supports cloning, Dagster can clone the relevant data into the new branch deployment.</p>
<figure><a target="_blank" href="/blog/git-for-data-tools/dagster.webp" title="">

</a><figcaption class="image-caption">Branch deployment workflow showing how code branches deploy to cloned schema</figcaption>
</figure>
<p>On PR creation, it automatically creates a staging environment for the branch, launches jobs to configure the test environment, including the cloned data(base), and lets you test with parameterized pipelines. If the tests pass, you can approve the PR; it then merges and, with the right CI/CD pipeline, deploys to production automatically.</p>
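<p>The lifecycle can be sketched independently of any orchestrator. The code below is not Dagster&rsquo;s API: <code>handle_pr_event</code> and <code>branch_db_name</code> are hypothetical helpers, and the &ldquo;clone&rdquo; is a plain dict copy standing in for whatever zero-copy clone your storage layer offers.</p>

```python
def branch_db_name(pr_number):
    """Derive a per-PR database name, e.g. pr_42."""
    return f"pr_{pr_number}"


def handle_pr_event(event, pr_number, databases, prod="production"):
    """Apply a pull-request event to a dict mapping database name to its tables."""
    name = branch_db_name(pr_number)
    if event == "opened":
        databases[name] = dict(databases[prod])  # stand-in for a zero-copy clone
    elif event in ("merged", "closed"):
        databases.pop(name, None)                # tear the environment down
    else:
        raise ValueError(f"unknown event: {event}")
    return databases


databases = {"production": {"orders": "v1"}}

handle_pr_event("opened", 42, databases)
databases["pr_42"]["orders"] = "v2"              # the pipeline runs against the clone
print(databases)

handle_pr_event("merged", 42, databases)
print(databases)
```

<p>While the PR is open, the pipeline writes to <code>pr_42</code> and production stays untouched. Once the PR is merged or closed, the clone is dropped, so the environment lives exactly as long as the branch does.</p>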
<p>Orchestrators and other data stack tools depend on cloning support and features such as branching for a true isolated environment. As Nick Schrock noted in the <a href="https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309/" target="_blank" rel="noopener noreffer">Data Engineering Podcast</a>, this is similar to the challenge with Apache Spark where testing locally is nearly impossible. Branch deployments solve this by branching the entire environment.</p>
<p>This is extremely powerful as it replaces the need to copy data locally or set up complex staging environments. You get a true production-like test environment that&rsquo;s automatically created and destroyed with your git workflow. Read more on <a href="https://docs.dagster.io/dagster-plus/managing-deployments/branch-deployments" target="_blank" rel="noopener noreffer">Dagster Branch Deployments</a>.</p>
<h3 id="ai-agents-a-branch-for-testing">AI Agents: A Branch for Testing</h3>
<p>Lastly, this also works well in the realm of AI agents that help us test based on a branch or snapshot. It&rsquo;s similar to <a href="https://git-scm.com/docs/git-worktree" target="_blank" rel="noopener noreffer">git worktree</a> for small git repos with code, where each branch is checked out into a separate folder and we can work on different branches simultaneously without breaking any of the other branches or data.</p>
<p>Once we have a working branch with data <strong>included in isolation</strong>, we can send off an agent autonomously, and let it open a PR to review. This way we have a clear gateway before it goes to production, we can test it on that branch, including its data, and merge when all looks good.</p>
<p>Because each agent works on its own fork, we avoid collisions and can instantly roll back or delete a branch and start again. We also get consistency, since the data is frozen for the agent to work on, and clean debugging, since no other ETL pipelines interfere.</p>
<h2 id="conclusion">Conclusion</h2>
<p>So where does this leave us? In <a href="/blog/git-for-data-theory" rel="">Part 1</a>, we established that Git for data is fundamentally harder than versioning code because we&rsquo;re managing state at massive scale. We learned about the efficiency spectrum, from metadata pointers to full copies, and why zero-copy operations matter.</p>
<p>Now, having explored the actual tools and their approaches to git-like workflows (lakeFS, Dolt, Nessie, MotherDuck, and others in production today), we know a little more about how it all works. Each tool makes different trade-offs, but they all solve the same core problem: how do you version data without copying petabytes?</p>
<p>The answer, to me: <strong>separate metadata from data</strong>. Whether it&rsquo;s lakeFS&rsquo;s random physical addresses, Dolt&rsquo;s Prolly Trees, Nessie&rsquo;s catalog pointers, MotherDuck&rsquo;s zero-copy clones, or Neon&rsquo;s branching feature, they all use clever tricks to make branching instant. Some focus on data lakes, others on databases. Some support full merge workflows, others prioritize instant forking. Your choice depends on your stack:</p>
<ul>
<li>lakeFS and Nessie excel at data lake branching with zero-copy efficiency</li>
<li>Dolt brings true Git semantics to SQL databases</li>
<li>MotherDuck offers named snapshots and zero-copy clones for cloud data warehousing, with DuckLake adding SQL-native time travel</li>
<li>Bauplan focuses on versioned pipelines and ML experiment reproducibility</li>
<li>Neon and Supabase provide branch/fork-based workflows for isolated testing</li>
</ul>
<p>The ecosystem is still evolving. Maturity varies across tools, and each works around the limitations of fitting data into a git-like workflow in its own way. Some trade merge capabilities for instant forking. Others require infrastructure changes. The key is picking what fits your workflow and scale.</p>
<p><strong>Start small.</strong> You don&rsquo;t need to instrument your entire stack overnight. Look at your recent production incidents: which pipelines caused them? Those are your highest-risk areas. Add branching there first. Test changes on prod-like data before deploying. Build confidence through small wins, then expand.</p>
<p>We want to bring the same <strong>confidence</strong> we have with code versioning to the stateful world of data. And with tools like Dagster&rsquo;s branch deployments and emerging AI agent workflows, we&rsquo;re seeing Git-like patterns extend beyond just data storage into the full data engineering lifecycle.</p>
<p>Git-like workflows are becoming table stakes. Maybe not today or tomorrow, but with the right tools and changes in workflow we can achieve significantly better change management, testing on production data, fast rollbacks, isolated experiments, and most importantly, peace of mind when deploying changes.</p>
<p>That&rsquo;s the promise. What&rsquo;s your experience? Have you tried it? Do you run any of the above in production? I&rsquo;m curious to hear more.</p>
<h2 id="appendix">Appendix</h2>
<p>While I was writing this article back in November 2025, Tigris was an interesting database contender with Supabase-like features such as forked buckets and zero-copy clones. By the time of publishing, however, the <a href="https://github.com/tigrisdata-archive/tigris" target="_blank" rel="noopener noreffer">GitHub repo</a> had been archived, so I removed it from the comparison in this article.</p>
<hr>
<pre class=""><em>Full article published at <a href="https://motherduck.com/blog/git-for-data-part-2/" target="_blank" rel="noopener noreferrer">MotherDuck.com</a> - written as part of <a href="/services">my services</a></em></pre>
]]></description>
</item>
<item>
    <title>Building an Obsidian RAG with DuckDB and MotherDuck</title>
    <link>https://www.ssp.sh/blog/obsidian-rag-duckdb-sql/</link>
    <pubDate>Fri, 13 Feb 2026 00:00:08 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/obsidian-rag-duckdb-sql/</guid><enclosure url="https://www.ssp.sh/blog/obsidian-rag-duckdb-sql/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>I always wanted a personal knowledge assistant based on my notes. One that uses Obsidian&rsquo;s backlinks and connections to surface ideas I&rsquo;ve forgotten or never thought to link together.</p>
<p>So I built one. A RAG system that runs locally with DuckDB as a <a href="/blog/vector-technologies-ai-data-stack/" rel="">vector database</a>, then syncs to MotherDuck for a serverless web app running entirely in the browser via WASM. Think of it like J.A.R.V.I.S<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> for your markdown files: search about a topic, and it shows connected notes up to two hops away, semantically similar content, and hidden connections between ideas that share no direct links.</p>
<p>In this article, I walk through how I built this and how it works, from using DuckDB&rsquo;s vector extension locally to serving embeddings through MotherDuck&rsquo;s WASM client. Along the way, you&rsquo;ll see how data engineering skills can put a large collection of Markdown notes to use. If you want to dive straight into the code, it&rsquo;s all on GitHub at <a href="https://github.com/sspaeti/obsidian-note-taking-assistant" target="_blank" rel="noopener noreffer">Obsidian-note-taking-assistant</a>, and you can try the web app on my public notes at <a href="https://explore.ssp.sh" target="_blank" rel="noopener noreffer">Explore RAG</a>.</p>
<p>For building the web app I used Claude Code, and it came together in a few hours using <code>plan mode</code>. This approach is powerful for any data engineer building pipelines or related work, especially when you have a clear vision of what you want. In my opinion, the big productivity boost didn&rsquo;t come only from the models getting smarter, but from something else. More on that later in the article.</p>
<p>This is how it looks. Let&rsquo;s talk about how I built it and take a look behind the scenes.<br>
<figure><a target="_blank" href="/blog/obsidian-rag-duckdb-sql/output3.gif" title="">

</a><figcaption class="image-caption">Short showcase of the web app, working locally or as shown here published on Vercel</figcaption>
</figure></p>
<h2 id="vision--why-i-built-this">Vision &amp; Why I Built This</h2>
<p>I have 8963 local notes (according to <code>find . -type f -name '*.md' | wc -l</code>) in my Obsidian vault. Some are very long, and many have images and PDFs attached. Wouldn&rsquo;t it be nice to resurface an insight from my own thinking a while back, some quotes I forgot<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>, or things I didn&rsquo;t think of?</p>
<p>The requirements I set myself were to use Obsidian backlinks, as these are already curated and structured like a graph. I wanted to see notes that are multiple hops away and hard to spot without a tool. I wanted to search for non-obvious neighbors or similarities and to surface hidden connections that would be interesting, both locally and online. These are especially helpful in the brainstorming phase when starting an article or a note, giving me new ideas from notes I wrote at some point in my life.</p>
<p>Examples could look like this:</p>
<blockquote>
<p>Show me my notes on Functional Data Engineering that relate to my current article (one or two hops)</p>
</blockquote>
<blockquote>
<p>Notes that are relevant from my vault. Or related ideas</p>
</blockquote>
<blockquote>
<p>Highlight any disagreements between the notes</p>
</blockquote>
<blockquote>
<p>Give me all notes I took on these matters and related, and give me the source note from my Obsidian vault</p>
</blockquote>
<p>Such a tool is especially helpful during brainstorming when writing my articles, or when I journal some ideas or when solving a hard problem. All of this should be local, but also available as a web app, so I can share it with you and connect it to my public second brain.</p>
<h3 id="starting-position">Starting Position</h3>
<p>Obsidian has many plugins, such as <a href="https://github.com/SkepticMystic/graph-analysis" target="_blank" rel="noopener noreffer">Graph Analysis</a>, <a href="https://github.com/brianpetro/obsidian-smart-connections" target="_blank" rel="noopener noreffer">Obsidian Smart Connections</a> and many more, that let you do similar things. But some require hooking up a public AI provider, don&rsquo;t work very well anymore, or don&rsquo;t do exactly what I wanted.</p>
<p>The easiest would be to use Claude Code or any other agent, as it&rsquo;s just Markdown files, but then you <strong>give away all your sensitive, potentially insightful notes</strong> and thoughts. That&rsquo;s why I wanted to build an Obsidian knowledge assistant that is built on my own data. I started with a simple Retrieval-Augmented Generation (RAG) system that uses DuckDB with the <a href="https://duckdb.org/docs/stable/core_extensions/vss" target="_blank" rel="noopener noreffer">Vector Similarity Search Extension</a> for storing vectors, and did a couple of tests with Claude Code.</p>
<p>I shared it online and got <a href="https://www.linkedin.com/feed/update/urn:li:activity:7417544619158171648?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7417544619158171648%2C7417588137956245506%29&amp;replyUrn=urn%3Ali%3Acomment%3A%28activity%3A7417544619158171648%2C7417601077690351616%29&amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287417588137956245506%2Curn%3Ali%3Aactivity%3A7417544619158171648%29&amp;dashReplyUrn=urn%3Ali%3Afsd_comment%3A%287417601077690351616%2Curn%3Ali%3Aactivity%3A7417544619158171648%29" target="_blank" rel="noopener noreffer">helpful feedback</a> to use a specific model <a href="https://huggingface.co/BAAI/bge-m3" target="_blank" rel="noopener noreffer">bge-m3</a> and integrated it as much as possible with the help of agents. I added the above requirements that it should use Obsidian native links and train based on my vault.</p>
<p>This was my first round: building a job that creates chunks and ingests them into DuckDB with the <a href="https://duckdb.org/docs/stable/core_extensions/vss" target="_blank" rel="noopener noreffer">Vector Similarity Search Extension</a>.</p>
<p>I used two different modes, as BGE-M3 takes more time to generate embeddings. I ran BGE-M3 overnight and it finished after ~2 hours, not on all my notes, but on my 584 public notes.</p>
<figure><a target="_blank" href="/blog/obsidian-rag-duckdb-sql/btop.webp" title="">

</a><figcaption class="image-caption">Running btop as activity overview while running the ingestion and creating embeddings on my laptop - Using mostly CPU at 45%</figcaption>
</figure>
<h3 id="local-first">Local-First</h3>
<p>I started with the local-first approach because I want to be independent, and also I have sensitive or valuable notes that I don&rsquo;t just want to give away or upload to the cloud.</p>
<p>But there are also other reasons why you might want to use a local model. Some say:</p>
<blockquote>
<p>A.I. research done by a cloud service will hallucinate because you have <strong>no control over the weights or limits of the LLM</strong>. This is why anyone who wants to do A.I. should run their projects locally including Deep Research. <a href="https://bsky.app/profile/gostack.bsky.social/post/3mdcvdzglus2a" target="_blank" rel="noopener noreffer">Bsky</a></p>
</blockquote>
<p>Additionally, a local model with lots of your own context to research with will be better suited to your use case. That doesn&rsquo;t mean it won&rsquo;t hallucinate, but what I find most useful is that suggestions and ideas are based on my own notes, some of which I had forgotten, and that new ideas are combinations drawn from my own research.</p>
<h3 id="web-app">Web App</h3>
<p>I added a web app that uploads the generated embeddings to MotherDuck and uses <a href="https://duckdb.org/docs/stable/clients/wasm/overview" target="_blank" rel="noopener noreffer">DuckDB WASM</a> to serve in the client (web browser), so I could share the findings easily with anyone interested in my second brain notes.</p>
<p>This went really well, and I share all the details at the end of this article, with some lessons learned and how you can do it for yourself too.</p>
<h2 id="knowledge-assistant-building-a-rag-for-data-engineers">Knowledge Assistant: Building a RAG for Data Engineers</h2>
<p>Now let&rsquo;s get to the building part. As initially explained, this project turns data engineering knowledge into a searchable tool, hopefully surfacing new insights and related topics, and teaching us something new.</p>
<p>This is now done on top of my <a href="https://www.ssp.sh/brain" target="_blank" rel="noopener noreffer">public (mostly) data engineering notes</a>, but we might add code snippets, interesting quotes, etc. To me, all of these are just text files, mostly Markdown. That&rsquo;s why a system based on text files is so powerful: we can use it as context to help us more.</p>
<p>The outcome and connected web app looks like this:</p>
<h3 id="what-we-built-retrieval-without-the-llm">What We Built: Retrieval Without the LLM</h3>
<p>We built a <a href="https://motherduck.com/blog/search-using-duckdb-part-2/" target="_blank" rel="noopener noreffer">Retrieval-Augmented Generation (RAG)</a> style system on top of our notes (we use Markdown). More specifically, Obsidian Markdown, which has the advantage of links and backlinks that give us additional clues we can use.</p>
<p>RAG is a technique that can provide more accurate answers than a generative large language model (LLM) on its own, because it draws on knowledge external to what the model already contains.</p>
<p>What we built is only the retrieval and augmentation part. We don&rsquo;t use an LLM yet, only retrieval of relevant and hidden notes based on a search: notes, code snippets within notes, and other relevant ideas.</p>
<h3 id="architecture-with-embed-model-motherduck-and-nextjs">Architecture with Embed Model, MotherDuck and Next.js</h3>
<p>First I had to split my notes into separate chunks and connect relevant links.<br>
Each chunk then goes through an embedding model that converts text into numerical vectors, so we can compare meaning rather than just keywords.</p>
<p>This runs locally and two models can be used: <strong>all-MiniLM-L6-v2</strong> (384 dimensions, fast for testing) and <strong>BAAI/bge-m3</strong> (1024 dimensions, production quality). This is the top-level Python code in the GitHub repo. It <strong>provides a CLI and DuckDB database</strong> where we can search semantically, discover hidden notes, or traverse connected notes up to two hops away.</p>
<p>The chunking is markdown-aware: it respects heading boundaries, preserves code blocks intact, and splits on paragraph breaks. Each chunk stays around <strong>512 characters</strong> and carries its heading context along. Before embedding, I prepend the note title and section heading to each chunk (e.g., <code>&quot;Title: DuckDB | Section: Installation | actual content...&quot;</code>).</p>
<p>This acts as a semantic anchor and noticeably improves retrieval quality.</p>
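<p>To illustrate the idea, here is a minimal Python sketch of heading-aware chunking with the title and section prefix. It is a simplification and not the code from the repo: <code>chunk_note</code>, the <code>Introduction</code> fallback label, and the use of <code>textwrap</code> are my assumptions for this example, and it does not keep code blocks intact.</p>
<pre><code class="language-python">import re
import textwrap

# Matches Markdown headings such as "## Installation"
HEADING = re.compile(r"^#{1,6}\s+(.+)$")


def chunk_note(title, markdown, max_chars=512):
    """Split a note on headings and blank lines, then prefix every chunk."""
    section = "Introduction"  # hypothetical label for text before the first heading
    paragraph = []
    chunks = []

    def flush():
        # Turn the collected paragraph lines into one or more prefixed chunks.
        text = " ".join(paragraph).strip()
        paragraph.clear()
        # textwrap.wrap breaks at word boundaries and returns [] for empty text.
        for piece in textwrap.wrap(text, width=max_chars):
            chunks.append(f"Title: {title} | Section: {section} | {piece}")

    for line in markdown.splitlines():
        heading = HEADING.match(line)
        if heading:
            flush()
            section = heading.group(1).strip()
        elif not line.strip():
            flush()  # a blank line ends the current paragraph
        else:
            paragraph.append(line.strip())
    flush()
    return chunks


if __name__ == "__main__":
    note = "Intro text.\n\n## Installation\nRun pip install duckdb."
    for chunk in chunk_note("DuckDB", note):
        print(chunk)
</code></pre>
<p>The prefix is not counted toward <code>max_chars</code> here, so the embedded text is slightly longer than the chunk body itself.</p>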
<p>Disclaimer: I don&rsquo;t have deep expertise in building RAG systems and semantic search, so this is built on the best of my knowledge and what helps me most in my daily work.</p>
<p>The ingestion pipeline creates these tables with relevant information:</p>
<ul>
<li>notes: Note metadata, content, frontmatter</li>
<li>links: Wikilink graph edges</li>
<li>chunks: Chunked content for RAG retrieval</li>
<li>embeddings: 1024-dim vectors (BAAI/bge-m3)</li>
<li>hyperedges: Multiway relations (tags, folders)</li>
<li>hyperedge_members: Note membership in hyperedges</li>
</ul>
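<p>As an example of how the <code>links</code> table could be filled, here is a small Python sketch that extracts wikilink edges from a note. The regular expression and the <code>extract_links</code> helper are assumptions for illustration, not the parser used in the repo.</p>
<pre><code class="language-python">import re

# Handles [[Target]], [[Target|alias]] and [[Target#Heading]].
# Embeds written with a leading "!" match as well.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def extract_links(source, markdown):
    """Return (source, target) edges for every distinct wikilink in a note."""
    targets = (match.group(1).strip() for match in WIKILINK.finditer(markdown))
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return [(source, target) for target in dict.fromkeys(targets)]


if __name__ == "__main__":
    text = "See [[DuckDB]] and [[Data Modeling|modeling]]."
    print(extract_links("My Note", text))
</code></pre>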
<p>The second part is a <strong>web app</strong> served via a Next.js UI and a MotherDuck WASM client that connects directly to the MotherDuck cloud database from the browser.</p>
<p>This means no database server to set up or maintain. I added a FastAPI service on Railway to serve the BGE-M3 embedding model, which avoids API costs from Hugging Face (and also makes it reliable, since Hugging Face&rsquo;s inference API kept timing out with the BGE-M3 model).</p>
<p>The architecture uses mostly serverless components:<br>
<figure>
<a target="_blank" href="/blog/obsidian-rag-duckdb-sql/mermaid.png" title="/blog/obsidian-rag-duckdb-sql/mermaid.png">

</a><figcaption class="image-caption">Simple Architecture of this Project</figcaption>
</figure></p>
<p>Semantic search matches <strong>meaning</strong>, not keywords. When I search for &ldquo;how to model data in a warehouse,&rdquo; I want notes about dimensional modeling or dbt transformations to show up, even if they never use those exact words.</p>
<p>The BGE-M3 model converts each chunk into a 1024-dimensional vector, and we rank results by <strong>cosine similarity</strong> between the query and stored embeddings. Locally, DuckDB&rsquo;s VSS extension handles this with an HNSW index.</p>
<p>In the web app, MotherDuck&rsquo;s WASM client <a href="https://motherduck.com/docs/key-tasks/ai-and-motherduck/text-search-in-motherduck/#embedding-based-search" target="_blank" rel="noopener noreffer">doesn&rsquo;t have VSS</a>, so I compute cosine similarity manually with DuckDB&rsquo;s list functions. I was surprised how well DuckDB handles this without a dedicated vector database: one file holds relational data and vectors together.</p>
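<p>To make the manual computation concrete, here is the same arithmetic as a minimal Python sketch. It is not the SQL that the web app runs; <code>cosine_similarity</code>, <code>top_k</code>, and the tiny example vectors are hypothetical and only illustrate the ranking step.</p>
<pre><code class="language-python">import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k=3):
    """Rank stored chunk vectors by similarity to the query vector."""
    scored = [
        (chunk_id, cosine_similarity(query, vector))
        for chunk_id, vector in embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Real vectors have 1024 dimensions; two are enough to show the ranking.
    stored = {"chunk-a": [1.0, 0.1], "chunk-b": [0.0, 1.0], "chunk-c": [0.9, 0.5]}
    print(top_k([1.0, 0.0], stored, k=2))
</code></pre>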
<p>The &ldquo;graph-boosted search&rdquo; mode multiplies similarity by 1.2x for notes that are also graph-connected. Simple, but it surfaces better results because your link structure encodes intent that embeddings alone miss.</p>
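<p>A sketch of that boost in Python, under the assumption that &ldquo;graph-connected&rdquo; means the note is linked to a note we already care about. The <code>graph_boosted</code> function and its inputs are hypothetical names for this example.</p>
<pre><code class="language-python">def graph_boosted(results, connected, boost=1.2):
    """Re-rank (note, similarity) pairs, boosting notes found in `connected`.

    `connected` is assumed to be the set of notes that share a wikilink
    with the note or top hit we started from.
    """
    boosted = [
        (note, score * boost if note in connected else score)
        for note, score in results
    ]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    hits = [("Note A", 0.80), ("Note B", 0.70), ("Note C", 0.60)]
    print(graph_boosted(hits, {"Note B"}))
</code></pre>
<p>In this toy input, the linked note overtakes a slightly more similar but unlinked one, which is the effect the boost is meant to have.</p>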
<p>And the hidden connections feature, finding semantically close notes with no direct wikilink, turned out to be the most useful discovery tool.</p>
<p>It found links between notes I&rsquo;d written months apart and never thought to connect.</p>
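<p>The idea behind hidden connections can be sketched in a few lines of Python. This assumes one vector per note (for example an average over its chunks) and non-zero vectors; <code>hidden_connections</code> and the sample data are hypothetical, and the pairwise loop is only practical for small vaults.</p>
<pre><code class="language-python">import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def hidden_connections(note_vectors, links, k=5):
    """Return the k most similar note pairs that share no direct wikilink."""
    # frozenset makes the check independent of link direction.
    linked = {frozenset(edge) for edge in links}
    candidates = [
        (a, b, cosine(note_vectors[a], note_vectors[b]))
        for a, b in combinations(sorted(note_vectors), 2)
        if frozenset((a, b)) not in linked
    ]
    return sorted(candidates, key=lambda row: row[2], reverse=True)[:k]


if __name__ == "__main__":
    vectors = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
    print(hidden_connections(vectors, [("A", "B")], k=1))
</code></pre>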
<h3 id="running-it-on-your-own-vault">Running It on Your Own Vault</h3>
<p>We constantly add to and improve our &ldquo;second brain&rdquo;, so it is very powerful that we can just rerun the ingestion and get the update.</p>
<p>This is built on my data, but you can use the <a href="https://github.com/sspaeti/obsidian-note-taking-assistant" target="_blank" rel="noopener noreffer">provided GitHub repo</a> and run the local <code>make ingest</code> job to run it on your own Obsidian vault or Markdown files. You&rsquo;ll get the same UI and CLI to ask questions about your notes out of the box.</p>
<p>The results are tailored to our interests, needs, and even notes, as we are the ones who wrote them down. The same holds if you took a lot of highlights via web clippers such as Readwise (read-it-later) or the Obsidian Web Clipper: they come from other authors, but they are still snippets that you chose to store.</p>
<p>To run it on your own notes, clone the repo, set <code>VAULT_PATH</code> in the <code>.env</code> file to your Obsidian vault (or any folder of Markdown files), and run <code>make ingest</code>.</p>
<p>The ingestion parses all <code>.md</code> files, chunks them, generates embeddings with the BGE-M3 model, and stores everything in a local DuckDB file. From there you have the full CLI with semantic search, backlinks, connections, and hidden link discovery.</p>
<p>If you want the web UI too, sync to MotherDuck with <code>make sync-motherduck</code> and deploy the Next.js app.</p>
<h3 id="the-final-result">The Final Result</h3>
<p>The result of this exercise is two parts with sub-components like this:</p>
<ul>
<li><strong>Ingestion pipeline</strong>: A local job that parses Obsidian markdown, chunks it, and generates embeddings using the BGE-M3 model. Run <code>make ingest</code> and the local DuckDB file is ready to query.</li>
<li><strong>Web app</strong> at <a href="https://explore.ssp.sh" target="_blank" rel="noopener noreffer">explore.ssp.sh</a>, composed of three services:
<ul>
<li><strong>Frontend</strong> on Vercel: Next.js app with MotherDuck WASM client running DuckDB queries directly in the browser.</li>
<li><strong>Database on MotherDuck</strong>: Cloud-hosted DuckDB, synced from local via <code>make sync-motherduck</code>. No server to manage.</li>
<li><strong>Embedding microservice on Railway</strong>: A FastAPI endpoint that hosts the BGE-M3 model and converts search queries into vectors on demand. The browser sends your search text, gets back a 1024-dim embedding, and uses it to query MotherDuck for similar chunks. This avoids running a ~1.8GB model in the browser and sidesteps Hugging Face API rate limits.</li>
</ul>
</li>
</ul>
<p>Here you can see backlinks and hops that span two notes. The hops are interesting because they are hard to spot, or to showcase, on a graph. That&rsquo;s why I added them alongside the normal backlinks and outgoing links.</p>
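<p>For illustration, a two-hop lookup over the wikilink edges can be sketched in Python like this. The <code>two_hop_neighbors</code> function is a hypothetical helper, and treating links as undirected (so backlinks count too) is my assumption for the example.</p>
<pre><code class="language-python">from collections import defaultdict


def two_hop_neighbors(links, start):
    """Return (one_hop, two_hop) sets of notes reachable from `start`."""
    graph = defaultdict(set)
    for source, target in links:
        # Store both directions so outgoing links and backlinks both count.
        graph[source].add(target)
        graph[target].add(source)

    one_hop = graph[start] - {start}
    two_hop = set()
    for note in one_hop:
        two_hop |= graph[note]
    # Keep only notes that are not already direct neighbors or the start note.
    return one_hop, two_hop - one_hop - {start}


if __name__ == "__main__":
    edges = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "A")]
    direct, second = two_hop_neighbors(edges, "A")
    print(sorted(direct), sorted(second))
</code></pre>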
<p>Find hidden connections. Here we see that AT Protocol, the protocol behind the social media platform Bluesky and others, is connected to DuckLake. Something I wouldn&rsquo;t have associated myself.</p>
<p>Now we can compare the notes, think about why they are related, and see what connection and insight we can gain from it. This is exactly why I built this: to get such insights.</p>
<blockquote>
<p>[!info] Clickable Links</p>
<p>Each note on <a href="https://explore.ssp.sh" target="_blank" rel="noopener noreffer">explore.ssp.sh</a> has a clickable link to my public brain at <code>ssp.sh/brain/[note-name]</code>.</p>
</blockquote>
<h2 id="lessons-learned-ai-agents-for-data-engineers">Lessons Learned: AI Agents for Data Engineers</h2>
<p>As you have probably noticed, the AI hype, or enthusiasm, around agents has gotten very loud since the Christmas break. One reason is that many people had a good amount of time to actually test the latest tools. Second, the models got better. And third, the AI companies shipped new features such as Skills, Cowork, and many more.</p>
<p>I also took some time to think about how we can leverage agents for data engineering, especially Claude Code. Contrary to the many who credit the models getting much better, I think the key to the productivity boost is a different one. With <a href="https://getnao.io/" target="_blank" rel="noopener noreffer">nao</a>, ChatGPT, Claude, and probably others, we have had AI agents and models for a while, but the most powerful feature at the moment is agents in <code>plan mode</code>. It&rsquo;s the key to letting agents build for longer while keeping us more in the loop.</p>
<p>But what is &ldquo;Plan Mode&rdquo; you might ask? The definition:</p>
<blockquote>
<p>Claude Plan Mode is a read-only state in Claude Code, an AI coding assistant, that lets it analyze a codebase, ask clarifying questions, and generate detailed implementation plans without making any actual file changes or executing commands, ensuring safety and structure before development begins. It&rsquo;s activated by cycling modes (often Shift+Tab) and is great for exploring, planning complex changes, and building context, allowing developers to approve the AI&rsquo;s strategy before actual coding starts. More on <a href="https://lucumr.pocoo.org/2025/12/17/what-is-plan-mode/" target="_blank" rel="noopener noreffer">What Actually Is Claude Code’s Plan Mode?</a></p>
</blockquote>
<p>With that, it&rsquo;s amazing what you can build. All the open todos in our backlog can now be built, tested, or solved quickly, and we think through the problem by actually laying out the step-by-step instructions. Once it&rsquo;s built, we get a feel for it quickly and can give better feedback on whatever job we have at hand.</p>
<p>Still, we need to be careful not to jump into building every little thing just because we can. Spending hours on something we don&rsquo;t need is still wasting precious time.</p>
<p>I have experienced this myself often. I feel super productive, but after a couple of hours, or sometimes days, I haven&rsquo;t actually achieved what I needed. The idea I thought was cool didn&rsquo;t go anywhere, and I&rsquo;m mentally more exhausted because I didn&rsquo;t do the heavy lifting myself, meaning I don&rsquo;t really understand what was generated. And potentially I didn&rsquo;t learn anything new either.</p>
<p>With that in mind, we need to be deliberate about when to use the new tools: certainly not always, but there are many good ways. So how else should we use agents and AI as data engineers and knowledge workers?</p>
<blockquote>
<p>[!note] Plan Mode Support</p>
<p>Besides Claude, Plan Mode is widely adopted across AI coding assistants including <a href="https://cursor.com/blog/plan-mode" target="_blank" rel="noopener noreffer">Cursor</a> (October 2025), <a href="https://windsurf.com/blog/windsurf-wave-10-planning-mode" target="_blank" rel="noopener noreffer">Windsurf</a> (June 2025), <a href="https://github.blog/changelog/2025-11-18-plan-mode-in-github-copilot-now-in-public-preview-in-jetbrains-eclipse-and-xcode/" target="_blank" rel="noopener noreffer">GitHub Copilot</a> (VS Code, Visual Studio, JetBrains, Eclipse, Xcode), <a href="https://docs.lovable.dev/features/plan-mode" target="_blank" rel="noopener noreffer">Lovable</a>, <a href="https://support.bolt.new/best-practices/discussion-mode" target="_blank" rel="noopener noreffer">Bolt.new</a>, and <a href="https://blog.replit.com/introducing-plan-mode-a-safer-way-to-vibe-code" target="_blank" rel="noopener noreffer">Replit</a> (September 2025). Everyone is following a similar pattern of letting AI analyze, ask clarifying questions, and propose structured implementation plans before writing any code.</p>
</blockquote>
<h3 id="plan-mode-and-how-we-work-best-with-ai-agents">Plan Mode: And How We Work Best with AI Agents</h3>
<p>This is how we humans work best as well. We make a plan, and then execute it and adjust along the way. But it&rsquo;s also a great way to work with juniors, and in that sense, AI agents.</p>
<p>We say what we want in an abstract manner, and the agent lays out what it would do as a plan (just a markdown file, markdown runs the world these days). Then we, as the <strong>senior, or the designer or architect</strong>, can see where it misread our intent (language is not precise) and refine the plan until all the details are right. This way we know it does what we expect it to do. And then it goes off and does it autonomously with access to the terminal and all command line tools.</p>
<p>But there&rsquo;s one more factor: the human factor. Whatever it builds, it builds from its training data. So it will use what most people use. That might be ok for most cases, but maybe not if you want to build something unique or innovative. That&rsquo;s why I think, for most writers, it&rsquo;s not the right tool to write for us. Partly for that reason, but even more so because the character and soul of the person gets stripped away. The quirky things someone does, which make them who they are, disappear, and that <strong>takes away from the fun</strong> of writing.</p>
<p>Obviously in coding, this is not the same. Except if you are another programmer who needs to read the code, no? Any data engineer would rather read code from a human than from an AI, whose code is kind of boring. But maybe it just needs to do the job, and not all human code is beautiful either, right?</p>
<blockquote>
<p>[!note] See the Prompt for the Web App</p>
<p>If you want to know how I built the web app without having experience in Next.js, I am sharing the <a href="https://github.com/sspaeti/obsidian-note-taking-assistant/blob/main/web-app/prompts/agents-webapp.md" target="_blank" rel="noopener noreffer">initial prompt</a> with plan mode that could be interesting. The summary of the full session (ca. 3-4 hours) is at <a href="https://github.com/sspaeti/obsidian-note-taking-assistant/blob/main/web-app/prompts/build-summary.md" target="_blank" rel="noopener noreffer">build-summary.md</a>.</p>
</blockquote>
<h3 id="where-are-we-heading">Where Are We Heading?</h3>
<p>So what about data engineering? Where are we today?</p>
<p>As I have written extensively in <a href="https://www.rilldata.com/blog/has-self-serve-bi-finally-arrived-thanks-to-ai?" target="_blank" rel="noopener noreffer">Self-serve BI thanks to AI</a> and in my piece on <a href="https://www.rilldata.com/blog/data-modeling-for-the-agentic-era-semantics-speed-and-stewardship" target="_blank" rel="noopener noreffer">data modeling along with semantics, speed, and stewardship</a>, humans still need to be in the loop, and we need to be careful not to generate too much (ingestion logic, business logic, general code, or dashboards) that is unmaintainable or never needed in the first place.</p>
<p>On the other hand, there&rsquo;s no definitive answer right now; we are all just figuring it out. That&rsquo;s why some say these are the most exciting times, because everything is supposedly going to change. <a href="https://x.com/karpathy/status/2004607146781278521" target="_blank" rel="noopener noreffer">Andrej Karpathy</a> said:</p>
<blockquote>
<p>Clearly some powerful alien tool was handed around except it comes with no manual and everyone has to figure out how to hold it and operate it, while the resulting magnitude 9 earthquake is rocking the profession.</p>
</blockquote>
<p>As a writer but also data engineer, I find it most useful when it suggests notes and ideas I have forgotten about that are relevant to my current task at hand. Or a <strong>snippet of code</strong>.</p>
<h3 id="repeating-code-snippets-over-and-over">Repeating Code Snippets over and over</h3>
<p>How many times have we written an ingestion pipeline that does the same thing, just for a different source? Or written an incremental update pipeline, a full load, or an implementation of Slowly Changing Dimensions (Type 2)?</p>
<p>Wouldn&rsquo;t it be great to have a tool that helps us remember and suggest code that worked for a problem at hand? No wonder Windows has a built-in <a href="https://support.microsoft.com/en-us/windows/retrace-your-steps-with-recall-aa03f8a0-a78b-4b3e-b0a1-2eb8ac48701c" target="_blank" rel="noopener noreffer">Windows Recall</a> feature that takes snapshots of everything we do, so we can see and remember what we did. Google traces where we went on <a href="https://www.google.com/maps/timeline" target="_blank" rel="noopener noreffer">Google Maps Timeline</a>, and so on. Not saying all of these are good, but clearly there&rsquo;s a need for it.</p>
<h3 id="vibe-coding">Vibe Coding</h3>
<p>These tasks are mostly called <strong>vibe coding</strong> these days. I believe vibe coding works best when an existing framework is present that the agent can extend. E.g., extending a website skeleton with a pre-existing structure works much better than starting from scratch, especially for maintainability.</p>
<p>Also, the more it has to guess on its own, the more likely it is to introduce errors, compared to you providing a big skeleton with all the needed files and having it only extend the functionality.</p>
<p>The same holds for data engineering. The Declarative Data Stack, or the YAML engineer, is exactly that. A well-designed YAML with a powerful system in the backend can go a long way with an agentic, vibe-coded approach.</p>
<p>It&rsquo;s similar to <a href="https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html" target="_blank" rel="noopener noreffer">Spec Driven Development (SDD)</a>, which is when we write our instructions in <code>claude.md</code> and Claude or any AI agents implement this. Also what <a href="https://www.linkedin.com/posts/escoo_ive-been-writing-99-of-my-code-at-airbnb-activity-7419777912096120832-f4fh?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAABkA2pgBYM4xDO0z2ChYuxFhBfu4h7jp4Lo" target="_blank" rel="noopener noreffer">Esco Obong</a> said about what they do at Airbnb: the hard part is coming up with the spec, talking to business, etc. The coding part is the small part.</p>
<p>And this is also where the human is still dearly needed, in my opinion. A human in the driver&rsquo;s seat and config-driven development is what it comes down to with AI agents. Plus, AI models have a context limit. Sure, we humans do too, but we can think more across domains and understand intuitive things that might not work for a statistical model.</p>
<p>This shows how that works, and why Markdown is in the middle of everything. Not only for the knowledge, but also to build and develop things.</p>
<blockquote>
<p>[!tip] Lifehack for Prompting</p>
<p>Always keep it simple, because <strong>it&rsquo;s easy to make it complex</strong>. The true beauty lies in making it simple, which is something agents are not good at.</p>
</blockquote>
<h3 id="use-mcp">Use MCP</h3>
<p>A key was using MotherDuck MCP with a direct connection from Claude Code to the database while prompting the initial version. Claude could directly query the database and its columns to implement the actual web app (see the initial prompt <a href="https://github.com/sspaeti/obsidian-note-taking-assistant/blob/main/web-app/prompts/agents-webapp.md" target="_blank" rel="noopener noreffer">here</a>).</p>
<p>This means Claude (in my case) could query the database directly: run <code>SHOW TABLES</code>, select from the tables, and extract their data types. Beyond that, it could learn about the content and graph relationships that I had built in the first part.</p>
<p>So Claude could easily build a first version based on my instructions and the existing DuckDB database. I also shared the great docs on building customer-facing analytics, the <a href="https://motherduck.com/docs/key-tasks/customer-facing-analytics/3-tier-cfa-guide/" target="_blank" rel="noopener noreffer">Customer-Facing Analytics Guide in a (3-tier Architecture)</a>.</p>
<p>With that, I almost had my web app ready with a single <code>plan mode</code> prompt.</p>
<blockquote>
<p>[!example] Claude supports LSP now</p>
<p>Like code editors, Claude also supports LSP (Language Server Protocol). This lets Claude read code more efficiently, doing lookups by jumping to references or definitions instead of searching its way through the code. It might also understand the code better, as it has a language server to use.</p>
</blockquote>
<h2 id="conclusion">Conclusion</h2>
<p>Building this tool reminded me again how powerful DuckDB and MotherDuck are. Together they are a Swiss Army knife of a database that can handle unique tasks, and they simplify my note-taking by providing a serverless database for querying my embeddings.</p>
<p>Now I have a powerful tool to search for related notes when I need to solve a problem, or to find relevant notes in my own second brain. The hidden connections this tool surfaces are valuable only because they&rsquo;re my connections, my thinking, not just crawled information on the internet. And not only that, I can even provide a minimal but useful web app for you to search my public notes, too.</p>
<p>As for the AI agents that helped build it: they got me there faster, but only because I stayed in the loop. Let them run without direction, and you&rsquo;ll get a thousand lines solving the wrong problem. To me, the &ldquo;human&rdquo; architect is still needed.</p>
<hr>
<p><strong>Other implementations</strong> I have collected over the years or came across while building this that might be helpful if you want to build something similar.</p>
<p>If you need to create embeddings for many more files, follow the <a href="https://blog.brunk.io/posts/similarity-search-with-duckdb/" target="_blank" rel="noopener noreffer">Using DuckDB for Embeddings and Vector Search</a> article, which creates embeddings for 2.85M Wikipedia articles on the GPU. The author uses GPU acceleration and batch inserts via Arrow.</p>
<p>Some more links and repos I found interesting:</p>
<ul>
<li><strong>Scalable Embeddings &amp; Vector Search</strong>
<ul>
<li><a href="https://blog.brunk.io/posts/similarity-search-with-duckdb/" target="_blank" rel="noopener noreffer">Using DuckDB for Embeddings and Vector Search</a>: Tutorial on GPU-accelerated vector search that created embeddings for 2.85M Wikipedia articles using Arrow batch inserts and HNSW indexing.</li>
</ul>
</li>
<li><strong>Local-First Search Tools for Markdown</strong>
<ul>
<li><a href="https://github.com/tobi/qmd" target="_blank" rel="noopener noreffer">qmd</a>: Tobias Lütke&rsquo;s CLI search engine combining BM25, vector search, and LLM re-ranking—all local via Ollama, works with plain markdown (no wikilinks needed).</li>
</ul>
</li>
<li><strong>Obsidian AI Assistants</strong>
<ul>
<li><a href="https://github.com/logancyang/obsidian-copilot" target="_blank" rel="noopener noreffer">Obsidian Copilot</a>: A popular Obsidian AI plugin (6.1k+ stars) with vault chat, agent mode, and image/PDF/web processing—no index required for basic search.</li>
<li><a href="https://www.youtube.com/watch?v=NSoKRYNlOls" target="_blank" rel="noopener noreffer">Chat with Your ENTIRE Obsidian Vault OFFLINE (YouTube)</a>: Video walkthrough of offline Obsidian vault chat with Claude 3 integration.</li>
</ul>
</li>
<li><strong>RAG Frameworks &amp; Libraries</strong>
<ul>
<li><a href="https://github.com/QuivrHQ/quivr" target="_blank" rel="noopener noreffer">Quivr</a>: YC-backed opinionated RAG framework (38.6k+ stars) supporting any LLM, any vectorstore, and any file type with YAML-configured workflows.</li>
<li><a href="https://github.com/traversaal-ai/lennyhub-rag" target="_blank" rel="noopener noreffer">LennyHub RAG</a>: Complete RAG implementation on 297 podcast transcripts with knowledge graph extraction, Qdrant storage, and interactive network visualization.</li>
</ul>
</li>
<li><strong>AI-Assisted Development in Production</strong>
<ul>
<li><a href="https://www.linkedin.com/posts/escoo_ive-been-writing-99-of-my-code-at-airbnb-activity-7419777912096120832-f4fh" target="_blank" rel="noopener noreffer">Esco Obong on AI Coding at Airbnb (LinkedIn)</a>: Airbnb engineer shares writing 99% of code with LLMs, noting that code is &ldquo;only a small part of the actual work.&rdquo;</li>
</ul>
</li>
<li><strong>My List of Obsidian Related RAGs</strong>: <a href="https://www.ssp.sh/brain/second-brain-assistant-with-obsidian-notegpt" target="_blank" rel="noopener noreffer">Second Brain Assistant with Obsidian</a></li>
</ul>
<hr>
<pre class=""><em>Full article published at <a href="https://motherduck.com/blog/obsidian-rag-duckdb-motherduck/" target="_blank" rel="noopener noreferrer">MotherDuck.com</a> - written as part of <a href="/services">my services</a></em></pre>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Just a Really Very Intelligent System from Iron Man&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Also check out <a href="https://www.spicytakes.org/" target="_blank" rel="noopener noreffer">Spicy Takes</a> with lots of quotes from popular blogs, that get rated by their spiciness.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
</item>
<item>
    <title>Arch Linux (Omarchy) — 8 Months Later: The Good, the Bad, and the Fixable</title>
    <link>https://www.ssp.sh/blog/linux-omarchy-the-good-bad-and-fixable/</link>
    <pubDate>Tue, 10 Feb 2026 21:31:17 &#43;0100</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/linux-omarchy-the-good-bad-and-fixable/</guid><enclosure url="https://www.ssp.sh/blog/linux-omarchy-the-good-bad-and-fixable/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>This is a follow-up to part 1, <a href="https://www.ssp.sh/blog/macbook-to-arch-linux-omarchy/" target="_blank" rel="noopener noreffer">Switching macOS to Arch Linux with Omarchy</a>, where I documented my first months with Arch Linux and Omarchy after 15 years on macOS and using Windows on and off at work since 2003.</p>
<p>Back then, I had a checklist of basics I needed before I could commit to Linux as a daily driver: Obsidian, a Raycast-like launcher for fuzzy finding files and folders, screenshots (Snagit), daylight adjustment (f.lux), calendar events in the top bar. Those were quick wins.</p>
<p>Eight months later, I&rsquo;ve gone through many more challenges and learnings. In this post, I&rsquo;ll share which apps replaced my heavily integrated <a href="https://www.youtube.com/watch?v=sStKFOwNaSM" target="_blank" rel="noopener noreffer">macOS workflow</a>, what my <a href="https://www.youtube.com/watch?v=XOp8lngtmPg" target="_blank" rel="noopener noreffer">Omarchy workflow</a> looks like now, and — honestly — what still doesn&rsquo;t quite work.</p>
<h2 id="apps-that-replaced-my-macos-apps-on-linux">Apps that Replaced My macOS Apps on Linux</h2>
<p>Let&rsquo;s start with the apps I use now and how my workflow changed on Linux.</p>
<p>The list below starts with the Raycast replacement, which was integrated into my whole workflow (file search, calculator, emojis), and continues with the calendar, daylight gamma correction for night sessions, a PDF tool that replaces macOS Preview, screen sharing with the Linux window picker, and much more.</p>
<p>It continues with running Windows on Linux through a simple install toggle and finding the right hardware, before I wrap up with a conclusion on these first months of using Linux full time, for my business and privately.</p>
<h3 id="app-launcher-and-raycast-replacement-fuzzy-search-file-search-clipboard-math-and-so-on">App Launcher and Raycast Replacement: Fuzzy Search, File Search, Clipboard, Math, and so on</h3>
<p>One of the first apps I had to replace, and one that many Mac users rely on, was <strong>Raycast</strong>. It&rsquo;s an app I couldn&rsquo;t live without, not only for the fuzzy finder but also for quick calculations, file search, and the clipboard manager.</p>
<p>With <strong>Walker Launcher</strong> I found the perfect replacement, which has all of this included and works like a charm.</p>













  
<figure><a target="_blank" href="/blog/linux-omarchy-the-good-bad-and-fixable/img_Walker_launcher_1760944046142.webp" title="">

</a><figcaption class="image-caption">Functions of Walker available with <code>/</code> | See my <a href="https://x.com/sspaeti/status/1979916427583742344" target="_blank" rel="noopener noreffer">Tweet</a> for more information.</figcaption>
</figure>
<p><strong>Search file content</strong>, Spotlight-style: Walker finds files and shows a built-in preview.</p>
<p>From there, I open the containing folder or the file itself with the default program. This is how I search and find anything, instead of manually browsing through a file explorer. Walker&rsquo;s built-in search finds any file within seconds (before I found Walker, I built <a href="https://github.com/sspaeti/dotfiles/blob/master/hypr/.config/hypr/sspaeti/fuzzy-file-content.sh" target="_blank" rel="noopener noreffer">my own one</a>).</p>
<p><strong>Emojis</strong> quick search. It comes built into Walker too, but I have my own script so I can find emojis faster, as I can change the search term. Very <a href="https://github.com/sspaeti/dotfiles/blob/master/hypr/.config/hypr/sspaeti/emoji-fuzzy.sh" target="_blank" rel="noopener noreffer">simple, but powerful</a>.</p>
<p><strong>Clipboard managers</strong>: there <a href="https://github.com/savedra1/clipse" target="_blank" rel="noopener noreffer">are</a> <a href="https://github.com/sentriz/cliphist" target="_blank" rel="noopener noreffer">several</a>, but Walker comes with one built in too, including <strong>search</strong> and <strong>image preview</strong>:</p>













  
<figure><a target="_blank" href="/blog/linux-omarchy-the-good-bad-and-fixable/img_Switched%20from%20macOS%20to%20Linux-%206%20months%20in_1770742102558.webp" title="">

</a><figcaption class="image-caption">Clipboard on opening, with search and image preview.</figcaption>
</figure>
<p>Other dedicated clipboard managers are <a href="https://github.com/sentriz/cliphist" target="_blank" rel="noopener noreffer">cliphist</a> and <a href="https://github.com/savedra1/clipse" target="_blank" rel="noopener noreffer">Clipse</a>. There are also other launchers for Linux, such as the Raycast-compatible <a href="https://github.com/ByteAtATime/flare" target="_blank" rel="noopener noreffer">flare</a>, Rofi, and many more.</p>
<h3 id="keyboard-shortcuts-and-quick-symbols">Keyboard Shortcuts and Quick Symbols</h3>
<p>I use <a href="https://github.com/jtroo/kanata" target="_blank" rel="noopener noreffer">Kanata</a> for advanced keyboard features: switching between my keyboards and use cases such as turning CAPS LOCK into a modifier for vim-like movements. I use <code>caps + hjkl</code> to move left, down, up, and right with the respective arrow keys, as almost all programs work with arrow keys. The F1-F12 keys work the same way, with <code>caps+1</code> for F1 and so on.</p>
<p>For simple replacements, I use XCompose to write umlauts (<code>äöü</code>), special symbols (<code>—«»</code>), and more. On macOS I used Karabiner-Elements heavily, and Kanata replaced it for me; see my configs at <a href="https://github.com/sspaeti/dotfiles/blob/master/kanata/.config/kanata/kinesis.kbd" target="_blank" rel="noopener noreffer">dotfiles.ssp.sh/kanata</a>.</p>
<h3 id="backups-and-data-sync">Backups and Data Sync</h3>
<p>Time Machine on macOS was great. I also used sync.com for Dropbox-like sync on macOS. Neither worked on Linux. So I switched to <a href="https://filen.io/" target="_blank" rel="noopener noreffer">Filen</a>, which has a similar setup, stores the data encrypted, and is hosted in Germany. I&rsquo;m using Stow for all my dotfiles stored in Git. It&rsquo;s great, check them out at <a href="https://dotfiles.ssp.sh" target="_blank" rel="noopener noreffer">dotfiles.ssp.sh</a>.</p>
<p>I also back up my images, personal documents, and scripts with rsync scripts to my homeserver and an encrypted drive on Vultr. See more in <a href="/blog/self-host-self-independence/" rel="">Tech Independence</a>.</p>
<p>I also looked at NextCloud for hosting it myself, but for now I just need something that works. As Filen is an Electron app, it just works everywhere.</p>
<h3 id="calendar">Calendar</h3>
<p>A calendar is one thing everyone uses. I used Cron Calendar (later acquired by Notion) a lot and wanted a good replacement for Linux, though I often use <a href="https://calendar.google.com" target="_blank" rel="noopener noreffer">calendar.google.com</a> on the web.</p>
<p>The best replacement I found is <a href="https://morgen.so/" target="_blank" rel="noopener noreffer">Morgen</a> (built in Switzerland), which is made for Linux first. It has a great preview inside the top bar and built-in time zones.</p>
<p>Time zones can also be shown by hovering over the time on the left.</p>
<h3 id="daylight-and-gamma-light-adjustment">Daylight and Gamma light Adjustment</h3>
<p>For sunlight adjustment like <a href="https://justgetflux.com/" target="_blank" rel="noopener noreffer">f:lux</a>, Omarchy now comes with one included, but I also used <code>wlsunset</code> with <code>wlsunset -l 47.4095 -L 8.5514 -t 3500 -T 6500</code>, which does the job well.</p>
<h3 id="hibernation-and-suspending-computer">Hibernation and Suspending Computer</h3>
<p>Hibernation and suspend are things you take for granted on other operating systems. On Linux it&rsquo;s trickier, and it didn&rsquo;t work out of the box. In the meantime, it comes built into Omarchy.</p>
<h3 id="presenting-with-external-projectors-and-screens">Presenting with External Projectors and Screens</h3>
<p>This is about presenting and how external projectors and screens are recognized. I have only given one presentation so far, but I tried many external monitors, and Hyprland (which is responsible for recognizing screens, see <code>hyprctl monitors</code>) works just like macOS by auto-recognizing them.</p>
<p>Even better, I have shortcuts to make them automatically align at the right position. Or I use <a href="https://github.com/erans/hyprmon" target="_blank" rel="noopener noreffer">hyprmon</a> (one of the great TUIs) when I need to do it manually.</p>
<h3 id="pdf-merger">PDF Merger</h3>
<p><a href="https://github.com/pdfarranger/pdfarranger" target="_blank" rel="noopener noreffer">PDF Arranger</a> merges multiple PDFs into one and rotates pages of a PDF. It&rsquo;s open source and better than macOS Preview.</p>
<h3 id="need-anything-more-just-build-vibe-code-it-yourself">Need Anything More? Just Build (vibe code) it Yourself</h3>
<p>If you need something, you just build it with Claude Code and integrate it into your laptop.</p>
<p>No need to ask Mr. Bill Gates or Tim Cook to integrate it. For example, I saw someone with an <strong>edge light for video calls</strong> and liked it, not that I really needed it (but one day it might be helpful 😀). So I built a <a href="https://github.com/sspaeti/wayland-edge-light-videocalls" target="_blank" rel="noopener noreffer">small custom tool</a> that works out of the box with Hyprland for my future video calls.</p>













  
<figure><a target="_blank" href="/blog/linux-omarchy-the-good-bad-and-fixable/img_Switched%20from%20macOS%20to%20Linux-%206%20months%20in_1770735368009.webp" title="">

</a><figcaption class="image-caption">Check it out at <a href="https://github.com/sspaeti/wayland-edge-light-videocalls" target="_blank" rel="noopener noreffer">wayland-edge-light-videocalls</a></figcaption>
</figure>
<h3 id="screen-sharing-works-differently">Screen Sharing Works Differently</h3>
<p>Screen sharing is not as straightforward, because you get a very dated dialog to pick your windows, outputs, or regions. On top of that, you usually need to pick twice, and only the second pick counts. This was very confusing, and I documented a fix and how it looks in my note on Screen Sharing on Wayland (Hyprland) with Chrome.</p>
<p>But with the latest updates of Omarchy, that has also been solved and it works out of the box and looks beautiful now:</p>













  
<figure><a target="_blank" href="/blog/linux-omarchy-the-good-bad-and-fixable/img_Switched%20from%20macOS%20to%20Linux-%206%20months%20in_1770754480836.webp" title="">

</a><figcaption class="image-caption">Compared to the default DOS screen picker, this is beautiful, or just modern.</figcaption>
</figure>
<h3 id="others-virtual-envs-remote-desktop-and-adding-printers">Others: Virtual Envs, Remote Desktop and Adding Printers</h3>
<p>For <strong>virtual environments</strong>, I&rsquo;m using Mise, as it comes pre-installed on Omarchy. Before, I used <code>asdf</code>.</p>
<p>Remote desktop to virtual desktops works great with <code>xfreerdp3</code>, which connects well to the Windows VM.</p>
<p>Need to import images from a camera? It&rsquo;s not as UI-driven as on Mac or Windows, but amazingly simple and fast with gphoto; see my note on Import Files on Arch Linux (gphoto).</p>
<p>Adding printers might be needed at some point. This can be done UI-driven with system-config-printer, the CUPS configuration tool. Or do it the terminal way with <code>lpadmin</code>; see my note on Adding Printer on Linux for more information.</p>
<h2 id="running-microsoft-windows-inside-linux">Running Microsoft Windows inside Linux</h2>
<p>A big one is to run another operating system, in this case Windows, as part of your OS. E.g. I use Microsoft Office often, so I can quickly start up Windows with Office when needed.</p>
<p>The best part is that it uses only Docker, meaning easy setup, separated from my configs. It&rsquo;s a single-click setup taking 15 seconds.</p>
<p>Installing and integrating seamlessly in a Docker VM works <a href="https://learn.omacom.io/2/the-omarchy-manual/100/windows-vm" target="_blank" rel="noopener noreffer">superbly with Omarchy</a>. I submitted a <a href="https://github.com/basecamp/omarchy/pull/1333" target="_blank" rel="noopener noreffer">PR to Omarchy</a> with the first version of the built-in integration; it was merged into core and is now available to everyone.</p>













  
<figure><a target="_blank" href="/blog/linux-omarchy-the-good-bad-and-fixable/windows-omarchy-vm.webp" title="">

</a><figcaption class="image-caption"><a href="https://x.com/sspaeti/status/1978823118270390642" target="_blank" rel="noopener noreffer">Tweet</a> and thanks from <a href="https://x.com/dhh/status/1978826791792918724" target="_blank" rel="noopener noreffer">DHH</a> himself.</figcaption>
</figure>
<p>It&rsquo;s now easier to run Windows on Linux than natively on a Windows machine 😉.</p>
<div class="details admonition info open">
        <div class="details-summary admonition-title "><i class="icon admonition-icon icon-info"></i>Many ways of integrating: Omarchy uses Dockur<i class="details-icon  admonition-icon admonition-icon-arrow-right"></i></div>
        <div class="details-content">
            <div class="admonition-content"><p>There are many options:</p>
<ul>
<li><a href="https://github.com/dockur/windows" target="_blank" rel="noopener noreffer">dockur/windows</a>: Windows inside a Docker container (used in Omarchy).</li>
<li><a href="https://github.com/winapps-org/winapps" target="_blank" rel="noopener noreffer">Winapps</a>: Run Windows apps such as Microsoft Office/Adobe in Linux (Ubuntu/Fedora) and GNOME/KDE as if they were a part of the native OS, including Nautilus integration.</li>
<li><a href="http://winboat.app/" target="_blank" rel="noopener noreffer">WinBoat</a>: an easier alternative to Winapps that runs Windows apps on Linux with seamless integration.</li>
</ul></div>
        </div>
    </div>
<h2 id="finding-the-right-hardware-the-reasons-why-not-to-switch">Finding the Right Hardware: The Reasons why not to Switch</h2>
<p>Everyone knows the stereotypes about Linux. WiFi won&rsquo;t work, Bluetooth won&rsquo;t connect, constant interruptions. And beyond that, there&rsquo;s the hardware fear, that you simply can&rsquo;t match what Apple offers. A common sentiment:</p>
<blockquote>
<p>I&rsquo;m currently with this dilemma. I&rsquo;m an experienced Linux user, but over the years gravitated towards Macs (especially M-series) and unfortunately they do make better hardware, at least for my use 🙈 I just <em>can&rsquo;t</em> move to a machine with a much worse battery life, display, webcam, speakers etc. I know some good Linux-friendly laptops exist, but it&rsquo;s still a downgrade, for me. If someone made better hardware, I&rsquo;d probably jump over right away. <a href="https://x.com/DenLoginoff/status/2021079777608614290" target="_blank" rel="noopener noreffer">Tweet</a></p>
</blockquote>
<p>I thought the same. The great keyboard, camera, speakers, trackpad, battery. Apple just nails the whole package. But what I found is that I didn&rsquo;t actually have to downgrade.</p>
<p>I started with a <strong>Lenovo ThinkBook 14 G7 ARP (AMD)</strong> with 32 GB RAM. Great build quality, beautiful look, and the keyboard surprised me, with much more travel and grip than the MacBook. See more on <a href="https://www.ssp.sh/blog/macbook-to-arch-linux-omarchy/#choosing-the-hardware" target="_blank" rel="noopener noreffer">Part 1</a>.</p>
<p>Once I realized this would become my daily driver, I searched for something more powerful for data engineering work and landed on a <strong>Tuxedo InfinityBook Pro 14 Gen10 AMD</strong> with 128 GB (!!) RAM, an AMD Ryzen AI 9 HX 370, and AMD Radeon 890M.</p>













  
<figure><a target="_blank" href="/blog/linux-omarchy-the-good-bad-and-fixable/img_Switched%20from%20macOS%20to%20Linux-%206%20months%20in_1770739390246.webp" title="">

</a><figcaption class="image-caption">My Tuxedo InfinityBook Pro 14 Gen10 AMD, with 128 GB (!!) RAM, AMD Ryzen AI 9 HX 370 , and AMD Radeon 890M.</figcaption>
</figure>
<p>First impressions: super smooth, even snappier than the Lenovo. Obsidian and other apps feel a tiny bit faster. The 3K 500-nit display is stunning. Crisp, bright, better than my external 4K monitor. And it&rsquo;s <strong>matte</strong>, which I&rsquo;d forgotten I actually prefer. It works outside, no glare. The Lenovo&rsquo;s anti-glare screen was equally great in that regard.</p>
<p>The keyboard has less travel than the Lenovo ThinkBook (which I really loved), more like a MacBook, which is fine but feels slightly cheap. I mostly use external keyboards anyway. The fingerprint reader is also missing, which I&rsquo;d grown to love on both MacBooks and the Lenovo, where it worked flawlessly on Linux. The trackpad is smooth and great to work with daily, though palm detection caused some cursor jumping in the first days, not as good as Apple&rsquo;s, but perfectly usable.</p>
<p>Battery life was a pleasant surprise. My Tuxedo (80Wh) delivered battery life comparable to my M1 Max MacBook. I spent a whole afternoon in the library and it was still above 70%.</p>
<p>With Omarchy, everything just worked out of the box. No WiFi or Bluetooth issues, speakers with sound, all good.</p>
<p>But it&rsquo;s not perfect, by far. I get some <a href="https://www.reddit.com/r/tuxedocomputers/comments/17pzcet/strange_popping_sounds_coming_from_my_laptop/" target="_blank" rel="noopener noreffer">strange popping sounds from the laptop</a>, mostly after hibernating once or twice. Not sure why, and it probably shouldn&rsquo;t happen. There are many other notebooks and desktops for Linux to choose from, Framework being one, but choosing is still tricky, as chipset and GPU support on Linux matter, and you want something state-of-the-art.</p>
<p>Another side effect of not having an expensive MacBook, as <a href="https://x.com/KevinNaughtonJr/status/2021009900097483120" target="_blank" rel="noopener noreffer">Kevin says</a>:</p>
<blockquote>
<p>My favorite part aside from customization is just that i don&rsquo;t care about my machine at all: it gets lost? breaks? stolen? i get a new machine, run 1 command and everything is back exactly as i left it. Macs on the other hand are expensive to buy and repair which makes people worry and worry = less peace of mind.</p>
</blockquote>
<blockquote>
<p><strong>Follow the evolution on Social Media</strong></p>
<p>I documented the whole story in threads on <a href="https://x.com/sspaeti/status/1942502383923134464" target="_blank" rel="noopener noreffer">Twitter</a> and on <a href="https://bsky.app/profile/ssp.sh/post/3lug5oijnjc22" target="_blank" rel="noopener noreffer">Bluesky</a>; follow these to see the history and the events in the order they happened.</p>
</blockquote>
<h2 id="conclusion-of-using-linux-for-8-months">Conclusion of Using Linux for 8+ Months</h2>
<p>After using Windows since 2003 and macOS for more than 15 years, how do I feel after 8 months on Linux?</p>
<p><strong>Things mostly work great</strong>, but need a little tinkering to begin with, or work differently. The biggest difference, which I like a lot, is a more terminal-native workflow: closer to the command line, using lots of TUIs.</p>
<p><strong>When I started</strong>, I just wanted the same as I had on macOS. After getting familiar with the new environment, with all the small utilities, tools and programs Linux has, I got many more tools to choose from. Sometimes much better, though terminal-based, but fast and direct. Sometimes you obviously miss a tool that has no replacement (for me still Snagit).</p>
<p>Besides the obvious (terminal-native, best-in-class Tiling Window Manager with Hyprland, no-latency navigation), there&rsquo;s something harder to put into words. The OS is what we use every day, so when you can quickly fix or change a small thing to give you more joy or more productivity, it might just put a smile on your face whenever you use that feature. At least it still does for me. And since all my configs live in <a href="https://dotfiles.ssp.sh" target="_blank" rel="noopener noreffer">dotfiles</a> and my data syncs externally via Filen and Obsidian, setting up a new machine is a single command.</p>
<h3 id="what-i-thought-id-miss-vs-what-i-actually-miss">What I Thought I&rsquo;d Miss vs. What I Actually Miss</h3>
<p>Before I switched, I thought I&rsquo;d miss all my <a href="https://setapp.com/" target="_blank" rel="noopener noreffer">Setapp</a> apps, my MacBook hardware, and the stability of things just working. Most of my apps work on my new machine, and software-wise even better, so I&rsquo;m still quite happy to have made the switch, even more so watching macOS get slower with each install without real benefit (looking at the Liquid Glass update) and Windows stuffing Copilot into Notepad and recording your screen with Recall.</p>
<p>What I <strong>actually</strong> miss are simpler things: the <strong>stability</strong> of taking calls anywhere, anytime, with Apple reliability and the built-in mic, speakers, and camera. For example, a crash because the GPU fails<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> after hibernation, right before an important meeting. Hibernation and suspend were quite a battle to get working, but they seem to just work now.</p>
<p>What I like about Linux: it might not work out of the box for every laptop or every program, but you can actually fix it, and from that moment you know the problem, you learned something about computers, and the error will not appear again. Unlike other operating systems that change stuff you set in settings for a reason, only to learn that certain updates turned that checkmark back on.</p>
<h3 id="tinkering-and-troubleshooting-not-everything-just-works">Tinkering and Troubleshooting: Not Everything Just Works</h3>
<p>Without Claude Code, I probably wouldn&rsquo;t have made the switch, or I would have made it but probably wouldn&rsquo;t have stayed. When something happens, e.g., a crash out of nowhere, I just open Claude and say: I had a crash, I am running Arch Linux, please check the logs what went wrong. And what I get is a full analysis of what went wrong, with some fixes and suggestions. Knowing that Linux has 100 different setups and different drivers for every piece of hardware, this is a non-negotiable lifesaver.</p>
<p>With Linux you also have to troubleshoot bugs, but at least it&rsquo;s free software and open source, and honestly, they seem even less frequent than with commercial, paid products these days.</p>
<p>If you&rsquo;ve read this far, thank you. What&rsquo;s your experience, are you thinking about switching, or already on Linux? Let me know anywhere on <a href="https://bsky.app/profile/ssp.sh/post/3melam6gxrf2m" target="_blank" rel="noopener noreffer">Bluesky</a>, or <a href="https://x.com/sspaeti/status/2021528330324086934" target="_blank" rel="noopener noreffer">Twitter</a>.</p>
<hr>
<p>Again, if you want to watch a full video workflow, check my short video about it: <a href="https://www.youtube.com/watch?v=XOp8lngtmPg" target="_blank" rel="noopener noreffer">Omarchy Arch Tiling Window Workflow (macOS comparison) - YouTube</a>. Or <a href="https://www.youtube.com/watch?v=sStKFOwNaSM" target="_blank" rel="noopener noreffer">my macOS workflow</a> as a comparison and what I switched from.</p>
<h2 id="appendix-troubleshooting-and-things-that-didnt-work-so-well-or-i-had-already-fixed">Appendix: Troubleshooting and Things That Didn&rsquo;t Work So Well, or I Had Already Fixed</h2>
<p><strong>GPU Crashes (AMD Radeon 890M).</strong> This was the biggest recurring issue. The GPU&rsquo;s MES (Micro Engine Scheduler) would become unresponsive and crash the entire system, triggered by Brave browser, Google Meet video calls, and Kdenlive video encoding. The root cause is that the Radeon 890M (gfx1150/RDNA 3.5) is still very new, and driver support on bleeding-edge kernels (6.17–6.18) is immature. Solutions included disabling hardware acceleration in Brave (<code>brave://settings/system</code>), adding kernel parameters (<code>amdgpu.gpu_recovery=1 amdgpu.noretry=0 amdgpu.ip_block_mask=0xfffff7ff</code>), and considering the LTS kernel as fallback. The community is tracking this on <a href="https://community.frame.work/t/amd-gpu-mes-timeouts-causing-system-hangs-on-framework-laptop-13-amd-ai-300-series/71364" target="_blank" rel="noopener noreffer">Framework forums</a> and <a href="https://gitlab.freedesktop.org/drm/amd/-/issues/3067" target="_blank" rel="noopener noreffer">AMD&rsquo;s GitLab</a>.</p>
<p><strong>Keyboard Freezing After Suspend (Tuxedo).</strong> The internal keyboard would stop working after suspend/resume cycles due to a firmware bug in the keyboard controller (i8042). Fixed by adding <code>i8042.nomux=1 i8042.reset=1 i8042.noloop=1 i8042.nopnp=1</code> to the kernel command line in <code>/etc/default/limine</code> and regenerating the UKI with <code>sudo mkinitcpio -P</code>. Shared the solution on <a href="https://sh.reddit.com/r/tuxedocomputers/comments/1ndq7vw/comment/ne5kjob/" target="_blank" rel="noopener noreffer">Reddit</a>.</p>
<p><strong>Hibernation Not Resuming.</strong> After suspend-then-hibernate (triggered by closing the lid), the laptop wouldn&rsquo;t resume and required a fresh boot. The cause was missing <code>resume=</code> and <code>resume_offset=</code> kernel parameters. Omarchy&rsquo;s hibernation setup script added the mkinitcpio hook but never added the actual kernel parameters. Fixed by calculating the swap offset (<code>sudo btrfs inspect-internal map-swapfile -r /swap/swapfile</code>) and adding <code>resume=/dev/mapper/root resume_offset=&lt;offset&gt;</code> to <code>/etc/default/limine</code>. Documented the fix in <a href="https://github.com/basecamp/omarchy/issues/4259#issuecomment-3804954054" target="_blank" rel="noopener noreffer">this Omarchy issue</a>.</p>
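<p>To check whether a machine has the same gap, the kernel command line can be inspected for the two parameters. The following is a minimal Python sketch, not part of Omarchy; the function names <code>missing_resume_params</code> and <code>read_kernel_cmdline</code> are made up for illustration, and it assumes the running command line is readable at <code>/proc/cmdline</code>.</p>

```python
# Hypothetical helper names; this is an illustrative sketch, not Omarchy code.
RESUME_PARAMS = ("resume", "resume_offset")


def missing_resume_params(cmdline):
    """Return the resume-related kernel parameters absent from a command line."""
    # Collect the names of all key=value tokens on the command line.
    present = {token.split("=", 1)[0] for token in cmdline.split() if "=" in token}
    return [name for name in RESUME_PARAMS if name not in present]


def read_kernel_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (assumed path)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()
```

<p>Calling <code>missing_resume_params(read_kernel_cmdline())</code> after a reboot should return an empty list once both parameters are in place.</p>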
<p><strong>Thermal Throttling (Lenovo).</strong> The Lenovo ThinkBook would hit 99°C and become unusable during video calls. Turned out the bottom intake vents were blocked when the laptop sat flat on a desk. Simply elevating the laptop dropped temps to 73–77°C and performance was completely fine, even running stress tests while screen sharing. A laptop stand solved it permanently.</p>
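<p>Temperatures like these can be watched from a script. On many Linux systems the kernel exposes sensor readings as millidegrees Celsius in files such as <code>/sys/class/thermal/thermal_zone0/temp</code>. Which zone maps to the CPU varies by machine, so the path and the 95°C warning limit in this small Python sketch are assumptions for illustration only, and the function names are hypothetical.</p>

```python
def millidegrees_to_celsius(raw):
    """Convert a sysfs temperature reading (millidegrees Celsius) to degrees."""
    return int(raw.strip()) / 1000.0


def is_throttling_risk(raw, limit_celsius=95.0):
    """Flag a reading at or above an assumed warning limit (95 is illustrative)."""
    return millidegrees_to_celsius(raw) >= limit_celsius


def read_zone(path="/sys/class/thermal/thermal_zone0/temp"):
    """Read one thermal zone; the zone number is an assumption and varies."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

<p>With this, <code>is_throttling_risk(read_zone())</code> would have flagged the 99°C readings, while the 73–77°C range after elevating the laptop stays below the limit.</p>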
<p><strong>WiFi Speed Drops (Tuxedo, Intel AX210).</strong> Speeds dropped to 2–72 Mbps after a system update. Root cause was a bug in <code>linux-firmware-intel</code> version 20251125 that caused the Intel AX210 card to negotiate very low RX bitrates. Fixed by downgrading to the October firmware (<code>sudo pacman -U /var/cache/pacman/pkg/linux-firmware-intel-20251021-1-any.pkg.tar.zst</code>), disabling WiFi power save permanently via a systemd service, and adding <code>IgnorePkg = linux-firmware-intel</code> to <code>/etc/pacman.conf</code> until a proper fix ships.</p>
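<p>A low RX bitrate like this can be spotted without a speed test by reading the link status. The Python sketch below parses text in the style of <code>iw dev IFACE link</code> output; the exact <code>rx bitrate:</code> line format is an assumption based on typical output and may differ between <code>iw</code> versions, the function names are hypothetical, and the 100 MBit/s threshold is an arbitrary illustrative value.</p>

```python
import re


def parse_rx_bitrate(iw_link_output):
    """Extract the RX bitrate in MBit/s from link-status text, or None if absent."""
    match = re.search(r"rx bitrate:\s*([\d.]+)\s*MBit/s", iw_link_output)
    return float(match.group(1)) if match else None


def looks_degraded(iw_link_output, threshold_mbps=100.0):
    """Report whether the negotiated RX bitrate is below an assumed threshold."""
    rate = parse_rx_bitrate(iw_link_output)
    return rate is not None and threshold_mbps > rate
```

<p>Feeding it the captured output before and after the firmware downgrade gives a quick way to confirm that the negotiated rate actually recovered.</p>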
<p><strong>Keyring/Brave Re-login on Every Boot.</strong> Brave asked for login credentials after every reboot because the gnome-keyring file kept getting corrupted. This was caused by SDDM autologin. Without entering a password at login, PAM can&rsquo;t unlock the keyring. The ultimate fix was launching Brave with <code>--password-store=basic</code> in the autostart config. Documented in <a href="https://github.com/basecamp/omarchy/discussions/3523#discussioncomment-15286162" target="_blank" rel="noopener noreffer">this Omarchy discussion</a>.</p>
<p><strong>Sudoers Misconfiguration.</strong> While adding a NOPASSWD rule for a keyboard-switching script, I accidentally broke sudo access entirely. Had to boot into recovery/single-user mode to fix <code>/etc/sudoers</code>. Lesson learned: always have <code>sspaeti ALL=(ALL) ALL</code> as a separate line and be very careful with <code>visudo</code>. Documented the recovery process in an emergency recovery guide.</p>
<p><strong>Screen Recording VFR Issues.</strong> Omarchy&rsquo;s screen recorder (<code>gpu-screen-recorder</code>) produces variable frame rate videos by default, which Kdenlive can&rsquo;t edit properly. The fix is adding <code>-fm cfr</code> to the recording command. Additionally, Kdenlive&rsquo;s VAAPI hardware transcoding crashed the GPU (same MES issue), so software encoding (<code>libx264</code>) is needed for now.</p>
<p><strong>CPU Fan Noise When Plugged In.</strong> The system switched to &ldquo;performance&rdquo; CPU governor when plugged in, causing constant full-speed fans even at low load. Fixed via <code>powerprofilesctl set balanced</code> or through Omarchy&rsquo;s built-in power settings menu.</p>
<!-- ### Building yourself: Fuzzy image Finder -->
<!-- Fuzzy find my images, a tool I built as I couldn't find a replacement for Snagit. -->
<!-- [[Horizontal and Vertical Cut Out]] that Snagit provides does not work on Omarchy. I had a workaround with GIMP Horizontally and Vertically crop out, but now Editt is supporting it. [Editt](https://github.com/mirarr-app/editt) also supports horizontal cut out now. --> 
<!-- I use Satty for simple screenshotting, sometimes Figma for more advanced workflow, GIMP Horizontally and Vertically crop out for a workaround. I'm also using [Using FireShot](https://getfireshot.com/using.php#using) inside the browser for scrollable images. And there's a full list at [List of tools](https://wiki.archlinux.org/title/Screen_capture#Screenshot_software). Still trying to find the old Snagit workflow, but getting there. Note: If you are not yet on Linux, but on macOS or Windows, buy a one-time licence for Snagit and be happy ever after if you take screenshots. You can thank me later :). Ksnip has vertical and horizontal cut out too, but does not work on Wayland. -->
<!-- I built an image search first: [image-browser](https://github.com/sspaeti/dotfiles/tree/master/hypr/.config/hypr/sspaeti/image-browser), see <a href="img_Switched from macOS to Linux- 6 months in_1770740977506.webp">my image search tool</a>. --> 
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>In my case, I have a newer GPU, which suddenly shuts down because of not having all the fixes released in the drivers and software. Depending on your hardware, you might be more or less lucky.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
</item>
<item>
    <title>Why Coinbase and Pinterest Chose StarRocks: Lakehouse-Native Design and Fast Joins at Terabyte Scale</title>
    <link>https://www.ssp.sh/blog/starrocks-lakehouse-native-joins/</link>
    <pubDate>Mon, 09 Feb 2026 08:41:06 &#43;0200</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/starrocks-lakehouse-native-joins/</guid><enclosure url="https://www.ssp.sh/blog/starrocks-lakehouse-native-joins/featured-image.png" type="image/png" length="0" /><description><![CDATA[<p>Why is StarRocks gaining popularity among data engineers who need fast analytics on large-scale data? To find out, I did a deep dive on the companies actually using StarRocks in production, interviewing engineers and studying technical case studies from Coinbase, Pinterest, Fresha, Grab, TRM Labs, and Shopee. They all share a similar pattern: customer-facing analytics on Snowflake got too slow, and they needed sub-second query responses without heavy pre-denormalization in Flink or Spark.</p>
<p>Two questions interested me most. Why do joins seem to be faster with StarRocks than with other OLAP databases like ClickHouse, Druid, or Pinot? And how can it deliver fast responses even when data sits on cold storage like S3?</p>
<p>This article covers their answers, the architectural innovations behind StarRocks (colocated joins, caching, cost-based optimizer), and the tradeoffs you should know before evaluating it for your own analytics needs.</p>
<h2 id="introduction">Introduction</h2>
<p>The modern analytics challenge is to serve queries fast while the data sits in a data lake on S3, Cloudflare R2, or similar object storage. Usually, the advantages of data lakes (storing data easily without much care about synchronizing database schemas among tables or validating bad rows) come at the expense of query speed. This is acceptable if the data is for internal teams or non-critical use cases. But customers, as well as business and domain experts, usually don&rsquo;t want to wait minutes for a query result, only to add another filter or column and wait another few minutes.</p>
<p>This workflow is even less efficient with AI agents, which start by autonomously running queries to narrow down the domain for you. If these queries are slow, the interaction with an agent or chatbot will be even slower.</p>
<p>Enter StarRocks, capable of querying lakehouses - Iceberg, Delta Lake, Hudi, and Hive - in place, without moving data. That&rsquo;s the pitch. Let&rsquo;s see how that works.</p>
<h2 id="what-is-starrocks">What is StarRocks?</h2>
<p>But first, what is StarRocks? The definition and short version from their <a href="https://docs.starrocks.io/docs/introduction/StarRocks_intro/" target="_blank" rel="noopener noreffer">docs</a>:</p>
<blockquote>StarRocks is a next-gen, <strong>high-performance analytical data warehouse</strong> that enables real-time, multi-dimensional, and highly concurrent data analysis.<br><br>
StarRocks has an MPP architecture and is equipped with a fully vectorized execution engine, a columnar storage engine that supports real-time updates, and is powered by a rich set of features including a fully-customized cost-based optimizer (CBO), intelligent materialized view and more.<br><br>
StarRocks supports real-time and batch data ingestion from a variety of data sources. It also allows you to directly analyze data stored in data lakes with zero data migration.</blockquote>
<p>StarRocks started as a fork of <a href="https://github.com/apache/doris" target="_blank" rel="noopener noreffer">Apache Doris</a>, but <a href="https://forum.starrocks.io/t/faq-apache-doris-vs-starrocks/128" target="_blank" rel="noopener noreffer">claims</a> to have rewritten 90% of its code since then, mainly to improve performance, stability, usability, etc.</p>
<p>In the past three years, the StarRocks team has replaced the query optimizer with a brand new Cost Based Optimizer to eliminate de-normalization, implemented a Vectorized Query Engine to improve query performance, designed a Primary Key Data Model to better handle real-time analytics scenarios, released Intelligent Materialized Views to simplify data pipelines, and rolled out many other breakthroughs.</p>
<h3 id="is-it-a-real-time-analytics-database-olap-data-warehouse">Is it a Real-Time Analytics Database, OLAP, Data Warehouse?</h3>
<p>StarRocks is also an OLAP system with fast sub-second response times, focusing solely on analytics use cases, not transactional processing. But StarRocks is more than that, as it supports joins, so the terms lakehouse architecture and data warehouse come into play, potentially suiting BI use cases better.</p>
<p>StarRocks is compatible with the MySQL protocol, and with its built-in intelligent materialized views and newly designed cost-based optimizer (CBO), it&rsquo;s a powerful tool that should save you a lot of engineering time.</p>
<p>StarRocks promises one system that can power multiple analytical scenarios, reducing system complexity with Frontend Engine (FE) and Backend Engine (BE) nodes.</p>
<h3 id="who-is-using-starrocks">Who is Using StarRocks?</h3>
<p>There are some big names, some of which we will interview after this chapter. The biggest are Pinterest, Coinbase, Naver, Fresha, Lenovo, Expedia, Trip.com and <a href="https://www.youtube.com/playlist?list=PL0eWwaesODdjjEvyaupqunQjE5Ndy7-Ku" target="_blank" rel="noopener noreffer">many more</a>.</p>
<h2 id="why-starrocks">Why StarRocks?</h2>
<p>So with other real-time OLAP databases such as ClickHouse, Pinot, and Druid available, when do you choose StarRocks? What are the use cases?</p>
<p>First, let&rsquo;s see what real StarRocks users that I have interviewed say, before we analyze the technology decisions and techniques that make StarRocks a valid option.</p>
<h3 id="a-simplified-cluster-topology-with-support-for-fast-distributed-joins">A Simplified Cluster Topology with Support for Fast, Distributed Joins</h3>
<p>Key selling points are a simpler architecture, native fast distributed joins, intelligent Materialized Views that can refresh complex joins, and federation as a fast compute engine on top of data lakes and lakehouses with data on object storage.</p>
<p>We&rsquo;ll hear from Coinbase, Pinterest, and Fresha themselves about why they use it. And we&rsquo;ll go into joins, colocation, and caching mechanisms later.</p>
<p>But before that, let&rsquo;s understand the general architecture with two node types, Backend Engines (BE) and Frontend Engines (FE), and why it&rsquo;s perceived as simpler than others. The backend nodes can be both BEs and CNs (Compute Nodes). The backend nodes support two storage variants: one with local storage in the BE nodes and one with external storage such as S3/HDFS.</p>
<figure>
<a target="_blank" href="/blog/starrocks-lakehouse-native-joins/starrocks-shared-nothing-architecture.png" title="/blog/starrocks-lakehouse-native-joins/starrocks-shared-nothing-architecture.png">

</a><figcaption class="image-caption">Image from the docs: <a href="https://docs.starrocks.io/docs/introduction/Architecture/" target="_blank" rel="noopener noreffer">Architecture | StarRocks</a></figcaption>
</figure>
<p>In the <strong>shared-nothing</strong> architecture, each BE stores a portion of the data on its local storage. With <strong>shared-data</strong>, all data lives on object storage or HDFS, and each CN keeps only a cache on local storage. The default is shared-nothing, meaning direct access to local data on the BE node, but the future is shared-data mode, which is more cloud-native and reads directly from object storage. The convenient thing is that you choose where the data is stored based on your needs.</p>
<p>Keep this in mind while we go through more of the interviews and pros and cons. There are more details in the &ldquo;StarRocks Technology Decisions&rdquo; chapter, but let&rsquo;s first dive into the valuable insights from actual users and companies who explicitly chose this architecture, and why they did so.</p>
<h2 id="why-coinbase-chose-starrocks">Why Coinbase Chose StarRocks</h2>
<p>In this part, <a href="https://www.linkedin.com/in/ericsun/" target="_blank" rel="noopener noreffer">Eric Sun</a>, Head of Data Platform + Datastores at Coinbase, long-time LinkedIn Manager and experienced with data systems, gave me the pleasure of interviewing him and learning from his expertise. Below are his answers to questions regarding StarRocks at Coinbase, and anecdotes from his personal past.</p>
<p>Coinbase is the largest cryptocurrency exchange in America with over 100 million users and 200 cryptocurrencies. This means they process billions of transactions daily, even more so in short specific periods when the market is active and everyone wants to sell or buy.</p>
<h3 id="the-origin-story-for-starrocks-at-coinbase">The Origin Story for StarRocks at Coinbase</h3>
<p>That&rsquo;s why they started using StarRocks. Originally, 90% of the workloads were on Snowflake, but they needed a faster Operational Data Store (ODS) engine for crypto data services.</p>
<p>Other solutions such as TiDB were tested, but it was too disruptive to bring another transaction processing database into the company. Compared to Pinot, ClickHouse, and Druid, they ultimately chose StarRocks because of these strengths:</p>
<ul>
<li>Ingest with light transformation ⇨ pre-aggregate ⇨ analytics lifecycle is still too long/slow for other DWHs like Snowflake and Databricks</li>
<li>Query performance for ad-hoc and online services requires pre-aggregated, pre-warmed, and pre-cached engines like StarRocks, Doris, Trino, or ClickHouse. Basically <strong>balancing both fast data ingestion and join capability</strong>, but also near real-time data serving.</li>
<li>Costs of Snowflake and Databricks are both too high for the use case</li>
</ul>
<p>They quickly expanded their use of StarRocks to trade/exchange data, event/clickstream data, and as a Facebook Scuba alternative.</p>
<h3 id="the-join-question">The JOIN Question</h3>
<p>A key and distinctive advantage that StarRocks seems to have over its competitors is the ability to JOIN. Almost like a data warehouse, but with the speed of an OLAP database. I asked Eric how they use joins and if the feature lives up to its promises.</p>
<p>The question that interested me most was: &ldquo;Can you join and perform simple ETL without persisting data in an intermediate step (like a data mart)? Is that all done automatically with StarRocks? And if so, how can this be fast enough?&rdquo;</p>
<p>Because you can&rsquo;t overcome the laws of physics, reading from S3 is just slow. He said the two key techniques are <strong>hash-distributed partition + colocation for multiple <em>big</em> tables</strong> to join with minimal overhead. He says &ldquo;this is nothing new, but most other engines, including Snowflake, Databricks, and ClickHouse, have not incorporated these two simple-yet-effective traits&rdquo;.</p>
<p>He continued with &ldquo;S3 is slow, so frequently queried data chunks must be automatically cached to the BE (backend) nodes of StarRocks (after the warm-up queries) in the &lsquo;shared-data&rsquo; mode&rdquo;.</p>
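<p>The warm-up effect Eric describes can be pictured with a minimal Python sketch. This is not StarRocks code: the <code>ChunkCache</code> class, the LRU eviction policy, and the chunk names are assumptions made purely for illustration of a read-through cache in front of slow object storage.</p>

```python
from collections import OrderedDict


class ChunkCache:
    """Tiny LRU read-through cache standing in for a node's local cache (hypothetical)."""

    def __init__(self, fetch_remote, capacity=2):
        self.fetch_remote = fetch_remote   # slow call, e.g. a read from object storage
        self.capacity = capacity
        self.chunks = OrderedDict()
        self.remote_reads = 0

    def read(self, chunk_id):
        if chunk_id in self.chunks:
            self.chunks.move_to_end(chunk_id)   # mark as recently used
            return self.chunks[chunk_id]
        self.remote_reads += 1                  # slow path: go to object storage
        data = self.fetch_remote(chunk_id)
        self.chunks[chunk_id] = data
        if len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)     # evict the least recently used chunk
        return data


cache = ChunkCache(fetch_remote=lambda chunk_id: f"data-for-{chunk_id}")
cache.read("2025-01")   # warm-up query pays the object storage round trip
cache.read("2025-01")   # repeated queries are served from the local cache
cache.read("2025-01")
print(cache.remote_reads)  # 1
```

<p>Only the first read pays the remote cost, which is why frequently queried chunks feel fast once they have been touched at least once.</p>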
<p>Further, I asked him: &ldquo;But how do you model the data flow then? How does StarRocks fit the picture, are you doing any ETL before landing in StarRocks?&rdquo;</p>
<p>He surprised me with data modeling still being the key:</p>
<blockquote>Transactional Processing (OLTP or relational) data models in most cases are not a good fit for StarRocks or ClickHouse. But if the join/foreign keys are clearly defined in the data models, StarRocks can leverage <a href="https://docs.starrocks.io/docs/using_starrocks/Colocate_join/">Colocate Join</a> to reduce pre-join via ETL.<br><br>However, pre-join can always bring visible benefits to query performance and data quality. The point here is about <strong>how much percentage data models can be efficiently served via StarRocks without streaming joins</strong> in Flink/Spark which requires highly-skilled engineers.</blockquote>
<p>He continues with denormalization and colocation by dimension:</p>
<blockquote>Event data are typically denormalized to some extent, so we typically just need to colocate the big event table with the big User/Customer/Product dimension table via the <code>USER_ID</code> / <code>CUSTOMER_ID</code> / <code>PRODUCT_ID</code></blockquote>
<p>We will explore colocate joins in a bit, as this seems to be a key part of the speed for joins. Also, the distinction between when to stream and when to use StarRocks is super helpful and shows again that almost all good designs come down to good data architecture and modeling your data flow.</p>
<p>On top of that, Eric mentioned that &ldquo;query planning and data modeling is the key to the success of their project&rdquo;. For example, they use Kafka as a Sink and land data before going into StarRocks as one modeling choice for certain data.</p>
<h4 id="colocated-joins-how-they-work">Colocated JOINS: How They Work</h4>
<p>Colocating dimensions with the large fact tables is a noteworthy approach. Let&rsquo;s quickly explore how these colocated joins work in detail to understand why this matters, before we get back to the interview.</p>
<p>Colocate Join lets equi-joins run locally by ensuring tables share the same bucketing key, bucket count, and replica placement so corresponding bucket copies reside on the same backend nodes, <strong>avoiding network shuffle</strong> or <strong>broadcast overhead</strong>.</p>
<p>Tables that should join together are organized into a <strong>Colocation Group (CG)</strong> with the same schema (<strong>Colocation Group Schema (CGS)</strong>), consisting of these three properties that must be the same for all tables:</p>
<ul>
<li>Same <strong>bucketing</strong> key (type and order). e.g., both tables use <code>customer_id INT</code></li>
<li>Same <strong>number of buckets</strong>. e.g., both have 8 buckets</li>
<li>Same <strong>replica count</strong>. e.g., both have 3 replicas</li>
</ul>
<p>This guarantees that rows with the same <code>customer_id</code> always land on the same node, making joins local.</p>
<figure>
<a target="_blank" href="/blog/starrocks-lakehouse-native-joins/starrocks-colocation.png" title="/blog/starrocks-lakehouse-native-joins/starrocks-colocation.png">

</a><figcaption class="image-caption">Colocation in StarRocks Overview</figcaption>
</figure>
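<p>To make the bucketing idea concrete, here is a minimal Python sketch of two tables that share a bucketing key and bucket count. It is a simulation, not StarRocks internals: the CRC32 hash, the bucket-to-node mapping, and the helper names <code>bucket_of</code>, <code>node_of</code>, and <code>place</code> are all assumptions chosen for illustration.</p>

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 8   # same bucket count for every table in the group
NUM_NODES = 4     # hypothetical cluster size


def bucket_of(customer_id):
    """Map a bucketing key to a bucket with a deterministic hash."""
    return zlib.crc32(str(customer_id).encode()) % NUM_BUCKETS


def node_of(bucket):
    """Pin each bucket to one node; the mapping is shared by all tables in the group."""
    return bucket % NUM_NODES


def place(rows, key):
    """Distribute rows onto nodes by their bucketing key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_of(bucket_of(row[key]))].append(row)
    return nodes


orders = [{"order_id": i, "customer_id": i % 5} for i in range(20)]
customers = [{"customer_id": c, "name": f"customer-{c}"} for c in range(5)]

orders_by_node = place(orders, "customer_id")
customers_by_node = place(customers, "customer_id")

# Join node by node: each order finds its customer on the same node,
# so no row has to be shuffled or broadcast across the network.
joined = []
for node, local_orders in orders_by_node.items():
    local_customers = {c["customer_id"]: c for c in customers_by_node[node]}
    for order in local_orders:
        joined.append({**order, **local_customers[order["customer_id"]]})

print(len(joined))  # 20
```

<p>If the two tables used different bucket counts or different keys, the lookup inside the loop could miss, and the engine would have to move rows between nodes first.</p>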
<p>There&rsquo;s much more. Check out the docs on <a href="https://docs.starrocks.io/docs/using_starrocks/Colocate_join/" target="_blank" rel="noopener noreffer">Colocated joins</a> and <a href="https://docs.starrocks.io/docs/best_practices/query_tuning/schema_tuning/" target="_blank" rel="noopener noreffer">Schema Tuning Recipes</a>. But this also shows that there&rsquo;s a lot of planning involved in how you model your data, as these settings need to be consistent across tables and optimized for your use case and data.</p>
<h3 id="trade-offs-key-for-partitions">Trade-offs: Key for Partitions</h3>
<p>There are always tradeoffs. As seen with the colocation and schema groups, you need to define what to colocate on. You need to know your data well.</p>
<p>It&rsquo;s the same as we always did in our work, deciding what to partition on, but here it&rsquo;s done more holistically across nodes and tables. It&rsquo;s a tradeoff and a choice which columns you do exactly that for.</p>
<p>For example, Coinbase with its Bitcoin data has distinct addresses and dates, so you could either partition by range on such high-cardinality addresses or by blockchain timestamp, but you can&rsquo;t use both. You need to decide.</p>
<p>On distribution key tradeoffs, Eric says:</p>
<blockquote>
<p>You can only optimize for one of them and leverage the index for the other one. That&rsquo;s a trade-off you have to do.</p>
</blockquote>
<p>This also fits into how you model the data so you don&rsquo;t need to <strong>join</strong>, or when you do, you can join efficiently. StarRocks excels at joins using primary keys or unique keys.</p>
<p>For Coinbase&rsquo;s blockchain use case with high-cardinality address columns, Eric and <a href="https://www.linkedin.com/in/xinyu-liu-769512a/" target="_blank" rel="noopener noreffer">Xinyu</a> <a href="https://www.youtube.com/watch?v=Wl25FFBJPZA&amp;embeds_referring_euri=https%3A%2F%2Fcelerdata.com%2F" target="_blank" rel="noopener noreffer">recommend</a>:</p>
<ol>
<li><strong>Partition by timestamp</strong> (monthly) rather than by address, as high cardinality makes address-based partitioning operationally unscalable.</li>
<li><strong>Distribute by only one address column</strong> (from OR to, not both). If you distribute by both, queries must use AND predicates, which rarely match actual query patterns.</li>
<li><strong>Use ORDER BY</strong> on the distribution column to avoid needing bitmap indexes on that column.</li>
<li><strong>Create a secondary lean table</strong> optimized for the other address column, then join back using compound keys (transaction ID + timestamp + block hash).</li>
</ol>
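<p>The layout from these recommendations can be sketched in a few lines of Python. This is a toy model under stated assumptions: the column names, the bucket count, and the CRC32 hash are hypothetical, and the lean table here carries only the transaction ID and timestamp (the block hash from the compound key is left out for brevity).</p>

```python
import zlib
from collections import defaultdict
from datetime import datetime

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_of(address):
    return zlib.crc32(address.encode()) % NUM_BUCKETS


def month_of(ts):
    return ts.strftime("%Y-%m")


transfers = [
    {"tx_id": 1, "ts": datetime(2025, 1, 3), "from_addr": "0xaaa", "to_addr": "0xbbb"},
    {"tx_id": 2, "ts": datetime(2025, 1, 9), "from_addr": "0xccc", "to_addr": "0xaaa"},
    {"tx_id": 3, "ts": datetime(2025, 2, 1), "from_addr": "0xaaa", "to_addr": "0xccc"},
]

# Main table: partitioned by month, distributed by from_addr only.
main = defaultdict(list)
# Lean table: same partitioning, distributed by to_addr, join keys only.
lean = defaultdict(list)
for t in transfers:
    main[(month_of(t["ts"]), bucket_of(t["from_addr"]))].append(t)
    lean[(month_of(t["ts"]), bucket_of(t["to_addr"]))].append(
        {"tx_id": t["tx_id"], "ts": t["ts"], "to_addr": t["to_addr"]}
    )


def sent_by(address, month):
    """Touches exactly one (partition, bucket) slot of the main table."""
    slot = main[(month, bucket_of(address))]
    return [t["tx_id"] for t in slot if t["from_addr"] == address]


def received_by(address, month):
    """Finds keys in the lean table; a join back on the keys would fetch full rows."""
    slot = lean[(month, bucket_of(address))]
    return [r["tx_id"] for r in slot if r["to_addr"] == address]


print(sent_by("0xaaa", "2025-01"))      # [1]
print(received_by("0xaaa", "2025-01"))  # [2]
```

<p>Both lookups read a single slot. Without the lean table, the second query would have to scan every bucket of the month, because the main table is not distributed by the receiving address.</p>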
<h4 id="struggles-with-starrocks">Struggles with StarRocks</h4>
<p>When asked what they lost or struggled with when switching to StarRocks, Eric said that the community is much smaller than ClickHouse&rsquo;s, for example. A well-known community and perception in the US is important, as most people in the Bay Area and Seattle area have still never heard about StarRocks.</p>
<p>Compared to ClickHouse, which is buying analytical solutions and trying to be a cohesive software stack, StarRocks seems to take a simpler and slower approach, he comments.</p>
<p>He also misses the stability of Snowflake, as new StarRocks releases sometimes introduce dozens of small errors in areas such as deployment, system metadata, and elastic compute. But overall, the main gap is popularity and acceptance in the user community, because &ldquo;people are more willing to try and learn the technology that they have heard about or their friends can mention&rdquo;, he says.</p>
<h3 id="forward-looking-lakehouse-architectures-and-benchmarks">Forward-Looking: Lakehouse Architectures and Benchmarks</h3>
<p>When asked about using StarRocks with Iceberg/Lakehouse solutions and loading data directly, Eric said: &ldquo;Both: hot and recent data (2 weeks ~ 3 months) are stored in StarRocks native layout on S3, such as partition, index, bucketing, colocation, …, while the cold historical data are federated from Iceberg/Delta to share the same data partitions from Lakehouse. Iceberg can’t deliver the performance compared with the native storage format&rdquo;.</p>
<p>In terms of comparison and <strong>benchmarks</strong>, Eric says that StarRocks significantly outperformed ClickHouse in their TPC-H 1TB benchmarks. ClickHouse failed 12 of 22 queries due to out-of-memory errors, particularly on join-heavy queries. The data was 10 blockchains with 300+ tables and 573 billion rows.</p>
<p>But the competition is still ongoing, he says: &ldquo;Simply put, StarRocks naturally fits much better for multi-table join scenarios especially for e-commerce and finance sectors. And ClickHouse has better out-of-box templates for observability use cases&rdquo;.</p>
<p>He continues that ClickHouse is harder to maintain, as there are so many knobs and tweaks, whereas StarRocks on average is much simpler for the engineering team to understand and learn, and more manageable.</p>
<p>When asked what they can learn from each other, Eric said this:</p>
<ul>
<li><strong>StarRocks should learn from ClickHouse</strong>: Memory Table, Integration / Connector with other partners, and rich complex/advanced UDF/UDAF</li>
<li><strong>ClickHouse should learn from StarRocks</strong>: Sophisticated cost-based optimizer (CBO), multi-table Materialized View, primary key table optimization for DML, Concurrent Queries and Join Join Join 🙂</li>
</ul>
<h2 id="why-fresha-switched-to-starrocks">Why Fresha Switched to StarRocks</h2>
<p>The second interview with <a href="https://www.linkedin.com/in/anton-s-borisov/" target="_blank" rel="noopener noreffer">Anton Borisov</a>, an experienced Principal Data Architect at Fresha and a heavy user of StarRocks who is building their own tooling on top of it, gives us more valuable insights. Anton has a strong background with relational databases, specifically Postgres, and has worked with distributed OLAP systems. <a href="https://www.fresha.com/" target="_blank" rel="noopener noreffer">Fresha</a> is the world&rsquo;s leading marketplace platform for the beauty, wellness, and self-care industry, trusted by millions of consumers and businesses worldwide.</p>
<p>First, we talked about why they switched, and it was the same pattern as with Coinbase: customer-facing analytics built on dbt materializations and batched into Snowflake every 20 minutes had become too slow.</p>
<p>Anton was a big ClickHouse user and wanted to use it first. But then they discovered StarRocks and found that its joins would simplify their architecture, especially with dbt, where they had ended up with lots of different layers, CTEs, and joins. With StarRocks, they threw the same SQL at it to get a sense of how it performed and how feasible it was. That gave them a first baseline of what&rsquo;s possible and how fast. They reached it rather quickly and could optimize on top of it, whereas with ClickHouse it was much harder to set up a first working baseline to iterate on.</p>
<p>As the first baseline was 4 seconds with lots of joins, it was a great start from which they could tweak colocation and optimize data flow.</p>
<p>Anton told me that Fresha uses two pipelines or ways of ingestion into StarRocks:</p>
<ol>
<li>Streaming data that is slightly pre-aggregated with Flink and then ingested directly into StarRocks. StarRocks then handles the optimal storage with CN (Compute Nodes), which are stateless and fetch data from S3 in <strong>shared-data mode</strong>. Then you have cache in memory and disk. You could also use Kafka Connect, Debezium, and <a href="https://docs.starrocks.io/docs/loading/" target="_blank" rel="noopener noreffer">many other ways</a> to ingest.</li>
<li>Data from S3. Here the workflow is <code>Snowflake -&gt; Spark joins and export to Iceberg -&gt; StarRocks</code>
<ul>
<li><strong>Important to note</strong>: Spark gave them the best option to add as much metadata as possible, like <a href="https://iceberg.apache.org/puffin-spec/" target="_blank" rel="noopener noreffer">Puffin</a> files and general Iceberg metadata, for StarRocks to quickly read from, even if the data is not in hot storage.</li>
<li>This approach is done on a batch schedule, usually once a day. Also interestingly, they might duplicate certain datasets and store them in different sorting orders (e.g., one on dimension customer, one on dimension region), just to have StarRocks optimize reading on cold and cheaper S3 storage for different dashboards. You pay extra for S3, but you can avoid expensive hot cache if you want. You can still let StarRocks load some of it into local cache to speed things up, say if there&rsquo;s an event coming up.</li>
<li>This architecture makes it flexible without having to change data pipelines.</li>
<li>Fresha treats Iceberg tables as mostly immutable: data is ingested append-only, and they don&rsquo;t continuously apply row-level corrections. Instead, they accumulate corrections and periodically re-ingest/merge them back into Iceberg for the workloads that need corrected history. The target model is that older (non-operational) data remains stable, with only controlled correction cycles rather than ongoing mutations.</li>
</ul>
</li>
</ol>
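<p>Why would storing the same dataset in two sort orders help on cold storage? A minimal Python sketch of min/max file skipping shows the effect. It assumes that each data file carries per-column lower and upper bounds that a reader can check before opening the file (Iceberg tracks such bounds in its metadata); the function names, file size, and toy data are hypothetical.</p>

```python
def write_files(rows, sort_key, rows_per_file=3):
    """Sort rows, split them into files, and record min/max stats per column."""
    ordered = sorted(rows, key=lambda r: r[sort_key])
    files = []
    for i in range(0, len(ordered), rows_per_file):
        chunk = ordered[i:i + rows_per_file]
        stats = {
            col: (min(r[col] for r in chunk), max(r[col] for r in chunk))
            for col in chunk[0]
        }
        files.append({"stats": stats, "rows": chunk})
    return files


def files_to_read(files, column, value):
    """Count files whose min/max range for the column can contain the value."""
    return sum(
        1 for f in files
        if f["stats"][column][0] <= value <= f["stats"][column][1]
    )


rows = [
    {"customer": c, "region": region}
    for c, region in enumerate(["eu", "us", "apac"] * 4)
]

by_customer = write_files(rows, "customer")
by_region = write_files(rows, "region")

print(files_to_read(by_customer, "customer", 7))  # 1
print(files_to_read(by_customer, "region", "eu"))  # 4
print(files_to_read(by_region, "region", "eu"))    # 2
```

<p>A dashboard filtering by region reads half as many files from the region-sorted copy in this toy example, while a customer lookup is best served by the customer-sorted copy. That is the trade Fresha describes: extra S3 storage in exchange for fewer cold reads.</p>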
<p>The way it works is that data from Flink lands in RAM first, then gets offloaded into cache. There is an option to specify the cache-to-memory ratio, but you can&rsquo;t control it directly, so you need to tweak the workload. They use the shared-data mode for this. Recall the architecture overview image above.</p>
<p>Regarding joins, Anton said that they just work as you&rsquo;d expect from a data warehouse. The speed heavily depends on how much &ldquo;money&rdquo; you invest, meaning how many CN nodes you scale up when reading from S3, for example. And because CN nodes in the <strong>shared-data architecture are stateless</strong>, it&rsquo;s easy to scale up more nodes. A great detail Anton shared: these CN nodes do RPC (<a href="https://github.com/apache/brpc" target="_blank" rel="noopener noreffer">bRPC</a> to be specific) calls with each other to communicate.</p>
<p>You can also use Materialized Views for joins and ingestion, but Fresha does most of that work in Flink or Spark.</p>
<h3 id="how-to-handle-data-updates">How to Handle Data Updates</h3>
<p>Data is updated with a continuous streaming job like Flink. StarRocks has a compaction job that optimizes the data that has been updated. It&rsquo;s an internal process where you configure some limits and times for it. Colocation, though, is handled a bit differently, either when you declare the table during ingestion or when you change the table definition with DDL. Then the engine distributes tablets to the correct nodes so they are physically together.</p>
<p>What if you need to backfill new data? Backfilling is the same as updating data. Either you set up a Flink job that ingests older missing data slowly, one batch after another, or you do an Iceberg export that you can load from. Fresha uses <a href="https://docs.starrocks.io/docs/integrations/streaming/pipe/" target="_blank" rel="noopener noreffer">StarRocks pipes</a> for this. Either way works.</p>
<p>Schema changes are a bit different from other systems. You need to stop ingestion from Flink, change the schema and dataset coming from the Flink job, change the DDL statement for your table definition in StarRocks, and then restart the pipeline. Only then do you see the new column.</p>
<h3 id="tradeoffs">Tradeoffs</h3>
<p>Anton says these are always the same: you can have super-fast real-time ingestion, but then it&rsquo;s more expensive. Or you have a little longer latency until everything is ingested, but at a lower cost.</p>
<p>Most important is the query speed, where their baseline is under 1 second (web analytics p95 is ~100ms) across all queries. For the business user, it should not matter whether joins are happening under the hood, whether data is stored in S3 with Iceberg, or whether it is coming directly from the StarRocks internal format.</p>
<p>If data is fully cold and on S3, StarRocks still manages 3-5 seconds of query latency based on extensive Puffin and Iceberg metadata sorted in the right order, Anton says.</p>
<p>Also check out the <a href="https://medium.com/fresha-data-engineering" target="_blank" rel="noopener noreffer">Fresha Data Engineering Blog</a> for many more insights from the team about best practices around StarRocks and more. Or check out their <a href="https://github.com/gomezgoes-con/northstar" target="_blank" rel="noopener noreffer">Northstar</a> utility for StarRocks, or the recent video about <a href="https://youtu.be/3jis0HzmD2A?si=qWbFvxlw2k-B2Kyf" target="_blank" rel="noopener noreffer">StarRocks at Fresha</a>.</p>
<h2 id="pinterest-from-druid-to-starrocks">Pinterest: From Druid to StarRocks</h2>
<p>Perhaps the biggest company that uses StarRocks is Pinterest. I wasn&rsquo;t able to reach anyone to interview, but there&rsquo;s a great article online at <a href="https://medium.com/pinterest-engineering/delivering-faster-analytics-at-pinterest-a639cdfad374" target="_blank" rel="noopener noreffer">Delivering Faster Analytics</a>. Let&rsquo;s analyze it.</p>
<p>Pinterest migrated from Apache Druid to StarRocks to power their Partner Insights tool, which provides real-time analytics dashboards to advertisers tracking ad performance across 500+ million monthly active users. After the migration, they reduced p90 query latency by 50% while using only 32% of their previous infrastructure. This is roughly a 3x improvement in cost-performance efficiency.</p>
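<p>As a quick sanity check on the 3x figure, the arithmetic can be written out in Python. The sketch assumes that &ldquo;cost-performance&rdquo; here means the infrastructure reduction alone, which is my reading of the numbers rather than something the Pinterest article spells out; the variable names are hypothetical.</p>

```python
infra_fraction = 0.32  # share of the previous infrastructure still in use

# Serving the same workload on 32% of the hardware means each unit of
# infrastructure does roughly 1 / 0.32 times the work it did before.
infra_gain = 1 / infra_fraction

print(f"{infra_gain:.1f}x")  # 3.1x
```

<p>The 50% lower p90 latency comes on top of that, so the 3x reads as a conservative summary.</p>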
<p>Their setup runs on 70 backend engines and 11 frontend nodes. The MySQL compatibility allowed easy integration with existing tools, and StarRocks&rsquo; native ingestion eliminated the need for heavy MapReduce jobs in their data pipeline.</p>
<p>Additionally, there are some interesting comments on <a href="https://www.reddit.com/r/dataengineering/comments/1em0a5t/comment/lgvo0i6/?utm_source=share&amp;utm_medium=web3x&amp;utm_name=web3xcss&amp;utm_term=1&amp;utm_content=share_button" target="_blank" rel="noopener noreffer">Reddit</a> related to this article. People using Druid and ClickHouse stated that:</p>
<blockquote>
<p>Both can handle multi-petabyte deployments the product seems to cover gaps within both architectures (Druid has an extremely heavy/complex footprint with limited join capabilities and is fairly costly to run). While Clickhouse can handle multi-petabyte volumes, certain design choices architecturally prevent it from auto rebalancing data which is critical at larger scale data volumes.</p>
</blockquote>
<p>He would use StarRocks &ldquo;largely because user <strong>requirements are only getting more complex</strong> at our company and architecturally/capability-wise StarRocks seems to be really the only OS solution that actually has near-direct compatibility with the MySQL protocol, materialized views and supports both real-time + batch loads, upserts. There is just a lot you can enable with just those four capabilities.&rdquo;</p>
<h3 id="index-exchange--others">Index Exchange &amp; Others</h3>
<p><a href="https://bsky.app/profile/ivan-torres.bsky.social/post/3lc3vie6q222w" target="_blank" rel="noopener noreffer">Ivan Torres</a>, Staff Engineer at Index Exchange, makes a similar point:</p>
<blockquote>I use StarRocks open source on K8s and benchmarked it against Druid. I also used ClickHouse before. For real time data they all come pretty close, some optimizations of each database work better for specific use cases, but you can achieve pretty similar numbers with the three of them.<br><br>What made me go to StarRocks is that it is <strong>a lot more versatile</strong>. It also does pretty good ad-hoc analytics and <strong>integrates with external catalogs</strong> like Iceberg for reading external tables. You can also have full separation of storage and compute and achieve good performance with disk caching.</blockquote>
<p>He goes on to compare: in &ldquo;Druid, joining tables ad-hoc is pretty much impossible&rdquo;. And in ClickHouse, the &ldquo;integration with external tables is not yet there,&rdquo; he said.</p>
<p>But there are also some limitations. Someone at a large payroll company reached out to me. They found StarRocks quite difficult to tune at scale and decided to migrate away from it: powerful, but managing it was a full-time job. They liked the StarRocks model, but didn&rsquo;t have the engineering resources to maintain and tune it beyond what they could support at the time. They split workloads between ClickHouse and Snowflake based on use case.</p>
<p>As always, it&rsquo;s a tradeoff. You need the right system for the right use case and the right engineering skills to manage it yourself, or just use the hosted version of each platform.</p>
<h2 id="common-patterns-across-adopters">Common Patterns Across Adopters</h2>
<p>To recap before we go into the technical deep dive, here are the common patterns and use cases I found while interviewing the companies and people above:</p>
<ul>
<li><strong>Joins as the really powerful part</strong>: Everyone mentioned the capabilities of joins as why they chose StarRocks over ClickHouse, Druid, or Pinot. Not just that joins work, but that they work without heavy pre-denormalization in Flink/Spark first.</li>
<li><strong>Faster time-to-baseline</strong>: Teams get a working baseline to iterate from sooner than with more complex setups, which saves costs because the engineering team understands the system faster.</li>
<li><strong>Use it when you want to speed up Snowflake or directly read from Iceberg tables</strong>: Cloud data warehouses hit their limits when customer-facing apps need sub-second query-response times.</li>
<li><strong>Hybrid hot/cold as the deployment pattern</strong>: Both Coinbase and Fresha run streaming into StarRocks for recent data while federating over Iceberg for cold historical data. StarRocks becomes the unified query layer across both.</li>
</ul>
<p>Besides all the advantages, it&rsquo;s clear that good data modeling is still required at every step. Colocated joins require upfront planning: the tables involved must share the same bucket key, bucket count, and replica placement. In other words, you choose the partition key not for a single table, but across a set of tables for best performance. Data flow and schema design should get a lot of love, as always in my opinion.</p>
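<p>To make the colocation requirement concrete, here is a minimal toy sketch. It assumes a CRC32 hash as a stand-in for the real distribution function, and the table contents and helper names (<code>bucket_of</code>, <code>distribute</code>, <code>colocated_join</code>) are hypothetical, not StarRocks internals. It only illustrates why both tables need the same key and bucket count: bucket <em>i</em> of one table is joined only with bucket <em>i</em> of the other.</p>

```python
import zlib
from collections import defaultdict


def bucket_of(key: str, bucket_count: int) -> int:
    """Map a bucket key to a bucket index (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % bucket_count


def distribute(rows, key_field, bucket_count):
    """Group rows into buckets by hashing the chosen key column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key_field], bucket_count)].append(row)
    return buckets


def colocated_join(left, right, key_field, bucket_count):
    """Join bucket i of the left table only with bucket i of the right table.

    This is only correct when both tables use the same key and the same
    bucket count, which is the upfront planning described above.
    """
    left_buckets = distribute(left, key_field, bucket_count)
    right_buckets = distribute(right, key_field, bucket_count)
    joined = []
    for idx, left_rows in left_buckets.items():
        for left_row in left_rows:
            for right_row in right_buckets.get(idx, []):
                if left_row[key_field] == right_row[key_field]:
                    joined.append({**left_row, **right_row})
    return joined


# Hypothetical example data.
orders = [
    {"customer_id": "c1", "amount": 10},
    {"customer_id": "c2", "amount": 25},
]
customers = [
    {"customer_id": "c1", "country": "CH"},
    {"customer_id": "c2", "country": "DE"},
]
result = colocated_join(orders, customers, "customer_id", bucket_count=8)
```

<p>If the two tables were bucketed with different counts or different keys, matching rows could land in different bucket indexes, and a bucket-local join like this one would miss them. That is why the decision has to be made across the whole set of tables.</p>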
<p>Now that we have heard a lot from actual users, let&rsquo;s go into the details and analyze the parts we haven&rsquo;t covered yet under the hood to see how the speed is possible and understand the architecture decisions that have been made.</p>
<h2 id="starrocks-technology-decisions-and-architecture">StarRocks Technology Decisions and Architecture</h2>
<p>Before we end, let&rsquo;s look at some more of the interesting architectural decisions StarRocks makes.</p>
<h3 id="caching-the-alternative-to-ingestion">Caching: The Alternative to Ingestion</h3>
<p>In shared-data mode, data sits on slower object storage. StarRocks mitigates this with a multi-tier cache: <code>memory -&gt; local disk -&gt; remote storage</code>. Queries hit hot cache first, and cold data gets prefetched based on optimized strategies.</p>
<p>This means you don&rsquo;t need ETL jobs to &ldquo;load&rdquo; data into StarRocks. You create an Iceberg catalog, point it at your existing tables, and queries automatically warm the cache over time. For predictable workloads, use the <a href="https://docs.starrocks.io/docs/data_source/block_cache_warmup/" target="_blank" rel="noopener noreffer">cache warmup</a> command to proactively load specific tables before users hit them.</p>
<p><strong>How it works under the hood:</strong> When querying Iceberg on S3, StarRocks splits remote Parquet/ORC files into fixed-size blocks (default 1MB) and caches them locally. Each block gets a unique key based on filename, modification time, and block ID. The first query fetches from remote storage, and subsequent queries read from local NVMe/SSD. Cached data persists across restarts.</p>
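<p>The block scheme above can be sketched roughly as follows. This is an illustration under assumptions, not the actual StarRocks implementation: the <code>BlockKey</code> class, the <code>blocks_for_range</code> helper, and the example path are hypothetical. Only the 1MB default and the three key components come from the description above.</p>

```python
from dataclasses import dataclass

BLOCK_SIZE = 1024 * 1024  # 1 MB, the default block size mentioned above


@dataclass(frozen=True)
class BlockKey:
    """Identity of one cached block: file, its modification time, block index."""
    filename: str
    modification_time: int
    block_id: int


def blocks_for_range(filename, modification_time, offset, length,
                     block_size=BLOCK_SIZE):
    """Return the keys of all blocks that the byte range
    [offset, offset + length) touches."""
    if length == 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return [BlockKey(filename, modification_time, i)
            for i in range(first, last + 1)]


# Hypothetical read: 2.5 MB starting 512 KB into a Parquet file.
keys = blocks_for_range(
    "s3://bucket/data/part-0.parquet",  # illustrative path
    1700000000,                          # illustrative modification time
    offset=512 * 1024,
    length=2621440,
)
block_ids = [k.block_id for k in keys]
```

<p>Including the modification time in the key means that a file rewritten under the same name produces different keys, so a reader following this scheme would not pick up stale blocks.</p>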
<p>On top of block cache, an in-memory page cache stores decompressed data pages and metadata. Hot data lives in memory, warm data on disk, cold data stays on S3. StarRocks also caches <a href="https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog/" target="_blank" rel="noopener noreffer">Iceberg metadata</a> (manifests, schemas) to avoid catalog round-trips on every query.</p>
<p>The tradeoff is that the first queries on cold data still hit remote storage, but once warmed, performance matches native tables.</p>
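<p>The read path through the tiers can be pictured with a small sketch. Everything here is a simplification I am assuming for illustration: plain dictionaries stand in for memory and local disk, there is no eviction, and the promotion policy (remote to disk, disk to memory) is hypothetical rather than taken from the StarRocks source.</p>

```python
class TieredCache:
    """Memory -> local disk -> remote storage lookup, with promotion on read."""

    def __init__(self, remote_fetch):
        self.memory = {}                  # hot tier
        self.disk = {}                    # warm tier (stands in for NVMe/SSD)
        self.remote_fetch = remote_fetch  # callable: key -> bytes
        self.remote_reads = 0

    def read(self, key):
        """Return (data, tier) where tier names where the data was found."""
        if key in self.memory:
            return self.memory[key], "memory"
        if key in self.disk:
            data = self.disk[key]
            self.memory[key] = data       # promote warm data to the hot tier
            return data, "disk"
        data = self.remote_fetch(key)     # cold: pay the remote round-trip
        self.remote_reads += 1
        self.disk[key] = data
        return data, "remote"

    def warm_up(self, keys):
        """Proactively load keys before users query them."""
        for key in keys:
            self.read(key)


# Hypothetical remote store that returns the same payload for every key.
cache = TieredCache(remote_fetch=lambda key: b"parquet-bytes")
tiers = [cache.read("block-0")[1] for _ in range(3)]
```

<p>Here the first read is served from remote storage, the second from disk, and the third from memory, with only one remote round-trip in total. Calling <code>warm_up</code> ahead of time moves that first slow read out of the user&rsquo;s query.</p>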
<h3 id="real-time-updates-without-the-merge-overhead">Real-Time Updates Without the Merge Overhead</h3>
<p>Beyond caching, StarRocks handles real-time data updates efficiently through its columnar storage engine. Data of the same type is stored contiguously, enabling better compression and reduced I/O as you query only the columns you need.</p>
<p>But what makes it interesting for real-time analytics is <em>how</em> it handles updates. Traditional OLAP systems use different strategies:</p>
<figure>
<a target="_blank" href="/blog/starrocks-lakehouse-native-joins/starrocks-merge-on-read.png" title="/blog/starrocks-lakehouse-native-joins/starrocks-merge-on-read.png">

</a><figcaption class="image-caption">Image from <a href="https://docs.starrocks.io/docs/introduction/Features/" target="_blank" rel="noopener noreffer">StarRocks Docs</a></figcaption>
</figure>
<p>StarRocks uses the <strong>delete-and-insert</strong> pattern. Instead of merge-on-read (which pays the merge cost at query time) or copy-on-write (which rewrites entire files on updates), StarRocks maintains a primary key index with delete bitmaps. Updates mark old rows as deleted and insert new ones. No expensive sort-merge at read time. This means sub-second data visibility for upserts while keeping query latency predictable, even on large update volumes.</p>
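<p>A toy version of the delete-and-insert idea, to show the moving parts. The class and method names are hypothetical, and a Python list of booleans stands in for the delete bitmap. The real engine works on columnar segments, so treat this as a sketch of the pattern only.</p>

```python
class PrimaryKeyTable:
    """Toy delete-and-insert store: append-only rows, a primary key index,
    and a delete bitmap."""

    def __init__(self):
        self.rows = []      # append-only row storage
        self.deleted = []   # delete bitmap: deleted[i] is True if rows[i] is stale
        self.pk_index = {}  # primary key -> position of the live row

    def upsert(self, key, value):
        old_pos = self.pk_index.get(key)
        if old_pos is not None:
            self.deleted[old_pos] = True  # mark the old version, no rewrite
        self.rows.append((key, value))
        self.deleted.append(False)
        self.pk_index[key] = len(self.rows) - 1

    def delete(self, key):
        pos = self.pk_index.pop(key, None)
        if pos is not None:
            self.deleted[pos] = True

    def scan(self):
        """Read path: skip rows flagged in the bitmap. No merging of
        versions by key is needed at query time."""
        return [row for row, dead in zip(self.rows, self.deleted) if not dead]


table = PrimaryKeyTable()
table.upsert("u1", "free")
table.upsert("u2", "pro")
table.upsert("u1", "pro")   # update: old u1 row is flagged, new row appended
live_rows = table.scan()
```

<p>The write does a bit more work (an index lookup and a bitmap flip), but the scan never has to sort and merge versions of the same key, which is the cost merge-on-read pays at query time.</p>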
<p>The storage engine guarantees ACID for each ingestion operation: transactions either fully succeed or fail, with isolation between concurrent loads.</p>
<h3 id="cost-based-optimizer-cbo-avoiding-joins-without-upfront-denormalization">Cost-Based Optimizer (CBO): Fast Joins Without Upfront Denormalization</h3>
<p>On-the-fly joins aren&rsquo;t typically a strength of OLAP databases, but they are one of StarRocks&rsquo; strengths, as we discussed. If your execution engine is fast enough, you avoid pre-building denormalized data marts, saving storage costs and ETL complexity.</p>
<p>Why are StarRocks joins fast?</p>
<p>The <a href="https://docs.starrocks.io/docs/using_starrocks/Cost_based_optimizer/" target="_blank" rel="noopener noreffer">cost-based optimizer</a> navigates the exponential search space of join plans, automatically transforms expensive join types into cheaper ones, and aggressively pushes predicates down before data reaches the join.</p>
<p>The optimizer leverages rich statistics including histograms for skewed data distributions and multi-column joint statistics to produce accurate cardinality estimates that guide join ordering and execution strategy selection.</p>
<p><strong>When you still want denormalization:</strong> For high-concurrency scenarios serving hundreds of simultaneous users, pre-aggregated views still win. The difference is you start with normalized data, query it directly, and selectively add materialized views where needed. Fresha, NAVER, and others use this feature, e.g., NAVER achieved <a href="https://celerdata.com/blog/how-join-changed-how-we-approach-data-infra-at-naver" target="_blank" rel="noopener noreffer">6x speedups</a> on specific high-traffic queries.</p>
<h3 id="intelligent-materialized-views">Intelligent Materialized Views</h3>
<p>StarRocks reads Iceberg/Hive tables <strong>in-place</strong> without copying data. Its vectorized engine processes Parquet/ORC directly, and the only overhead is metadata lookup. No transformation is needed.</p>
<p>The query flow:</p>
<ol>
<li>Frontend receives query → queries catalog for table metadata</li>
<li>Extracts file paths from Iceberg manifests (no data read yet)</li>
<li>Distributes file locations to Backend nodes</li>
<li>Backend opens Parquet/ORC directly from S3, applies predicate pushdown</li>
<li>Reads only required columns (late materialization)</li>
</ol>
<p>On top of this, StarRocks&rsquo; <a href="https://docs.starrocks.io/docs/using_starrocks/async_mv/Materialized_view/" target="_blank" rel="noopener noreffer">intelligent materialized views</a> auto-refresh based on base table changes and are selected automatically at query time. The optimizer rewrites queries to use MVs when beneficial. No manual intervention is needed.</p>
<p>This enables a layered approach to data modeling without traditional ETL pipelines:</p>
<figure>
<a target="_blank" href="/blog/starrocks-lakehouse-native-joins/starrocks-denormalization-strategies.png" title="/blog/starrocks-lakehouse-native-joins/starrocks-denormalization-strategies.png">

</a><figcaption class="image-caption"><a href="https://docs.starrocks.io/docs/introduction/Features/" target="_blank" rel="noopener noreffer">Database Features | StarRocks</a></figcaption>
</figure>
<p>Starting from the bottom: raw data sits in your data lake (Iceberg, Hudi, Delta Lake, Hive). An <strong>external catalog MV</strong> can transform this into normalized tables, still queryable for ad-hoc analysis and OLAP workloads. From there, <strong>async MVs</strong> can create denormalized tables for faster OLAP queries. For high-concurrency standard reports, <strong>aggregation MVs</strong> (roll-ups) pre-compute the heavy lifting.</p>
<p>The key insight is that you don&rsquo;t build all these layers upfront. You start with normalized data, query it directly, and see if it&rsquo;s fast enough. Then be selective and add MVs where you see bottlenecks. You preserve the single source of truth in Iceberg while progressively optimizing hot paths.</p>
<p>To recap, at a high level, StarRocks achieves its performance through four architectural decisions: <strong>Colocate Join</strong> for zero network overhead on co-located data, <strong>delete-and-insert</strong> for O(1) updates instead of merge-on-read, <strong>direct Parquet/ORC</strong> reading with no ingestion transformation, and <strong>SIMD vectorization</strong> for faster filtering and aggregation when computing over Parquet data, down to checks such as whether data rows are empty.</p>
<h2 id="conclusion-is-starrocks-too-good-to-be-true">Conclusion: Is StarRocks Too Good to Be True?</h2>
<p>So when should you actually choose StarRocks? After interviewing Coinbase, Fresha, and digging into Pinterest&rsquo;s migration, the pattern is clear. <strong>Choose StarRocks when joins are central to your analytics.</strong></p>
<p>ClickHouse excels at single-table aggregations and observability. But if you&rsquo;re constantly pre-denormalizing in Flink or Spark just to avoid joins, StarRocks lets you skip that pain.</p>
<p>Here&rsquo;s my mental model as of now:</p>
<table>
  <thead>
      <tr>
          <th>Use Case</th>
          <th>Choose</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Real-time with complex joins</td>
          <td><strong>StarRocks</strong></td>
          <td>Native MPP shuffle joins, mature <a href="https://docs.starrocks.io/docs/using_starrocks/Cost_based_optimizer/" target="_blank" rel="noopener noreffer">CBO</a> with 5 join strategies, and high concurrency (1,000s of users). ClickHouse made join improvements in 2025 but lacks distributed shuffle when tables are <a href="https://clickhouse.com/docs/faq/general/distributed-join" target="_blank" rel="noopener noreffer">not colocated</a>.</td>
      </tr>
      <tr>
          <td>Query Iceberg/Hive directly without ETL</td>
          <td><strong>StarRocks</strong> — but evaluate ClickHouse too</td>
          <td>StarRocks has a small edge with <a href="https://docs.starrocks.io/docs/using_starrocks/async_mv/use_cases/data_lake_query_acceleration_with_materialized_views/" target="_blank" rel="noopener noreffer">MVs on lake tables</a>, cross-node data cache, and <a href="https://docs.starrocks.io/releasenotes/release-4.0/" target="_blank" rel="noopener noreffer">Iceberg compaction API</a>. ClickHouse improved with DataLakeCatalog, native Parquet reader (<a href="https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog/" target="_blank" rel="noopener noreffer">Iceberg catalog docs</a>)</td>
      </tr>
      <tr>
          <td>Frequent updates/deletes with sub-second visibility</td>
          <td><strong>StarRocks</strong> for heavy CDC; <strong>ClickHouse</strong> for moderate updates</td>
          <td>StarRocks&rsquo; <a href="https://docs.starrocks.io/docs/table_design/table_types/primary_key_table/" target="_blank" rel="noopener noreffer">Primary Key table</a> is GA and purpose-built for continuous upserts with native Flink CDC support. ClickHouse&rsquo;s Lightweight Updates (25.7) made this faster but still experimental.</td>
      </tr>
      <tr>
          <td>Single-table, high-volume observability</td>
          <td><strong>ClickHouse</strong></td>
          <td>ClickHouse extended its lead here with query condition cache, lazy materialization, ClickStack, and a full-text search redesign. StarRocks&rsquo; <a href="https://docs.starrocks.io/docs/table_design/indexes/inverted_index/" target="_blank" rel="noopener noreffer">inverted index</a> is still experimental.</td>
      </tr>
  </tbody>
</table>
<p>Again, choose StarRocks when you hit cloud DWH speed limits for customer-facing analytics, need real-time analytics (&lt;50ms), have frequent updates/deletes, want to query Iceberg/Hive directly, need complex joins, or want MVs on external tables.</p>
<p>But advanced features don&rsquo;t come without tradeoffs. You have to choose a single partition key across colocated tables, so you can only optimize for one join pattern. Overall, good data modeling still matters. And if one of the use cases above is your main one, it&rsquo;s always best to build a proof of concept where you test and compare the differences.</p>
<p>If you&rsquo;re looking for a real-time BI and dashboarding tool designed to benefit from StarRocks&rsquo; real-time query performance, check out Rill and the recently released <strong><a href="https://docs.rilldata.com/developers/build/connectors/olap/starrocks" target="_blank" rel="noopener noreffer">native StarRocks connector</a></strong>. With it, you can connect directly to your StarRocks instance and query data in real time without first ingesting it into Rill, expanding Rill&rsquo;s support alongside ClickHouse, Druid, and Pinot. You simply specify StarRocks in their source YAML and off you go. Check out the docs for more information. All open-source if you want. <a href="https://ui.rilldata.com/" target="_blank" rel="noopener noreffer">Cloud-ready</a> if you need.</p>
<p>Special thanks to <a href="https://www.linkedin.com/in/ericsun/" target="_blank" rel="noopener noreffer">Eric</a> and <a href="https://www.linkedin.com/in/anton-s-borisov/" target="_blank" rel="noopener noreffer">Anton</a>, who took the time to answer my questions and helped me learn a lot about how StarRocks works. Follow them on LinkedIn and subscribe to their blogs and posts.</p>
<h2 id="references">References</h2>
<ol>
<li>
<p>Bajaj, K., Luo, Z., Yang, Y., Barai, S., &amp; Hu, M.-M. (2024, July 31).<br>
<em>Delivering Faster Analytics at Pinterest</em>. Pinterest Engineering Blog.<br>
<a href="https://medium.com/pinterest-engineering/delivering-faster-analytics-at-pinterest-a639cdfad374" target="_blank" rel="noopener noreffer">Link</a> — Describes Pinterest&rsquo;s migration from Druid to StarRocks for their<br>
Partner Insights platform, achieving 50% p90 latency reduction at<br>
32% of the previous instance count, resulting in a 3x cost-performance<br>
improvement.</p>
</li>
<li>
<p>Vuong, H., &amp; Cao, H. N. (2025, March 6). <em>Building a Spark observability<br>
product with StarRocks: Real-time and historical performance analysis</em>.<br>
Grab Tech Blog.<br>
<a href="https://engineering.grab.com/building-a-spark-observability" target="_blank" rel="noopener noreffer">Link</a> — Describes Grab&rsquo;s &ldquo;Iris&rdquo; Spark observability platform redesign, migrating<br>
from a TIG stack (Telegraf/InfluxDB/Grafana) to a StarRocks-centered<br>
architecture to unify real-time + historical analysis and simplify<br>
ingestion/visualization.</p>
</li>
<li>
<p>Shekhawat, V., &amp; Andrews, M. (n.d.). <em>From BigQuery to Lakehouse: How<br>
We Built a Petabyte-Scale Data Analytics Platform – Part 1</em>. TRM Blog.<br>
<a href="https://www.trmlabs.com/resources/blog/from-bigquery-to-lakehouse-how-we-built-a-petabyte-scale-data-analytics-platform-part-1" target="_blank" rel="noopener noreffer">Link</a> — Explains TRM Labs&rsquo; move from BigQuery + distributed Postgres toward a<br>
lakehouse architecture, selecting Apache Iceberg for table format and<br>
StarRocks as the query engine for low-latency, high-concurrency<br>
user-facing analytics.</p>
</li>
<li>
<p>Event Recap, StarRocks Singapore Meetup #2 @Shopee. (n.d.). 知乎专栏<br>
(Zhihu).<br>
<a href="https://zhuanlan.zhihu.com/p/1888656940533526592" target="_blank" rel="noopener noreffer">Link</a> — Event recap for a StarRocks community meetup hosted at Shopee&rsquo;s<br>
Singapore office, describing talks and themes around customer-facing<br>
analytics use cases.</p>
</li>
<li>
<p>Shen, S., &amp; Sun, E. (2024, June). <em>Data Warehouse Performance on the Data Lakehouse</em> [Lightning Talk]. Data+AI Summit 2024, Databricks. <a href="https://www.youtube.com/watch?v=UTRcEqcTx4g" target="_blank" rel="noopener noreffer">Link</a> — A joint talk by CelerData and Coinbase presenting how StarRocks delivers data warehouse-level query performance directly on the data lakehouse.</p>
</li>
</ol>
<hr>
<pre class=""><em>Full article published at <a href="https://www.rilldata.com/blog/why-coinbase-and-pinterest-chose-starrocks-lakehouse-native-design-and-fast-joins-at-terabyte-scale" target="_blank" rel="noopener noreferrer">Rilldata.com</a> - written as part of <a href="/services">my services</a></em></pre>
]]></description>
</item>
<item>
    <title>A Diary of a Data Engineer</title>
    <link>https://www.ssp.sh/blog/diary-of-a-data-engineer/</link>
    <pubDate>Tue, 13 Jan 2026 10:36:39 &#43;0100</pubDate>
    <author>Simon Späti</author>
    <guid>https://www.ssp.sh/blog/diary-of-a-data-engineer/</guid><enclosure url="https://www.ssp.sh/blog/diary-of-a-data-engineer/featured-image.jpg" type="image/jpeg" length="0" /><description><![CDATA[<p>You ingest data. You model it. You transform it. You serve it. Someone asks for a change. Everything breaks. You rebuild. This is the loop. It was the loop in 2005 with SSIS and star schemas. It&rsquo;s the loop in 2025 with dbt and Iceberg, or 2026 with prompting AI agents.</p>
<p>The tools change. The loop doesn&rsquo;t.</p>
<h2 id="the-invisible-plumbers">The Invisible Plumbers</h2>
<p>When I started my career in 2003, there was no &ldquo;data engineering&rdquo;. There was no big data, no data science. We called it Business Intelligence. Data Warehouse Developer. ETL Developer.</p>
<p>We were the plumbers of the organization. And like plumbers, nobody noticed us until something broke.</p>
<p>Being a data engineer means: you&rsquo;re building the foundation that everyone stands on, but when the presentation goes well, the data scientist, the app developer, anyone who presents gets the applause. When the executive makes the right decision, the analyst gets the credit. When the dashboard loads in 1 second instead of 20, nobody says anything at all.</p>
<p>But when one number is wrong? When a pipeline is 10 minutes late? When someone asks for &ldquo;a small change&rdquo; and you explain it&rsquo;ll take a day, or a week to fix it?</p>
<p>That&rsquo;s when everyone notices you. And shares their opinion on how to make it better.</p>
<p>&ldquo;Why does this take so long? It&rsquo;s just one column. Why isn&rsquo;t it real-time?&rdquo;</p>
<p>They don&rsquo;t see the 147 downstream dependencies. The three systems that need a fuzzy-logic join. Or the security measures that go through three different subnetworks. The backfill that&rsquo;ll take 6 hours to run. The schema that hasn&rsquo;t been touched since 2021 because the last person who understood it left the company long ago.</p>
<p>This is the paradox of data engineering: when you do your job, you&rsquo;re invisible. When anything goes wrong, you&rsquo;re under a microscope.</p>
<h2 id="the-epochs-a-50-year-journey">The Epochs: A 50-Year Journey</h2>
<p>To understand where we are today, you need to understand where we came from.</p>
<h3 id="1970s-the-beginning">1970s: The Beginning</h3>
<p>Edgar F. Codd proposed the relational model in 1970: a way to abstract the complexities of data storage. SQL was built on top of it, and by the 1980s it became the standard. IBM built System R. Oracle launched their RDBMS in 1979.</p>
<p>The foundation was laid. But nobody called it &ldquo;data engineering&rdquo; yet.</p>
<h3 id="1980s-1990s-the-warehouse-era">1980s-1990s: The Warehouse Era</h3>
<p>Bill Inmon formalized data warehousing principles in the 1980s. Many call him the father of data warehousing. Then in 1996, Ralph Kimball published &ldquo;The Data Warehouse Toolkit&rdquo; and gifted us with dimensional modeling: star schemas, fact tables, slowly changing dimensions.</p>
<p>These concepts? They&rsquo;re still relevant today.</p>
<h3 id="2000s-when-big-changed-everything">2000s: When &ldquo;Big&rdquo; Changed Everything</h3>
<p>The dot-com bubble burst. Tech titans such as Google, Amazon, and Yahoo were born, and they hit walls their databases couldn&rsquo;t scale past.</p>
<p>So Google released two groundbreaking papers: the Google File System in 2003 and MapReduce in 2004. Yahoo responded with Hadoop in 2006. Hardware prices plummeted.</p>
<p>Suddenly, we weren&rsquo;t just BI engineers anymore. We were &ldquo;<strong>Big Data Engineers</strong>&rdquo;. We had to know traditional relational databases AND the new open-source filesystems. The skillset kept expanding—from data modeling to software development to mastering Hive and Spark, all coordinated with R and Python.</p>
<p>The term &ldquo;big&rdquo; was everywhere. But how big is &ldquo;big&rdquo;? Nobody really knew. We just knew the old ways weren&rsquo;t working anymore. And Facebook and co showed us the way.</p>
<h3 id="2010s-the-cloud-changes-the-game">2010s: The Cloud Changes the Game</h3>
<p>Amazon announced AWS. Google Cloud and Azure followed. Companies no longer needed to own hardware. The flexibility was unprecedented, and we could get any DWH on demand.</p>
<p>Redshift. Snowflake. And then the open-source wave hit:</p>
<ul>
<li>Airflow for orchestration (2014)</li>
<li>Superset for visualization (2015)</li>
<li>dbt for transformation (2016)</li>
</ul>
<p>And in 2017, Maxime Beauchemin—after creating both Airflow and Superset—published &ldquo;<a href="https://medium.com/free-code-camp/the-rise-of-the-data-engineer-91be18f1e603" target="_blank" rel="noopener noreffer">The Rise of the Data Engineer</a>&rdquo;. He defined, for the first time, what data engineering actually meant. He explained the shift from business intelligence to data engineering.</p>
<p>I remember releasing my first viral article in March 2018: &ldquo;<a href="https://www.ssp.sh/blog/data-engineering-the-future-of-data-warehousing/" target="_blank" rel="noopener noreffer">Data Engineering, the future of Data Warehousing?</a>&rdquo; It got 200 likes. Back then, that was a lot 😉.</p>
<p>Since then? New technologies appeared weekly. The Modern Data Stack was born.</p>
<h3 id="2020s-devops-meets-data-engineering">2020s: DevOps Meets Data Engineering</h3>
<p>This is where it gets interesting.</p>
<p>Data engineering isn&rsquo;t just about moving data anymore. It&rsquo;s about <strong>infrastructure as code</strong>, version control for data, CI/CD pipelines, Kubernetes, Docker, and Terraform.</p>
<p>The skills needed have exploded. You need to know:</p>
<ul>
<li>SQL (still the foundation)</li>
<li>Python or Scala</li>
<li>Cloud infrastructure (AWS/GCP/Azure)</li>
<li>Linux and bash scripting</li>
<li>Git for version control</li>
<li>Data modeling (the lost art)</li>
<li>Business logic (the most important)</li>
</ul>
<p>DevOps principles are now table stakes. You&rsquo;re not just building pipelines. You&rsquo;re building systems that need to self-heal, auto-scale, and deploy without downtime on any environment.</p>
<p>And today? AI agents? They&rsquo;re the latest chapter. But under all the hype is the same eternal truth: <strong>you need fresh, organized, clean data.</strong></p>
<h2 id="the-eternal-loop-same-problems-new-tools">The Eternal Loop: Same Problems, New Tools</h2>
<p>Here&rsquo;s the uncomfortable truth: we&rsquo;ve been solving the same problems for 50 years.</p>
<p>In 2005, we had SSIS and star schemas. &ldquo;The cube is rebuilding&rdquo; was the pain point.</p>
<p>In 2015, we had Hadoop and Spark. &ldquo;The cluster is full&rdquo; was the nightmare.</p>
<p>In 2025, we have dbt and Snowflake. &ldquo;The bill is how much?&rdquo; is the new horror story.</p>
<p>The tools change. The problems don&rsquo;t.</p>
<p>Last month I analyzed a 200-line dbt model as part of a larger GitHub repository. You know what it was doing? Exactly what we did in 2005 with stored procedures. Same business logic. Different syntax. I laughed. Then I cried a little. (just kidding, I didn&rsquo;t 😆)</p>
<p>An old data warehouse architect from 2003 once drew a star schema on a whiteboard in 40 seconds. It would take my team three sprints to model in Oracle Warehouse Builder (OWB). He said, &ldquo;We called it just another day at the office&rdquo;.</p>
<p>We&rsquo;re not really any smarter than the people before us. We just have better marketing 😉.</p>
<h2 id="what-actually-matters-and-what-doesnt">What Actually Matters (And What Doesn&rsquo;t)</h2>
<p>Here&rsquo;s what I&rsquo;ve learned after 20+ years.</p>
<h3 id="the-excel-file-that-saved-me">The Excel File That Saved Me</h3>
<p>I was in a coffee meeting with a finance analyst, call her Maria. Fifteen years at the company. She opened her laptop and showed me an Excel file (sometimes it was a Microsoft Access DB with a custom UI!).</p>
<p>Forty-seven tabs. Formulas referencing other files on a shared drive. VBA macros from 2012. VLOOKUP nested inside SUMIF.</p>
<p>&ldquo;This is how we calculate the quarterly forecast&rdquo;.</p>
<p>As the person who had to understand it and make it available to everyone, I was a little shocked. I&rsquo;d spent three weeks reverse-engineering the business logic, recreating it in SQL Server, and adding it to our data warehouse. But the numbers never matched exactly; they were only close most of the time.</p>
<p>After several such experiences, sometimes with Microsoft Access databases with a custom UI built in (!!!), I learned something. Though my initial reaction was shock and horror, <strong>Excel isn&rsquo;t the enemy</strong>. Excel is the <strong>business telling you what they actually need</strong>.</p>
<p>When someone asks to export to Excel, they&rsquo;re not rejecting your work. They&rsquo;re telling you something. Maybe your dashboard is too slow. Maybe they need to add a column you didn&rsquo;t think of. Maybe they just need to feel in control of their analysis.</p>
<p>Power users will overengineer everything, but ask them for the Excel file and you might get validated business logic and ETL code for free. Win-win.</p>
<h3 id="the-real-time-lie">The Real-Time Lie</h3>
<p>Everyone wants real-time. &ldquo;We need to see this data instantly&rdquo;, they say. &ldquo;For decision-making&rdquo;.</p>
<p>I always ask: &ldquo;What decision will you make differently if you see it 10 minutes sooner?&rdquo;</p>
<p>Most of the time, they can&rsquo;t answer. There&rsquo;s a small percentage that needs it: air traffic control, fraud detection, Black Friday e-commerce. But the rest? They just think real-time serves them better.</p>
<p>Real-time adds a lot of complexity. Harder debugging. Harder backfills. Harder historization. The question is, for what? So someone can watch a number update every 30 seconds instead of every hour?</p>
<p>Push back on &ldquo;real-time&rdquo;. Start with hourly refreshes. It&rsquo;s almost always enough.</p>
<h3 id="the-schema-change-a-people-problem">The Schema Change: A People Problem</h3>
<p>They said it was small. Just renaming <code>user_id</code> to <code>customer_id</code>.</p>
<p>You trace the lineage. 147 downstream dependencies. Three teams. One undocumented view from 2019 that somehow powers the CEO&rsquo;s dashboard.</p>
<p>That&rsquo;s when you realize: <strong>schema changes are usually people problems</strong>, not technology problems. Things break when upstream producers don&rsquo;t take responsibility for downstream analytics and don&rsquo;t communicate their changes. There&rsquo;s no process in place. Just assumptions.</p>
<p>Fix the people process first, then update the code.</p>
<h2 id="the-lost-art-of-data-modeling">The Lost Art of Data Modeling</h2>
<p>Max Beauchemin once <a href="https://www.heavybit.com/library/podcasts/data-renegades/ep-3-building-tools-that-shape-data-with-maxime-beauchemin" target="_blank" rel="noopener noreffer">said</a> in an interview: &ldquo;I like the analysis side. I think I&rsquo;m a good data modeler. It&rsquo;s kind of a lost art, so I still do a lot of our data pipelines&rdquo;.</p>
<p>He&rsquo;s right.</p>
<p>After years of &ldquo;just dump it in the data lake&rdquo;, people are rediscovering that structure matters. Data modeling forces you to think about:</p>
<ul>
<li><strong>Grain</strong>: What&rsquo;s the lowest level of detail we need for this data?</li>
<li><strong>Relationships</strong>: How do these entities connect?</li>
<li><strong>What the business needs</strong>: Which insights can users not get today that lie in the source systems, alone or combined with other data sources?</li>
</ul>
<p>It&rsquo;s the difference between a data warehouse and a data dump.</p>
<p>But here&rsquo;s the thing: I believe AI will bring us back to the fundamentals. When AI-generated code breaks and you&rsquo;re out of context, what then? That&rsquo;s where the fundamentals save you. Data modeling. Understanding the grain. Knowing SQL deeply, not superficially<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>.</p>
<p>Someone needs to understand and refactor generated code. Someone needs to simplify. That someone is you.</p>
<h2 id="the-lost-code-we-inherit">The Lost Code We Inherit</h2>
<p>You&rsquo;ll inherit code from someone who left. Everyone does.</p>
<p>I once found a DAG called <code>final_v3_FIXED_REAL_FINAL.py</code>. Inside was a comment:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Mike: I don&#39;t know why this works. Just leave it</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Mike was right. I left it.</p>
<p><strong>The biggest pitfall?</strong> Trying to recreate everything to your taste. Accept the legacy. Adapt or improve one thing at a time. The motto &ldquo;Don&rsquo;t touch what works today&rdquo; applies to legacy code more than anywhere else.</p>
<p>Usually, the previous engineer wasn&rsquo;t naive or stupid. They were solving different problems with different constraints. Your job isn&rsquo;t to make it beautiful (though sometimes that helps!). Your job is to keep it running while slowly making it better over time.</p>
<h2 id="the-books-that-actually-matter">The Books That Actually Matter</h2>
<p>As the cycles came and went, these books helped me through each one<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>, and they still apply today.</p>
<p><strong><a href="https://unidel.edu.ng/focelibrary/books/Designing%20Data-Intensive%20Applications%20The%20Big%20Ideas%20Behind%20Reliable,%20Scalable,%20and%20Maintainable%20Systems%20by%20Martin%20Kleppmann%20%28z-lib.org%29.pdf" target="_blank" rel="noopener noreffer">Designing Data-Intensive Applications</a></strong> by Martin Kleppmann, about distributed systems and how to build them. You may want to wait a little: the second edition is just around the corner.</p>
<p><strong>The Data Warehouse Toolkit</strong> by Ralph Kimball. Someone in 2045 will still need to understand fact tables and dimension tables.</p>
<p><strong><a href="https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302" target="_blank" rel="noopener noreffer">Fundamentals of Data Engineering</a></strong> by Joe Reis and Matt Housley. A great starting point for the concepts and principles you hear everywhere, including in this article.</p>
<p><strong><a href="https://dedp.online/" target="_blank" rel="noopener noreffer">Patterns of Data Engineering (PoDE)</a></strong> by me. If you want, you can also read my unfinished online book. It starts with the state of the art, the history, and the key <a href="https://dedp.online/part-1/2-overview-dedp/understanding-convergent-evolution.html" target="_blank" rel="noopener noreffer">convergent evolution in data engineering</a>, the ever-returning cycle we talk about here, and explains it with higher-level patterns.</p>
<p>The first two books don&rsquo;t mention Snowflake, Lakehouse or dbt. They mention problems that existed in 1995 and will exist in 2045. That&rsquo;s the Lindy Effect, and how you know they&rsquo;re worth reading.</p>
<h2 id="what-i-know-now-that-i-wish-i-knew-then">What I Know Now That I Wish I Knew Then</h2>
<p>If I could go back to 2003 and talk to my younger self, here&rsquo;s what I might say. Boy:</p>
<p><strong>1. The tools will change. The fundamentals won&rsquo;t.</strong></p>
<p>Stop chasing every new framework. Learn data modeling. Learn how data flows. Learn SQL deeply, not superficially. Learn how humans make decisions. Everything else is syntax.</p>
<p>In 2026, AI helps us write code faster. But someone still needs to understand the fundamentals, the Data Engineering Lifecycle. That someone can be you.</p>
<p><strong>2. Talk to the business people.</strong></p>
<p>This is a crucial lesson in my journey. What you&rsquo;ll learn from them will inevitably make you a better data engineer. Technical skills can be learned, outsourced, automated. Knowledge about the business is much harder to come by.</p>
<p>The best data engineers aren&rsquo;t the ones who know every new tool. They&rsquo;re the ones who know <em>why</em> the data matters.</p>
<p><strong>3. You&rsquo;re building the foundation, not the showcase.</strong></p>
<p>When the presentation goes well, the data scientist or the AI engineer gets the credit. When the executive makes the right decision, the analyst gets credit. When the dashboard loads fast, nobody says anything.</p>
<p>But when one number is wrong? Everyone sees you.</p>
<p>Accept this. You&rsquo;re a plumber. Be the <strong>best plumber in the world</strong> and make sure nobody ever thinks about the pipes.</p>
<p><strong>4. Data quality is learned through pain.</strong></p>
<p>You can&rsquo;t understand data quality from a textbook. You need to see bad data. If you start looking, it won&rsquo;t take long, and you&rsquo;ll see really bad, production-breaking data. That will teach you what good data looks like.</p>
<p>And you&rsquo;ll only get faster by talking to the people who use it.</p>
<p><strong>5. Presentation matters more than you think.</strong></p>
<p>No matter how fancy your pipeline, how elegant your code, how profound your insights—if the presentation isn&rsquo;t right or the data quality is terrible, no one cares.</p>
<p>Throughout my career, presenting data understandably has been as important as building the pipeline. That&rsquo;s why these days, I focus on the <a href="https://craft.ssp.sh/" target="_blank" rel="noopener noreffer">storyline and craft</a> extensively.</p>
<p><strong>6. Set boundaries early.</strong></p>
<p>This job will take everything you give it. The people who succeed aren&rsquo;t the ones who work 80-hour weeks. Sure, in the beginning you&rsquo;ll need a long week here and there. But over time, you need to learn to say no. Document things so you can take vacation. Build systems that don&rsquo;t require you to be online at 3 AM.</p>
<p>Future you will thank you.</p>
<p><strong>7. Don&rsquo;t chase every trend.</strong></p>
<p>Data engineering is still going strong. Stronger than ever. AI won&rsquo;t take our jobs any time soon. The opposite is true. There will be more chaos, and people who know how to model data, understand business requirements, and deliver high-quality insights will always be needed.</p>
<p>Plus, every AI solution out there needs data, a lot of data, and probably a plumber to fix the pipeline too. Use the knowledge of past years building data pipelines. We don&rsquo;t need to rebuild everything every 5 years.</p>
<h2 id="the-loop-continues">The Loop Continues</h2>
<p>It&rsquo;s 2026. I&rsquo;m building a pipeline with DuckDB and Rill. The business wants faster dashboards and better insights. They want to edit data themselves. They want to use an AI chatbot. Or sometimes they just rename a column in the source system without telling anyone.</p>
<p>Here we go again.</p>
<p>But here&rsquo;s the thing: I still love it. Especially when I can write about the learnings.</p>
<p>I don&rsquo;t miss the late nights or the schema changes or the never-ending rewrites. I love the moment when you finally get the data right and someone in finance sees something they&rsquo;ve never seen before. When a dashboard actually changes a decision. When the CEO asks a question and you can answer it with data.</p>
<p>That&rsquo;s the job. Not the tools. Not the frameworks. Not the buzzwords.</p>
<p>The moment when data helps a human make a better decision.</p>
<h2 id="the-final-truth">The Final Truth</h2>
<p>The tools will change. The vendors will rise and fall. Snowflake will be replaced by something else. The latest shiny tool will become tomorrow&rsquo;s legacy. AI agents will be the next big thing, and then something after that.</p>
<p>But someone, somewhere, will always need to:</p>
<ul>
<li>Understand the grain of a business</li>
<li>Know why the numbers don&rsquo;t match</li>
<li>Explain to the CEO that the data they want doesn&rsquo;t exist yet</li>
<li>Debug why a pipeline broke at 2 AM</li>
<li>Figure out why production data looks different from dev data</li>
</ul>
<p>That someone is you.</p>
<p>You&rsquo;re the invisible plumber. The unsung engineer. The person who makes sure the foundation doesn&rsquo;t crumble while everyone else builds on <a href="https://xkcd.com/2347/" target="_blank" rel="noopener noreffer">top of it</a>.</p>
<p>And honestly? It&rsquo;s a pretty damn good job if you like to work quietly, helping a large part of the business.</p>
<p>Because 50 years from now, when we&rsquo;re using tools we can&rsquo;t even imagine today, someone will still be ingesting data, modeling it, transforming it, serving it. Someone will ask for a change. Something will break.</p>
<p>The loop continues. The problems remain. Only the tools change.</p>
<p>And that&rsquo;s okay. Isn&rsquo;t that somehow beautiful? Because beneath all the hype, all the new frameworks, all the promises of &ldquo;this time it&rsquo;s different&rdquo;—there&rsquo;s you, the data engineer 😉. Understanding the data. Knowing the business. Building the foundation.</p>
<p>That&rsquo;s <em>Data Engineering</em>.</p>
<blockquote>
<p><strong>Inspiration</strong></p>
<p><em>This piece was inspired by the confessional storytelling style of <a href="https://www.youtube.com/@TheDiaryOfACEO" target="_blank" rel="noopener noreffer">Diary of a CEO</a>. If you enjoyed this format applied to data engineering, let me know—I&rsquo;d love to hear your own stories from the field.</em></p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>I wrote more in Will AI Replace Human Thinking.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>I collect Books of Data Engineering in my data engineering brain; you&rsquo;ll find more interesting ones there too.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
</item>
</channel>
</rss>
