Open-Source Data Engineering Projects

Last updated: Mar 4, 2026 by Simon Späti · Created: Aug 2, 2022

Opendatastack

This note is for data engineers and developers. Here are some open-source data engineering projects that you can explore:

# My Projects

Real estate dagster pipeline: A practical data engineering project for processing real estate data. Accompanied by a blog article: Building a Data Engineering Project in 20 Minutes.
Open Data Stack Projects: Examples of end-to-end data engineering projects using the Open Data Stack (e.g. dbt, Airbyte, Dagster, Metabase/Rill).
Airbyte Monitoring with dbt and Metabase: Monitoring Airbyte with dbt and Metabase. GitHub Code
«Open Enterprise Data Platform»: Integrates the prowess of open-source tools into a unified, enterprise-grade data platform. It simplifies end-to-end data engineering by converging tools like dbt, Airflow, and Superset, anchored on a robust Postgres database.
Example Pipeline with Airflow KubernetesPodOperator and dbt: Downloads ~150 CSVs, inserts into Postgres, and runs dbt. Everything is runnable with Astro CLI. A good example of how to use KubernetesPodOperator locally and in production.
Data engineering Kestra example with dlt and Snowflake: Learn to build enterprise data pipelines using Kestra orchestrator, dlt, and Snowflake. Covers secrets management, git sync, and AI integration.
Testing Boring Semantic Layer with DuckDB: Creating a simple Semantic Layer with DuckDB.
Data Modeling Guide: ClickHouse ETL and Rill with NOAA Weather Data Example: Learn how to build sub-second real-time analytics with ClickHouse. Complete guide covering data modeling strategies, optimization techniques, and practical S3-to-dashboard examples.
Cloud Cost Analyzer: An open-source framework for multi-cloud cost visibility. Extendable with dlt. Accompanying Blog Post.
💎 Obsidian Note Taking Assistant: Local-first Second Brain RAG for any Obsidian Vault (works with any Markdown). Works with DuckDB. Also includes a web app. Blog, GitHub

# Other Projects

Name	Description
ELT with Dagster	An ELT (Extract, Load, Transform) project using Dagster.
Data Engineering Project for Beginners - Batch edition	A beginner-friendly data engineering project focusing on batch processing.
HashtagCashtag	A project related to analyzing hashtags and cashtags.
Github Data Analytics	Analyzing GitHub repositories with a large amount of code.
Crawling For Inflation	A project focused on crawling data for analyzing inflation.
New Generation Opensource Data Stack Demo	A demonstration of the new generation open-source data stack using Iceberg, Spark, Trino, and Dagster. Accompanied by an article: Modern, open-source data stack demo, calling the modern data stack ngods (new generation open-source data stack).
RustCheatersDataPipeline	A data pipeline that scrapes Rust cheater Steam profiles.
Monte Carlo simulation of the NBA season	A project leveraging meltano, dbt, duckdb, and superset to simulate the NBA season. Accompanied by a blog article: Modern Data Stack in a Box with DuckDB.
Tinkering around with a bunch of open source data tools	Exploring various open-source data tools.
Build a poor man’s data lake from scratch with DuckDB	A tutorial on building a data lake from scratch using DuckDB.
Repo for orienting dbt users to the Dagster asset framework	A repository to introduce dbt users to the Dagster asset framework.
Stock Market Real-Time Data Analysis Using Kafka	An end-to-end data engineering project utilizing Kafka for real-time stock market data analysis. Accompanied by the GitHub repository: stock-market-kafka-data-engineering-project.
LinkedIn post	A LinkedIn post related to data engineering projects.
Building blocks for Apache Airflow	A collection of building blocks for Apache Airflow available in the Astronomer Registry.
Example end-to-end data engineering project	An example project showcasing end-to-end data engineering.
Fake Star Detector	A project utilizing Dagster to detect fake stars.
Demo data stack using Dagster, Airbyte, and dbt	A demo data stack using Dagster, Airbyte, and dbt.
Data Engineering Practice Problems	Practice problems for data engineering.
News ETL CAPSTONE project for Data Engineering Bootcamp	A news ETL (Extract, Transform, Load) project serving as a capstone project for a Data Engineering Bootcamp.
Shopify ETL	An ETL project specifically designed for Shopify.
Weather app with Timeseries data using AWS, TDengine, Docker and Grafana	Load time-series data by Andreas Kretz, see the Video.
Query your Delta Lake table with Apache Arrow and Datafusion using ROAPI	Build an Apache Spark image with Delta Lake installed, run a container, and follow the quickstart in an interactive notebook or shell with any of the options like Python, PySpark, Scala Spark or even Rust.
Open and local-first data hub for Filecoin	Contains code and artifacts to help process Filecoin data from on-chain and off-chain sources.
Real Estate Prices ETL	ETL to scrape a real estate website, process house prices and data, and build an ML model of the house prices (based on my real estate project).
MDS Fest Demo	by Pedram Navid: Demo Project for an Open Data Stack. Also check the YouTube Video.
Stream processing pipeline from Finnhub websocket using Spark, Kafka, Kubernetes	Streaming data pipeline based on Finnhub.io API/websocket real-time trading data, created for the sake of a master’s thesis related to stream processing. Showcases key aspects of streaming pipeline development and architecture, providing low latency, scalability, and availability.
Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP	A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more.
GitHub - ABZ-Aaron/Reddit-API-Pipeline	A data pipeline to extract Reddit data from r/dataengineering. The output is a Google Data Studio report, providing insight into the Data Engineering official subreddit.
Pipeline that extracts data from Crinacle’s Headphone and InEarMonitor databases and finalizes data for a Metabase Dashboard.	Scrape data from Crinacle’s website to generate bronze data. Load bronze data to AWS S3. Initial data parsing and validation through Pydantic to generate silver data. Load silver data to AWS S3. Load silver data to AWS Redshift. Load silver data to AWS RDS for future projects and transform and test data through dbt in the warehouse.
Surfline Dashboard	The pipeline collects data from the Surfline API and exports a CSV file to S3. The most recent file in S3 is then downloaded and ingested into the Postgres data warehouse: a temp table is created, and the unique rows are inserted into the data tables. Airflow is used for orchestration and hosted locally with docker-compose and MySQL. Postgres also runs locally in a Docker container. The data dashboard runs locally with Plotly.
SunglassesHub ETL	This project is for demonstrating knowledge of Data Engineering tools and concepts and also learning in the process.
ETL-Houseslima-Properati	This project creates a pipeline that takes data from the Properati web page (Properati is a real estate search site), processes it using Lambda functions, and finally stores it in a Redshift database. The pipeline is orchestrated using AWS Step Functions and scheduled with AWS EventBridge. It also includes a Flask REST API to interact with the database. The Flask app allows data retrieval and is hosted in the AWS Lightsail container service.
Uber Data Analytics: Modern Data Engineering GCP Project	Performs data analytics on Uber data using various tools and technologies, including GCP Storage, Python, Compute Instance, Mage Data Pipeline Tool, BigQuery, and Looker Studio. Video Link
Beginner DE Project - Batch Edition	Orchestration with Airflow. Classifying movie reviews with Apache Spark. Loading the classified movie reviews into the data warehouse. Extract user purchase data from an OLTP database and load it into the data warehouse. Joining the classified movie review data and user purchase data to get user behavior metric data. Blog Article
Building Spotify ETL using Python and Airflow	How to create a simple ETL (Extract, Transform, Load) pipeline using Python and automate the process through Apache Airflow.
Sample project to demonstrate data engineering best practices	Assume we are extracting customer and order information from upstream sources and creating a daily report of the number of orders. Blog
Quickstarts by Airbyte	A collection of DE projects with various templates to help you quickly build your data stack tailored to different domains like Marketing, Product, Finance, Operations, and more.
Free Course with Examples by DataTalksClub	You learn containerization and infrastructure as code, workflow orchestration, data ingestion, data warehouse, analytics engineering, batch processing, and streaming.
ELT (Extract, Load, Transform) pipeline	A pipeline using Polars to store data in DeltaTables within local S3 object storage (MinIO), orchestrated with Airflow.
Dagster scikit-learn Pipeline Example	A trivial, contrived example of using Dagster to build a scikit-learn pipeline.
Local Dev, Cloud Prod with Dagster, MotherDuck and Evidence	Demo project that showcases Dagster, MotherDuck and Evidence (WebAssembly + DuckDB). See the YouTube video too.
Multistack dbt, snowflake, duckdb and Iceberg	Proof of concept on how to use dbt with iceberg to run a multi-stack environment.
PyPi DuckDB Flow	End-to-end data engineering project to get insights from PyPi using python and duckdb.
Personalizing Warcraft Logs and Building a “Personal Project” Stack	Lior takes us on a ride through building his internal ranking system for World of Warcraft on his “personal project” stack, as he likes to call it, using DuckDB, Python, and dbt. The project extracts data from Warcraft Logs (GraphQL API), loads it into DuckDB, and transforms it into meaningful dimensions and facts. This is an awesome use case because it is end-to-end: loading from an API, storing the data, and processing it in a meaningful way for the presentation layer. He uses Streamlit to create hosted visuals.
“Efficient Data Processing in Spark” Course	Code for the “Efficient Data Processing in Spark” course, including Delta, Minio (S3 replica), Jupyter Notebook, Postgres, and handy Make commands.
DuckDB Tutorial: Building AI Projects	This tutorial guides you through DuckDB’s key features and practical applications, including building tables, performing data analysis, building an RAG application, and using an SQL query engine with LLM.
Orchestrate Modal and OpenAI workloads with Dagster - YT Video	Detects newly published podcasts, transcribes them using the power of GPUs, and notifies you with the summary so that you can focus on the things that truly matter.
1. A declarative programming ELT Project with Kestra 2. Serverless Data Pipelines	1. A self-contained project exploring a full ELT project using Kestra, dbt, DuckDB, Neon Postgres and Resend. 2. Serverless data pipelines with Kestra, Modal, dbt, and BigQuery.
Cost Efficient Data Pipelines with DuckDB	An example of an alternative to heavyweight cluster systems like Kafka, Apache Spark, and Snowflake. Blog
DuckDB Medallion Architecture Pipeline	Demonstrating the capabilities of DuckDB as a transformation engine for data lakes
Datafold dbt CI advanced	Four integrations to your dbt CI pipeline: Slim CI, pre-commit hooks, Data Diffs, and Slack notifications
Finnhub Streaming Data Pipeline	Stream processing pipeline from Finnhub websocket using Spark, Kafka, Kubernetes and more.
From Inbox to Insights: AI-enhanced email analysis with dlt and Kestra	A demo project that shows the automation of a workflow in Kestra. It demonstrates loading data from Gmail into BigQuery using dlt (data load tool) and includes AI analysis, specifically summarization and sentiment assessment.
Open Data Stack Platform: a collection of projects and pipelines built with open data stack tools	Build a modern data analytics and machine learning platform that provides a scalable foundation for ingesting, transforming, and analyzing data. Using NYC taxi trip data as a practical example, this platform demonstrates how to build robust data pipelines with modern tooling.
DuckHouse: DuckDB + Iceberg + Flight	Building a DWH on top of DuckDB?
Data Engineering for Beginners Code	Capstone project that puts together all the in-demand tools.
MCP server for querying Apple Health data with natural language and SQL	MCP server for querying Apple Health with DuckDB
DWH-to-go	A “one-click” data platform deployed with Terraform. The project aims to create a reusable, cost-effective setup that leverages free-tier offerings from various providers such as AWS (12-month Free Tier), Snowflake (30-day Free Trial), dbt Cloud (Free Developer Plan for one project) and Metabase (running in a container).
End2End-Data-Pipeline	An end-to-end, containerized data pipeline for near-real-time user event analytics using Kafka, ClickHouse, Airflow, and PySpark.
Local Data Stack	This project demos how to integrate duckdb and delta lake with dbt, orchestrated by Dagster.
AiAgentPDFtoJSON	Building an AI Agent data pipeline to process PDF -> JSON files with local Ollama, by Daniel Beach. Accompanying Blog Post
Portable Analytics Stack	Ephemeral compute, shared state, and object storage with no warehouse required. A practical portable stack combining dlt, SQLMesh, DuckDB, and DuckLake on Cloudflare R2.
Agentic data Stack	The official ClickHouse Agentic Data Stack, self-hosted with ClickHouse, LibreChat, Langfuse, and MCP. See the accompanying blog post.
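
Several of the pipelines listed above (the Surfline Dashboard, for example) load a batch into a temporary table and then insert only the unique rows into the target table. Here is a minimal sketch of that pattern, using Python's built-in `sqlite3` as a stand-in for Postgres. The function name `load_unique`, the table `surf_report`, and its columns are hypothetical and not taken from any of the listed projects.

```python
import sqlite3


def load_unique(conn, rows):
    """Stage rows in a temp table, then insert only rows missing from the target.

    Hypothetical schema: (spot, ts) identifies a reading; wave_height is the value.
    Returns the number of rows actually added to the target table.
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS surf_report ("
        "spot TEXT, ts TEXT, wave_height REAL, PRIMARY KEY (spot, ts))"
    )
    before = cur.execute("SELECT COUNT(*) FROM surf_report").fetchone()[0]

    # Stage the raw batch; duplicates are allowed here.
    cur.execute("DROP TABLE IF EXISTS staging")
    cur.execute("CREATE TEMP TABLE staging (spot TEXT, ts TEXT, wave_height REAL)")
    cur.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)

    # Copy over only rows whose key is not already present in the target.
    # DISTINCT removes exact duplicates within the batch itself.
    cur.execute(
        "INSERT INTO surf_report (spot, ts, wave_height) "
        "SELECT DISTINCT s.spot, s.ts, s.wave_height FROM staging s "
        "WHERE NOT EXISTS ("
        "  SELECT 1 FROM surf_report t WHERE t.spot = s.spot AND t.ts = s.ts)"
    )

    cur.execute("DROP TABLE staging")
    conn.commit()
    after = cur.execute("SELECT COUNT(*) FROM surf_report").fetchone()[0]
    return after - before
```

Calling `load_unique(sqlite3.connect(":memory:"), batch)` twice with overlapping batches adds each reading only once. The sketch assumes that rows sharing a key within one batch are exact duplicates; a real pipeline would need a rule for conflicting values.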

# Personal Finance Data Projects

Repo to set up automated finances: This project is dedicated to setting up automated personal finance management.
Elias Benaddou Idrissi’s Data Engineering Project on Monzo: A detailed look into a data engineering project focusing on Monzo, providing insights and methodologies.
Personal Finance Project to automatically collect Swiss banking transactions into a DWH and visualize it: A project aimed at automating the collection of Swiss banking transactions into a Data Warehouse and providing visualization capabilities.
datadex: Bring the modern data stack to Open Data: Datadex focuses on bringing the modern data stack into the realm of Open Data, showcasing innovative approaches and techniques.

# Templates

l-mds/local-data-stack: Serves as a template for getting started with a local modern data stack.

# GitHub Lists

DE projects: A curated list of data engineering projects by me on GitHub.
Awesome List of Open-Source Data Engineering Projects: An awesome list of open-source data engineering projects by Gunnar Morling.
Data Engineering Handbook: Lists helpful resources such as certification courses, Communities, Conferences, Data Engineering Whitepapers, Great Podcasts, Great YouTube Channels, Great books, Newsletters, and People from LinkedIn & Twitter.
Dagster Open Platform: Contains production Dagster assets used by the Dagster team, showcasing what a high-growth startup uses for its data pipelines.

We are open-sourcing these pipelines as an educational reference for how we do data engineering, build software, and run our analytics.

# Platforms or Frameworks

YETL: Yet Another (Spark) ETL Framework.
Bridge Four: A simple, functional, effectful, single-leader, multi-worker, distributed compute system optimized for embarrassingly parallel workloads.

# Articles

Big Data Pipeline Recipe: A recipe for building big data pipelines.
Top 12 Data Engineering Project Ideas: In-depth data engineering projects with source code included.

# YouTube

5 Data Engineering Projects Ideas To Put On Your Resume: A video suggesting five data engineering project ideas that can enhance your resume.
How To Start A Data Engineering Project - With Data Engineering Project Ideas: A video guide on how to start a data engineering project along with project ideas.

# Bootcamps

Refer to the article on Learning Data Engineering for information on relevant bootcamps.

For additional resources, check out Find good Data Sets or Sources and Open-Source ML Projects.
