Open-Source Data Engineering Projects

Last updated Nov 13, 2024

This note is for data engineers and developers. Here are some open-source data engineering projects that you can explore:

# My Projects

# Other Projects


| Name | Description |
| --- | --- |
| ELT with Dagster | An ELT (Extract, Load, Transform) project using Dagster. |
| Data Engineering Project for Beginners - Batch edition | A beginner-friendly data engineering project focusing on batch processing. |
| HashtagCashtag | A project related to analyzing hashtags and cashtags. |
| Github Data Analytics | Analyzing GitHub repositories with a large amount of code. |
| Crawling For Inflation | A project focused on crawling data for analyzing inflation. |
| New Generation Opensource Data Stack Demo | A demonstration of the new generation open-source data stack using Iceberg, Spark, Trino, and Dagster. Accompanied by an article: Modern, open-source data stack demo, which calls the modern data stack ngods (new generation open-source data stack). |
| RustCheatersDataPipeline | A data pipeline that scrapes Rust cheater Steam profiles. |
| Monte Carlo simulation of the NBA season | A project leveraging Meltano, dbt, DuckDB, and Superset to simulate the NBA season. Accompanied by a blog article: Modern Data Stack in a Box with DuckDB. |
| Tinkering around with a bunch of open source data tools | Exploring various open-source data tools. |
| Build a poor man’s data lake from scratch with DuckDB | A tutorial on building a data lake from scratch using DuckDB. |
| Repo for orienting dbt users to the Dagster asset framework | A repository to introduce dbt users to the Dagster asset framework. |
| Stock Market Real-Time Data Analysis Using Kafka | An end-to-end data engineering project utilizing Kafka for real-time stock market data analysis. Accompanied by the GitHub repository: stock-market-kafka-data-engineering-project. |
| LinkedIn post | A LinkedIn post related to data engineering projects. |
| Building blocks for Apache Airflow | A collection of building blocks for Apache Airflow available in the Astronomer Registry. |
| Example end-to-end data engineering project | An example project showcasing end-to-end data engineering. |
| Fake Star Detector | A project utilizing Dagster to detect fake stars. |
| Demo data stack using Dagster, Airbyte, and dbt | A demo data stack using Dagster, Airbyte, and dbt. |
| Data Engineering Practice Problems | Practice problems for data engineering. |
| News ETL CAPSTONE project for Data Engineering Bootcamp | A news ETL (Extract, Transform, Load) project serving as a capstone project for a Data Engineering Bootcamp. |
| Shopify ETL | An ETL project specifically designed for Shopify. |
| Weather app with Timeseries data using AWS, TDengine, Docker and Grafana | Loading time-series data, by Andreas Kretz. See the video. |
| Query your Delta Lake table with Apache Arrow and Datafusion using ROAPI | Build an Apache Spark image with Delta Lake installed, run a container, and follow the quickstart in an interactive notebook or shell with any of the options like Python, PySpark, Scala Spark or even Rust. |
| Open and local-first data hub for Filecoin | Contains code and artifacts to help process Filecoin data from on-chain and off-chain sources. |
| Real Estate Prices ETL | ETL to scrape a real estate website, process house prices and data, and build an ML model of the house prices (based on my real estate project). |
| MDS Fest Demo | By Pedram Navid: demo project for an open data stack. Also check the YouTube video. |
| Stream processing pipeline from Finnhub websocket using Spark, Kafka, Kubernetes | Streaming data pipeline based on Finnhub.io API/websocket real-time trading data, created for the author’s master’s thesis on stream processing. Showcases key aspects of streaming pipeline development and architecture, providing low latency, scalability, and availability. |
| Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP | A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more. |
| GitHub - ABZ-Aaron/Reddit-API-Pipeline | A data pipeline to extract Reddit data from r/dataengineering. The output is a Google Data Studio report, providing insight into the official Data Engineering subreddit. |
| Pipeline that extracts data from Crinacle’s Headphone and InEarMonitor databases and finalizes data for a Metabase Dashboard. | Scrape data from Crinacle’s website to generate bronze data. Load bronze data to AWS S3. Initial data parsing and validation through Pydantic to generate silver data. Load silver data to AWS S3. Load silver data to AWS Redshift. Load silver data to AWS RDS for future projects. Transform and test data through dbt in the warehouse. |
| Surfline Dashboard | The pipeline collects data from the Surfline API and exports a CSV file to S3. Then the most recent file in S3 is downloaded to be ingested into the Postgres data warehouse. A temp table is created and then the unique rows are inserted into the data tables. Airflow is used for orchestration and hosted locally with docker-compose and MySQL. Postgres is also running locally in a Docker container. The data dashboard is run locally with Plotly. |
| SunglassesHub ETL | This project is for demonstrating knowledge of data engineering tools and concepts, and also for learning in the process. |
| ETL-Houseslima-Properati | This project creates a pipeline that takes data from the Properati web page (Properati is a real estate search site), processes it using Lambda functions and, finally, stores it in a Redshift database. This pipeline is orchestrated using AWS Step Functions and scheduled with AWS EventBridge. Also, we’ll build a Flask REST API to interact with the database. This Flask app allows us to retrieve data and is hosted in the AWS Lightsail container service. |
| Uber Data Analytics: Modern Data Engineering GCP Project | Performs data analytics on Uber data using various tools and technologies, including GCP Storage, Python, Compute Instance, Mage Data Pipeline Tool, BigQuery, and Looker Studio. Video Link |
| Beginner DE Project - Batch Edition | Orchestration with Airflow. Classifying movie reviews with Apache Spark. Loading the classified movie reviews into the data warehouse. Extracting user purchase data from an OLTP database and loading it into the data warehouse. Joining the classified movie review data and user purchase data to get user behavior metric data. Blog Article |
| Building Spotify ETL using Python and Airflow | How to create a simple ETL (Extract, Transform, Load) pipeline using Python and automate the process through Apache Airflow. |
| Sample project to demonstrate data engineering best practices | Assume we are extracting customer and order information from upstream sources and creating a daily report of the number of orders. Blog |
| Quickstarts by Airbyte | A collection of DE projects with various templates to help you quickly build your data stack tailored to different domains like Marketing, Product, Finance, Operations, and more. |
| Free Course with Examples by DataTalksClub | You learn containerization and infrastructure as code, workflow orchestration, data ingestion, data warehouse, analytics engineering, batch processing, and streaming. |
| ELT (Extract, Load, Transform) pipeline | A pipeline using Polars to store data in DeltaTables within local S3 object storage (MinIO), orchestrated with Airflow. |
| Dagster scikit-learn Pipeline Example | A trivial, contrived example of using Dagster to build a scikit-learn pipeline. |
| Local Dev, Cloud Prod with Dagster, MotherDuck and Evidence | Demo project that showcases Dagster, MotherDuck and Evidence (WebAssembly + DuckDB). See the YouTube video too. |
| Multistack dbt, snowflake, duckdb and Iceberg | Proof of concept on how to use dbt with Iceberg to run a multi-stack environment. |
| PyPi DuckDB Flow | End-to-end data engineering project to get insights from PyPI using Python and DuckDB. |
| Personalizing Warcraft Logs and Building a “Personal Project” Stack | Lior takes us on a ride with building his internal ranking system for World of Warcraft, based on his “personal project” stack, as he likes to call it, using DuckDB, Python, and dbt. The project extracts data from Warcraft Logs (GraphQL API), loads it into DuckDB, and transforms it into meaningful dimensions and facts. This is an awesome use case, as it’s end-to-end: loading from an API, storing the data, and processing it in a meaningful way for the presentation layer. He uses Streamlit to create hosted visuals. |
| “Efficient Data Processing in Spark” Course | Code for the “Efficient Data Processing in Spark” course, including Delta, MinIO (S3 replica), Jupyter Notebook, Postgres, and handy Make commands. |
| DuckDB Tutorial: Building AI Projects | This tutorial guides you through DuckDB’s key features and practical applications, including building tables, performing data analysis, building a RAG application, and using an SQL query engine with an LLM. |
| Orchestrate Modal and OpenAI workloads with Dagster - YT Video | Detects newly published podcasts, transcribes them using the power of GPUs, and notifies you with the summary so that you can focus on the things that truly matter. |
| A declarative programming ELT Project with Kestra | A self-contained project exploring a full ELT project using Kestra, dbt, DuckDB, Neon Postgres and Resend. |
| Serverless Data Pipelines | Serverless data pipelines with Kestra, Modal, dbt, and BigQuery. |
| Cost Efficient Data Pipelines with DuckDB | An example alternative to high-load cluster systems like Kafka, Apache Spark, and Snowflake. Blog |
| DuckDB Medallion Architecture Pipeline | Demonstrating the capabilities of DuckDB as a transformation engine for data lakes. |
| Datafold dbt CI advanced | Four integrations to your dbt CI pipeline: Slim CI, pre-commit hooks, Data Diffs, and Slack notifications. |
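
Several of the projects above load data through a staging step. The Surfline Dashboard entry, for instance, mentions creating a temp table and then inserting only the unique rows into the data tables. As a rough illustration of that pattern (not the project's actual code), here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for Postgres. The table names, columns, and the `load_unique_rows` helper are all hypothetical, and the sketch assumes duplicates within a batch are exact copies.

```python
import sqlite3


def load_unique_rows(conn, rows):
    """Stage rows in a temp table, then insert only rows not already in the target.

    Hypothetical schema for illustration: (spot_id, observed_at, wave_height_ft).
    Returns the number of rows added to the target table.
    """
    cur = conn.cursor()
    # Target table; the composite primary key defines what "unique" means here.
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS surf_report (
            spot_id TEXT NOT NULL,
            observed_at TEXT NOT NULL,
            wave_height_ft REAL,
            PRIMARY KEY (spot_id, observed_at)
        )
        """
    )
    # Fresh staging table for this batch.
    cur.execute("DROP TABLE IF EXISTS staging_surf_report")
    cur.execute(
        "CREATE TEMP TABLE staging_surf_report "
        "(spot_id TEXT, observed_at TEXT, wave_height_ft REAL)"
    )
    cur.executemany("INSERT INTO staging_surf_report VALUES (?, ?, ?)", rows)

    before = cur.execute("SELECT COUNT(*) FROM surf_report").fetchone()[0]
    # Copy over only rows whose key is not already present in the target.
    cur.execute(
        """
        INSERT INTO surf_report (spot_id, observed_at, wave_height_ft)
        SELECT DISTINCT s.spot_id, s.observed_at, s.wave_height_ft
        FROM staging_surf_report AS s
        WHERE NOT EXISTS (
            SELECT 1 FROM surf_report AS t
            WHERE t.spot_id = s.spot_id AND t.observed_at = s.observed_at
        )
        """
    )
    after = cur.execute("SELECT COUNT(*) FROM surf_report").fetchone()[0]

    cur.execute("DROP TABLE staging_surf_report")
    conn.commit()
    return after - before


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    batch = [("spot-1", "2024-01-01T06:00", 3.0), ("spot-1", "2024-01-01T06:00", 3.0)]
    print(load_unique_rows(conn, batch))  # first load
    print(load_unique_rows(conn, batch))  # re-running the same batch
```

Staging first keeps the load idempotent: re-running the same file does not duplicate rows. In Postgres the same idea is often written with `INSERT ... ON CONFLICT DO NOTHING` instead of the `NOT EXISTS` filter.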

# Personal Finance Data Projects

# Templates

# GitHub Lists

We are open-sourcing these pipelines as an educational reference for how we do data engineering, build software, and run our analytics.

# Platforms or Frameworks

# Articles

# YouTube

# Bootcamps

Refer to the article on Learning Data Engineering for information on relevant bootcamps.

For additional resources, check out Find good Data Sets or Sources and Open-Source ML Projects.


Created 2022-08-02