Open-Source Data Engineering Projects
This note is for data engineers and developers. Here are some open-source data engineering projects that you can explore:
# My Projects
- Real estate dagster pipeline: A practical data engineering project for processing real estate data. Accompanied by a blog article: Building a Data Engineering Project in 20 Minutes.
- Open Data Stack Projects: Examples of end-to-end data engineering projects using the Open Data Stack (e.g. dbt, Airbyte, Dagster, Metabase/Rill).
- Airbyte Monitoring with dbt and Metabase: Monitoring Airbyte with dbt and Metabase. GitHub Code
# Other Projects
- ELT with Dagster: An ELT (Extract, Load, Transform) project using Dagster.
- Data Engineering Project for Beginners - Batch edition: A beginner-friendly data engineering project focusing on batch processing.
- HashtagCashtag: A project related to analyzing hashtags and cashtags.
- Github Data Analytics: Analyzing GitHub repositories with a large amount of code.
- Crawling For Inflation: A project focused on crawling data for analyzing inflation.
- Data Engineer Roadmaps as Projects Funnel by Mehdi Ouazza: Roadmaps for data engineers presented as projects.
- New Generation Opensource Data Stack Demo: A demonstration of the new generation open-source data stack using Iceberg, Spark, Trino, and Dagster. Accompanied by an article, Modern, open-source data stack demo, which refers to this stack as ngods (new generation open-source data stack).
- RustCheatersDataPipeline: A data pipeline that scrapes Rust cheater Steam profiles.
- Monte Carlo simulation of the NBA season: A project leveraging Meltano, dbt, DuckDB, and Superset to simulate the NBA season. Accompanied by a blog article: Modern Data Stack in a Box with DuckDB.
- Tinkering around with a bunch of open source data tools: Exploring various open-source data tools.
- Build a poor man’s data lake from scratch with DuckDB: A tutorial on building a data lake from scratch using DuckDB.
- Repo for orienting dbt users to the Dagster asset framework: A repository to introduce dbt users to the Dagster asset framework.
- Stock Market Real-Time Data Analysis Using Kafka: An end-to-end data engineering project utilizing Kafka for real-time stock market data analysis. Accompanied by the GitHub repository: stock-market-kafka-data-engineering-project.
- LinkedIn post: A LinkedIn post related to data engineering projects.
- Building blocks for Apache Airflow: A collection of building blocks for Apache Airflow available in the Astronomer Registry.
- Example end-to-end data engineering project: An example project showcasing end-to-end data engineering.
- Fake Star Detector: A project utilizing Dagster to detect fake stars.
- Demo data stack using Dagster, Airbyte, and dbt: A demo data stack using Dagster, Airbyte, and dbt.
- Data Engineering Practice Problems: Practice problems for data engineering.
- News ETL CAPSTONE project for Data Engineering Bootcamp: A news ETL (Extract, Transform, Load) project serving as a capstone project for a Data Engineering Bootcamp.
- Shopify ETL: An ETL project specifically designed for Shopify.
- Weather app with Timeseries data using AWS, TDengine, Docker and Grafana: A project by Andreas Kretz for loading time-series data. See the Video.
- Query your Delta Lake table with Apache Arrow and Datafusion using ROAPI: Build an Apache Spark image with Delta Lake installed, run a container, and follow the quickstart in an interactive notebook or shell with any of the options like Python, PySpark, Scala Spark or even Rust.
- Open and local-first data hub for Filecoin: Contains code and artifacts to help process Filecoin data from on-chain and off-chain sources.
- Real Estate Prices ETL: ETL to scrape a real estate website, process house prices and data, and build an ML model of the house prices (based on my real estate project).
- MDS Fest Demo by Pedram Navid: Demo Project for an Open Data Stack. Also check the YouTube Video.
- Stream processing pipeline from Finnhub websocket using Spark, Kafka, Kubernetes: A streaming data pipeline based on Finnhub.io API/websocket real-time trading data, created for the author’s master’s thesis on stream processing. It showcases key aspects of streaming pipeline development and architecture, providing low latency, scalability, and availability.
- Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP: A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more.
- GitHub - ABZ-Aaron/Reddit-API-Pipeline: A data pipeline to extract Reddit data from r/dataengineering. The output is a Google Data Studio report, providing insight into the official Data Engineering subreddit.
- Pipeline that extracts data from Crinacle’s Headphone and InEarMonitor databases and finalizes data for a Metabase Dashboard: Scrape data from Crinacle’s website to generate bronze data. Load bronze data to AWS S3. Parse and validate the data with Pydantic to generate silver data. Load silver data to AWS S3. Load silver data to AWS Redshift. Load silver data to AWS RDS for future projects. Transform and test data through dbt in the warehouse.
- Surfline Dashboard: The pipeline collects data from the Surfline API and exports a CSV file to S3. The most recent file in S3 is then downloaded and ingested into the Postgres data warehouse: a temp table is created, and the unique rows are inserted into the data tables. Airflow is used for orchestration and is hosted locally with docker-compose and MySQL. Postgres also runs locally in a Docker container. The data dashboard runs locally with Plotly.
- SunglassesHub ETL: A project for demonstrating knowledge of data engineering tools and concepts, and for learning in the process.
- ETL-Houseslima-Properati: This project creates a pipeline that takes data from the Properati web page (Properati is a real estate search site), processes it using Lambda functions and, finally, stores it in a Redshift database. The pipeline is orchestrated using AWS Step Functions and scheduled with AWS EventBridge. It also includes a Flask REST API to interact with the database. The Flask app allows data to be retrieved and is hosted in the AWS Lightsail container service.
- Uber Data Analytics | Modern Data Engineering GCP Project: Performs data analytics on Uber data using various tools and technologies, including GCP Storage, Python, Compute Instance, Mage Data Pipeline Tool, BigQuery, and Looker Studio. Video Link
- Beginner DE Project - Batch Edition: Orchestration with Airflow. Classifying movie reviews with Apache Spark. Loading the classified movie reviews into the data warehouse. Extract user purchase data from an OLTP database and load it into the data warehouse. Joining the classified movie review data and user purchase data to get user behavior metric data. Blog Article
- Building Spotify ETL using Python and Airflow: How to create a simple ETL (Extract, Transform, Load) pipeline using Python and automate the process through Apache Airflow.
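Several of the pipelines above, such as Surfline Dashboard, stage a batch in a temp table and then insert only the rows that are not already in the target table. A minimal sketch of that pattern follows. It assumes SQLite in place of Postgres, and the `surf_reports` table, its columns, and the `load_unique` helper are hypothetical names chosen for illustration, not taken from any of the linked repositories.

```python
import sqlite3


def load_unique(conn, rows):
    """Stage rows in a temp table, then insert only rows new to the target.

    Returns the number of rows actually added to the target table.
    """
    cur = conn.cursor()
    # Hypothetical target table; (spot, ts) identifies a unique reading.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS surf_reports ("
        "spot TEXT, ts TEXT, wave_height REAL, "
        "PRIMARY KEY (spot, ts))"
    )
    before = cur.execute("SELECT COUNT(*) FROM surf_reports").fetchone()[0]

    # Load the whole batch into a staging table first, duplicates included.
    cur.execute(
        "CREATE TEMP TABLE staging (spot TEXT, ts TEXT, wave_height REAL)"
    )
    cur.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)

    # INSERT OR IGNORE is SQLite syntax: rows that collide with the primary
    # key are skipped. In Postgres the equivalent is ON CONFLICT DO NOTHING.
    cur.execute(
        "INSERT OR IGNORE INTO surf_reports (spot, ts, wave_height) "
        "SELECT spot, ts, wave_height FROM staging"
    )

    cur.execute("DROP TABLE staging")
    conn.commit()
    after = cur.execute("SELECT COUNT(*) FROM surf_reports").fetchone()[0]
    return after - before
```

Calling `load_unique(sqlite3.connect(":memory:"), rows)` twice with overlapping batches adds each `(spot, ts)` pair only once, which is what makes re-running a failed load safe.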
# GitHub Lists
- DE projects: A curated list of data engineering projects by sspaeti.
- Awesome List of Open-Source Data Engineering Projects: An awesome list of open-source data engineering projects by Gunnar Morling.
# Platforms or Frameworks
- YETL: Yet Another (Spark) ETL Framework.
- Bridge Four: A simple, functional, effectful, single-leader, multi-worker, distributed compute system optimized for embarrassingly parallel workloads.
- Big Data Pipeline Recipe: A recipe for building big data pipelines.
- Top 12 Data Engineering Project Ideas: In-depth data engineering projects with source code included.
- 5 Data Engineering Projects Ideas To Put On Your Resume: A video suggesting five data engineering project ideas that can enhance your resume.
- How To Start A Data Engineering Project - With Data Engineering Project Ideas: A video guide on how to start a data engineering project along with project ideas.
Refer to the article on Learning Data Engineering for information on relevant bootcamps.