Data Engineering Whitepapers
A curated list of influential whitepapers in the field of data engineering.
# Data Lakehouse
Data Lakehouse combines the best of data warehouses and data lakes into a single architecture.
- Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics ^6a7f75
- Building a Database on S3 (2008) - Before Open Table Formats
# Distributed Systems & Storage
Foundational papers on distributed systems that power modern data infrastructure at scale.
- Google File System (GFS): The Google File System ^fdbf43
- Data Engineering Architecture: The Google File System (2003)
- MapReduce: MapReduce: Simplified Data Processing on Large Clusters
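The MapReduce model above can be sketched in a few lines: a map function emits key/value pairs, a shuffle groups them by key, and a reduce function aggregates each group. A minimal single-process word-count sketch, not Google's distributed implementation:

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit (word, 1) for every word in the input document.
    for word in doc.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the values for one key.
    return key, sum(values)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = (pair for doc in docs for pair in map_phase(doc))
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts["the"])  # 3
```

In the real system, map and reduce tasks run on different machines and the shuffle moves data over the network; the programming model, however, is exactly this small.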
# Data Warehousing & OLAP
Data warehouses and OLAP systems optimized for analytical queries over large datasets.
- Data Warehousing: Dremel: Interactive Analysis of Web-Scale Datasets
- Dremel Encoding: Dremel: Interactive Analysis of Web-Scale Datasets (2011), the nested-record encoding used in Apache Parquet and Apache Drill (which was seeded into Apache Arrow). ^51fc46
- DWA: DWA Whitepapers
- Redshift Files: Why TPC Is Not Enough: An Analysis of the Amazon Redshift Fleet ^2fbdb2
- OLAP - ClickHouse: Lightning Fast Analytics for Everyone
- The Snowflake Elastic Data Warehouse: “The mission was to build an enterprise-ready data warehousing solution for the cloud. The result is the Snowflake Elastic Data Warehouse, or “Snowflake” for short.”
- Scuba: Diving into Data at Facebook, a paper about Meta Scuba (2013) ^d3683f
- Followed up with Kraken, Meta’s Next-generation Realtime Monitoring and Analytics Platform (2022): historically, Meta Scuba has favored system availability, along with speed and freshness of results, over data completeness and durability. While these choices allowed Scuba to grow from terabyte scale to petabyte scale and continue onboarding a variety of use cases, they also came at the operational cost of dealing with incomplete data and managing data loss. ^e7bf40
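The Dremel encoding mentioned above shreds nested records into flat columns by storing a repetition level and a definition level alongside each value. A toy sketch for a single invented path `names[].url` (an assumed example schema, not Parquet's actual code):

```python
def shred(docs):
    """Toy Dremel-style shredding of the path names[].url into a flat
    column of (value, repetition_level, definition_level) triples.
    Max repetition level is 1 (one repeated field on the path); max
    definition level is 2 (names present -> 1, url also present -> 2)."""
    column = []
    for doc in docs:
        names = doc.get("names", [])
        if not names:
            # Nothing defined on the path: one NULL with r=0, d=0.
            column.append((None, 0, 0))
            continue
        for i, name in enumerate(names):
            r = 0 if i == 0 else 1      # 1 = repetition within names[]
            url = name.get("url")
            d = 2 if url is not None else 1
            column.append((url, r, d))
    return column

docs = [
    {"names": [{"url": "http://A"}, {}, {"url": "http://B"}]},
    {"names": []},
]
print(shred(docs))
# [('http://A', 0, 2), (None, 1, 1), ('http://B', 1, 2), (None, 0, 0)]
```

The levels are what make lossless reassembly of the nested records possible from purely columnar storage.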
# Processing Engines
Data processing frameworks for batch processing and streaming computation.
- Apache Spark: Spark: Cluster Computing with Working Sets
- Streaming: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
- Vectorized Engine: Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask ^db4178
- Theseus: A Distributed and Scalable GPU-Accelerated Query Processing Platform Optimized for Efficient Data Movement, built by Voltron Data ^932b35
- End-to-End Declarative Data Analytics: Co-designing Engines, Interfaces, and Cloud Infrastructure: An early prototype by ETH shows how rethinking the interface between the engine and the cloud platform enables elastic data-dependent parallel execution over data lakes, automatic caching, and opens new research directions for cloud analytics. ^b7bd10
- Watermarks: MillWheel: Fault-Tolerant Stream Processing at Internet Scale, by Google. MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees. ^287948
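The watermark idea above can be sketched with a tumbling window that fires once the watermark passes its end. Here the watermark is approximated as the maximum event time seen so far, a common simplification rather than MillWheel's exact low-watermark computation:

```python
def run(events, window_size):
    """Sketch of watermark-triggered tumbling windows over out-of-order
    events, given as (event_time, value) pairs in arrival order."""
    windows = {}                 # window_start -> buffered values
    watermark = float("-inf")
    results = []
    for t, v in events:
        start = (t // window_size) * window_size
        windows.setdefault(start, []).append(v)
        watermark = max(watermark, t)  # simplified watermark heuristic
        # Fire every window whose end the watermark has passed.
        for s in sorted(w for w in windows if w + window_size <= watermark):
            results.append((s, sorted(windows.pop(s))))
    # Flush whatever remains at end of input.
    for s in sorted(windows):
        results.append((s, sorted(windows.pop(s))))
    return results

out = run([(1, "a"), (3, "b"), (2, "c"), (7, "d"), (5, "e")], window_size=4)
print(out)  # [(0, ['a', 'b', 'c']), (4, ['d', 'e'])]
```

Note how the late-arriving event at time 2 still lands in window [0, 4) because the watermark had not yet passed 4; that trade-off between completeness and latency is the core of the Dataflow/MillWheel line of work.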
# DuckDB
DuckDB gets its own category as a single-file OLAP database.
- MotherDuck: DuckDB in the cloud and in the client - A paper that introduces the 1-5-Tier Architecture. ^9f7136
- Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age
- Unnesting Arbitrary Queries: how to unnest (decorrelate) subqueries in SQL.
- Beyond Quacking: Deep Integration of Language Models and RAG into DuckDB: This paper introduces FlockMTL, an extension for DuckDB that deeply integrates Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) capabilities into database management systems. See an implementation example
- Don’t Hold My Data Hostage – A Case For Client Protocol Redesign: shows that transferring a large amount of data from a database to a client program is a surprisingly expensive operation, and compares the client protocols of different systems. ^caa24f
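The subquery-unnesting paper in this list is about rewriting correlated subqueries into joins, so the inner query runs once instead of once per outer row. A minimal illustration using Python's built-in sqlite3 (the table and both queries are invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders(id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'alice', 10), (2, 'alice', 30), (3, 'bob', 20), (4, 'bob', 40);
""")

# Correlated subquery: conceptually re-evaluated once per outer row.
correlated = """
    SELECT id FROM orders o
    WHERE amount > (SELECT AVG(amount) FROM orders i
                    WHERE i.customer = o.customer)
    ORDER BY id
"""

# Unnested form: the subquery becomes a grouped join, evaluated once.
unnested = """
    SELECT o.id
    FROM orders o
    JOIN (SELECT customer, AVG(amount) AS avg_amount
          FROM orders GROUP BY customer) a
      ON a.customer = o.customer
    WHERE o.amount > a.avg_amount
    ORDER BY o.id
"""

assert con.execute(correlated).fetchall() == con.execute(unnested).fetchall()
print(con.execute(unnested).fetchall())  # [(2,), (4,)]
```

The paper's contribution is an algorithm that performs this kind of decorrelation mechanically for arbitrary queries; DuckDB implements it in its optimizer.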
# SQL
All about SQL, the domain-specific language to query databases and more.
- Spanner: Becoming a SQL System: Google Spanner evolved from a distributed key-value store into a full SQL database system ^786591A
- SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL ^663863
- What Goes Around Comes Around by Michael Stonebraker and Joseph Hellerstein: a summary of 35 years of data model proposals, grouped into 9 different eras. The paper examines the proposals of each era and shows that there are only a few basic data modeling ideas, and most have been around a long time. ^37c66e
- Semantic Data Modeling, Graph Query, and SQL, Together at Last? by Jeff Shute, Colin Zheng, and Romit Kudtarkar, CIDR 2026 (to appear): semantic data models express high-level business concepts and metrics, capturing the business logic needed to query a database correctly. ^e1b169
# Relational Model
Relational databases organize data into tables with rows and columns, pioneered by Edgar F. Codd.
- A Relational Model of Data for Large Shared Data Banks, 1970 by Edgar F. Codd ^8d1ede
- C-Store: A Column-oriented DBMS (2005): organizing data by columns rather than rows. ^fe90b8
- MonetDB/X100: Hyper-Pipelining Query Execution: the beginning of vectorized query execution, from the MonetDB team ^4e54dd
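The MonetDB/X100 idea above, vectorized execution, replaces tuple-at-a-time interpretation with primitives that each process a whole vector of values. A pure-Python sketch of the contrast (real engines use columnar arrays and tight compiled loops, not Python lists):

```python
# Tuple-at-a-time: the expression is re-interpreted for every row.
def scalar_filter_sum(rows):
    total = 0
    for price, qty in rows:
        if qty > 2:              # interpretation overhead paid per tuple
            total += price * qty
    return total

# Vectorized: each primitive processes a whole vector (batch) of values,
# amortizing interpretation overhead across the batch.
def vectorized_filter_sum(prices, qtys, vector_size=1024):
    total = 0
    for i in range(0, len(prices), vector_size):
        p, q = prices[i:i + vector_size], qtys[i:i + vector_size]
        mask = [x > 2 for x in q]                # selection primitive
        prods = [a * b for a, b in zip(p, q)]    # multiplication primitive
        total += sum(v for v, keep in zip(prods, mask) if keep)  # aggregate
    return total

rows = [(10, 1), (5, 3), (2, 4)]
prices, qtys = map(list, zip(*rows))
assert scalar_filter_sum(rows) == vectorized_filter_sum(prices, qtys)  # 23
```

The batch size (X100 called these vectors, typically around a thousand values) is chosen so a vector fits in CPU cache while still amortizing per-operator overhead.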
# NoSQL
NoSQL databases trade relational guarantees for horizontal scalability and flexible schemas.
- Bigtable: A Distributed Storage System for Structured Data, 2006
- Dynamo: Amazon’s Highly Available Key-value Store, 2007
# Schema Evolution
Schema Evolution addresses how databases handle changes to data structures over time.
- 1995: A survey of schema versioning issues for database systems by John F. Roddick ^eb6505
- 2012: Automating the database schema evolution process
# Data Architecture & Governance
Patterns for organizing data assets, governance, and discovery across the enterprise.
- Data Mesh: How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
- Data Catalog: Ground: A Data Context Service
- Semantic Layer: Measures in SQL by Julian Hyde and John Fremlin. Related: Extending SQL for analytics ^f00ba4
- Analytics Development Lifecycle (ADLC): The Analytics Development Lifecycle (ADLC) by dbt
# Git for Data
Git for Data applies version-control concepts to datasets and data pipelines.
- Reproducible data science over data lakes: Replayable data pipelines with Bauplan and Nessie ^a13d97
- Git for Data Paper by XetData: Proposes a system that extends Git to efficiently handle terabyte-scale machine learning datasets through content-defined chunking and deduplication. ^2e514f
- Building a Correct-by-Design Lakehouse: Bauplan’s paper on Git for Data and building a lakehouse, covering Data Contracts, Versioning, and Transactional Pipelines for Humans and Agents
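The content-defined chunking mentioned in the XetData paper places chunk boundaries where a hash of the content matches a pattern, so boundaries depend on content rather than byte offsets and survive insertions. A toy sketch with a deliberately naive hash (real systems use Rabin fingerprints or gear hashing over a sliding window):

```python
def cdc_chunks(data: bytes, mask_bits=6, min_size=8):
    """Toy content-defined chunking: declare a boundary wherever the low
    `mask_bits` bits of a running hash are zero, giving an expected chunk
    size of about 2**mask_bits bytes. The hash here restarts per chunk
    and is not a true sliding-window fingerprint."""
    mask = (1 << mask_bits) - 1
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * 31 + b) % (1 << 32)
        if i - start + 1 >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

data = bytes(range(200)) * 3
chunks = cdc_chunks(data)
assert b"".join(chunks) == data          # lossless reassembly
```

Because identical content regions produce identical chunks regardless of where they sit in the file, deduplication across dataset versions falls out naturally: store each chunk once, keyed by its hash.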
# Data Visualization
Data visualization is essential in data engineering and BI.
- M4: A Visualization-Oriented Time Series Data Aggregation: visual analysis of high-volume time series data is ubiquitous in many industries, including finance, banking, and discrete manufacturing. By SAP Dresden and Volker Markl (TU Berlin). ^d2b91d
Not a Whitepaper, but Related
- [[Grammar of Graphics]]: The Grammar of Graphics approach, popularized by tools like ggplot2 in R and Vega-Lite in JavaScript, is considered declarative.
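The M4 aggregation above keeps only the first, last, minimum, and maximum point per pixel column, which is enough to render a visually faithful line chart from far fewer points. A minimal sketch (function and parameter names are mine, not the paper's):

```python
def m4(points, width):
    """M4 downsampling sketch: split the time domain into `width` pixel
    buckets and keep the first, last, min, and max point of each bucket.
    `points` must be (time, value) pairs sorted by time."""
    if not points:
        return []
    t0, t1 = points[0][0], points[-1][0]
    span = (t1 - t0) or 1
    buckets = {}
    for t, v in points:
        px = min(int((t - t0) * width / span), width - 1)
        b = buckets.setdefault(px, {"first": (t, v), "min": (t, v),
                                    "max": (t, v), "last": (t, v)})
        b["last"] = (t, v)
        if v < b["min"][1]:
            b["min"] = (t, v)
        if v > b["max"][1]:
            b["max"] = (t, v)
    out = set()
    for b in buckets.values():
        out.update(b.values())          # at most 4 points per bucket
    return sorted(out)

points = [(t, (t * 7) % 13) for t in range(100)]
reduced = m4(points, width=10)
assert len(reduced) <= 4 * 10 and set(reduced) <= set(points)
```

The key insight of the paper is that min/max alone are not enough: first and last per bucket are needed to connect adjacent pixel columns with the correct line segments.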
# Database Extensibility & Research
Academic and industry research on database design, extensibility, and human-data interaction.
- Anarchy in the Database: A Survey and Evaluation of Database Management System Extensibility
- HILDA: Human-in-the-Loop Data Analysis: A Personal Perspective ^678997
- Bluesky: The AT Protocol: Usable Decentralized Social Media with Martin Kleppmann ^039de7
# AI Related Papers
# Other Lists
- Schedule - CMU 15-721 :: Advanced Database Systems (Spring 2023): see the papers linked to each presentation.
- Data Engineer Handbook Whitepapers
Origin: Data Engineering Vault
