Data Engineering Whitepapers
A curated list of influential whitepapers in the field of data engineering.
- Data Lakehouse: Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics ^6a7f75
- Data Catalog: Ground: A Data Context Service
- Apache Spark: Spark: Cluster Computing with Working Sets
- Data Engineering Architecture: The Google File System
- Streaming: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
- Google File System (GFS): The Google File System ^fdbf43
- MapReduce: MapReduce: Simplified Data Processing on Large Clusters
- Data Warehousing: Dremel: Interactive Analysis of Web-Scale Datasets
- Data Mesh: How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
- DuckDB: MotherDuck: DuckDB in the cloud and in the client. A paper that introduces the 1-5-Tier Architecture. ^9f7136
- Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age
- Unnesting Arbitrary Queries: Unnesting subqueries in SQL.
- SQL - Spanner: Becoming a SQL System: Google Spanner evolved from a distributed key-value store into a full SQL database system ^786591
- Analytics Development Lifecycle (ADLC): The Analytics Development Lifecycle (ADLC) by dbt
- Vectorized Engine: Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask. ^db4178
- Redshift Files: Why TPC Is Not Enough: An Analysis of the Amazon Redshift Fleet ^2fbdb2
- Dremel Encoding: Dremel: Interactive Analysis of Web-Scale Datasets (2011). Its encoding is used in Apache Parquet and Apache Drill (whose code seeded Apache Arrow). ^51fc46
- Bluesky: The AT Protocol: Usable Decentralized Social Media with Martin Kleppmann ^039de7
- HILDA: Human-in-the-Loop Data Analysis: A Personal Perspective ^678997
- Schema Evolution:
- NoSQL
- Relational Model:
- A Relational Model of Data for Large Shared Data Banks, 1970 by Edgar F. Codd ^8d1ede
- C-Store: A Column-oriented DBMS: Organizing data by columns rather than rows. 2005 ^fe90b8
- MonetDB/X100: Hyper-Pipelining Query Execution. The beginning of Vectorized Query Execution by MonetDB ^4e54dd
# AI Related Papers
- LLM:
- Anthropic: Constitutional AI: Harmlessness from AI Feedback
- ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent, by Google
- PaLM 2 Technical Report - Google - Demonstrates advanced capabilities in processing structured data and numerical reasoning
- RAG:
Find more useful content on Data Engineering Vault.
Origin:
References:
Created 2024-01-05