Data Engineering Whitepapers
A curated list of influential whitepapers in the field of data engineering.
- Data Lakehouse: Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics ^6a7f75
- Data Catalog: Ground: A Data Context Service
- Apache Spark: Spark: Cluster Computing with Working Sets
- Data Engineering Architecture: The Google File System
- Streaming: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
- Google File System (GFS): The Google File System ^fdbf43
- MapReduce: MapReduce: Simplified Data Processing on Large Clusters
- Data Warehousing: Dremel: Interactive Analysis of Web-Scale Datasets
- Data Mesh: How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
- DuckDB:
  - MotherDuck: DuckDB in the cloud and in the client - A paper that introduces the 1.5-Tier Architecture. ^9f7136
- Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age
- Unnesting Arbitrary Queries: A technique for unnesting arbitrary subqueries in SQL.
- Analytics Development Lifecycle (ADLC): The Analytics Development Lifecycle (ADLC) by dbt
- Vectorized Engine: Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask. ^db4178
- Redshift Files: Why TPC Is Not Enough: An Analysis of the Amazon Redshift Fleet ^2fbdb2
- Dremel Encoding: Dremel: Interactive Analysis of Web-Scale Datasets. Its record shredding and assembly encoding for nested data is used in Apache Parquet. ^51fc46
- Bluesky: Bluesky and the AT Protocol: Usable Decentralized Social Media, co-authored by Martin Kleppmann ^039de7
- HILDA: Human-in-the-Loop Data Analysis: A Personal Perspective ^678997
- Schema Evolution:
- NoSQL
# AI-Related Papers
- LLM:
- Anthropic: Constitutional AI: Harmlessness from AI Feedback
- ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent by Google
- PaLM 2 Technical Report - Google - Demonstrates advanced capabilities in processing structured data and numerical reasoning
- RAG: Retrieval-Augmented Generation for Large Language Models: A Survey
Find more useful content on Data Engineering Vault.
Created 2024-01-05