🧠 Second Brain



Data Fusion

Last updated Feb 9, 2024

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

DataFusion supports both an SQL and a DataFrame API for building logical query plans as well as a query optimizer and execution engine capable of parallel execution against partitioned data sources (CSV and Parquet) using threads.
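To make "parallel execution against partitioned data sources using threads" concrete, here is a minimal conceptual sketch in plain Rust (standard library only). It is not DataFusion's implementation or API. The `Row` type and the `parallel_group_sum` function are hypothetical names invented for this illustration. Each partition is aggregated on its own thread, then the partial results are merged, which is roughly the shape of a partitioned `GROUP BY ... SUM`.

```rust
use std::collections::HashMap;
use std::thread;

/// One row of a hypothetical two-column table: (group key, value).
type Row = (String, i64);

/// Sum `value` per `key`, processing each partition on its own thread.
/// Illustrative only: a real engine works on Arrow record batches and
/// schedules work on a thread pool rather than one thread per partition.
fn parallel_group_sum(partitions: Vec<Vec<Row>>) -> HashMap<String, i64> {
    // Phase 1: one worker per partition, each building a partial aggregate.
    let handles: Vec<thread::JoinHandle<HashMap<String, i64>>> = partitions
        .into_iter()
        .map(|partition| {
            thread::spawn(move || {
                let mut partial = HashMap::new();
                for (key, value) in partition {
                    *partial.entry(key).or_insert(0) += value;
                }
                partial
            })
        })
        .collect();

    // Phase 2: merge the partial aggregates into the final result.
    let mut total = HashMap::new();
    for handle in handles {
        let partial = handle.join().expect("worker thread panicked");
        for (key, value) in partial {
            *total.entry(key).or_insert(0) += value;
        }
    }
    total
}
```

Calling `parallel_group_sum` with one `Vec<Row>` per file or file chunk returns a single map of per-key sums. The two-phase split (partial aggregate, then merge) is what lets the partitions be processed independently.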

DataFusion also supports distributed query execution via the Ballista (Arrow) crate.

# Use Cases

DataFusion itself runs on a single node; it is not distributed (which may be why it was donated to Apache Arrow as an in-memory technology). If you want to distribute it, you can use Ray (see Ray.io - Apache Spark Alternative for Python), as Andy Grove said in Andy Grove, or Ballista (Arrow).

According to Denny, it's so fast that you almost don't need distribution, but you can easily get it with Ray. Also, because delta-rs, for example, is built in Rust, it's small enough that he can run it inside a Lambda function, which makes it extremely powerful. Compare that with the Java/JVM version, which comes with a lot of overhead and is, I guess, one reason Spark is so complex.

From the GitHub Repo:
DataFusion can be used without modification as an embedded SQL engine or can be customized and used as a foundation for building new systems. Here are some examples of systems built using DataFusion:

By using DataFusion, the projects are freed to focus on their specific features, and avoid reimplementing general (but still necessary) features such as an expression representation, standard optimizations, execution plans, file format support, etc.

# Why DataFusion?

# Comparisons with other projects

Here is a comparison with similar projects that may help understand when DataFusion might be suitable and unsuitable for your needs:

# Who is using it?

More from known-users

# Apple and DataFusion

Apple donated Comet, a replacement for Apache Spark's execution engine built on DataFusion.

TIL about Apache DataFusion Comet. Apple has replaced Spark's guts with Arrow DataFusion. And they're donating it. This is an alternative to @MetaOpenSource's Facebook Velox Spark implementation. Tweet by Chris Riccomini.

Initial PR by sunchao · Pull Request #1 · apache/arrow-datafusion-comet · GitHub and Apache Mail Archives

References: ROAPI, Apache Arrow, Rust
Created: 2021-10-14