🧠 Second Brain

Ballista (Arrow)

Last updated Feb 20, 2024

Ballista: Distributed SQL Query Engine, built on Apache Arrow. Competes with Apache Spark. It’s basically a scheduler for DataFusion on top of Apache Arrow.

Ballista is a distributed compute platform primarily implemented in Rust and powered by Apache Arrow. It is built on an architecture that allows other programming languages to be supported as first-class citizens without paying a penalty for serialization costs.

The foundational technologies in Ballista are:

- Apache Arrow memory model and compute kernels for efficient processing of data.
- Apache Arrow Flight Protocol for efficient data transfer between processes.
- Google Protocol Buffers for serializing query plans.
- DataFusion for query execution.

# How it works

Ballista has a scheduler and an executor process that are standard Rust executables and can be executed directly, but Dockerfiles are provided to build images for use in containerized environments, such as Docker, Docker Compose, and Kubernetes. See the deployment guide for more information.
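To make the scheduler/executor split more concrete, here is a toy sketch in plain Python. It is not Ballista code and does not use Ballista's API: `run_task` and `schedule_average` are hypothetical names, and a thread pool stands in for separate executor processes. It only illustrates the general idea of a scheduler handing per-partition tasks to executors and merging their partial results.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-in for an executor: it runs one task
# (a partial aggregation) over a single partition of the data.
def run_task(partition):
    return sum(partition), len(partition)


# Hypothetical stand-in for the scheduler: it hands one task per partition
# to a pool of workers and merges the partial results into the final answer.
def schedule_average(partitions, num_executors=2):
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partials = list(pool.map(run_task, partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


partitions = [[1, 2, 3], [4, 5], [6]]
average = schedule_average(partitions)
print(average)
```

In a real deployment the executors are separate processes (possibly on other machines) and the partial results travel over the network, which is where Arrow Flight comes in.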

SQL and DataFrame queries can be submitted from Python and Rust, and SQL queries can be submitted via the Apache Arrow Flight Protocol SQL JDBC driver, supporting your favorite JDBC-compliant tools such as DataGrip or Tableau. For setup instructions, please see the FlightSQL guide.

# How does Ballista (Arrow-Rust) compare to Apache Spark?

from RW Overview — Arrow DataFusion Documentation

Although Ballista is largely inspired by Apache Spark, there are some key differences.

- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of GC pauses.
- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still largely row-based today.
- The combination of Rust and Apache Arrow provides excellent memory efficiency, and memory usage can be 5x - 10x lower than Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of distributed compute.
- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead.
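The row-based versus columnar distinction can be illustrated with a small sketch using only the Python standard library. This is an illustration of the layout idea, not Arrow itself: the table contents and column names are made up, and `array` is used as a rough stand-in for a contiguous, typed Arrow buffer.

```python
from array import array

# Row-based layout: each record is stored together, so reading one field
# means visiting every record and picking the field out of it.
rows = [
    {"id": 1, "price": 10.0, "qty": 2},
    {"id": 2, "price": 20.0, "qty": 1},
    {"id": 3, "price": 30.0, "qty": 4},
]
row_total = sum(r["price"] for r in rows)

# Columnar layout: each field is one contiguous, typed buffer, so an
# aggregation over "price" scans a single dense array of doubles.
columns = {
    "id": array("q", [1, 2, 3]),
    "price": array("d", [10.0, 20.0, 30.0]),
    "qty": array("q", [2, 1, 4]),
}
col_total = sum(columns["price"])

print(row_total, col_total)
```

Both layouts give the same answer. The columnar one keeps same-typed values next to each other, which is what makes vectorized processing and compression practical.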

Another great read: RW Delta Lake Without Spark (Delta-Rs). Innovation, Cost Savings, and Other Such Matters. - Confessions of a Data Guy

# Ballista Roadmap

Contrary to what the docs say (RW Roadmap), Ballista was once part of the DataFusion repo but has since been split out again. Andy Grove, creator of DataFusion and Ballista, says (2023-02-22):

Ballista was part of the DataFusion repo and got moved out a while back so that each project could focus on the needs of the user base. DataFusion is a framework for in-process query engines that many products are now building on top of. Ballista is an end-user distributed system (although some companies are also using this as a foundation for new systems).

I am hoping that it will soon be pretty seamless to switch between DataFusion and Ballista when using these systems from Python by just changing the imports, but for now they are quite separate systems.

# Running Queries

Example with Python: sqlbench-runners/sqlbench-ballista.py at main · sql-benchmarks/sqlbench-runners · GitHub.

Origin: David Gasquez
References: apache/arrow-ballista
Created 2022-10-20