Apache Spark
Apache Spark™ is a multi-language engine (Cluster-computing frameworks) for executing Data Engineering, Data Science, and Machine Learning on single-node machines or clusters.
Related: Spark on Kubernetes.
Snowflake tried to build similar features with Snowpark.
# Spark Engines
# Improvements for Small Queries
SPIP: Faster queries in local laptop mode for Apache Spark:
Project Feather: Faster queries in local laptop mode for Apache Spark
Did you know that Apache Spark’s latest proposal, Project Feather, introduces three major improvements for data processing on local laptops? They’re pretty straightforward: query compilation and task scheduling, an Arrow-based df.cache, and shuffle-free execution for single-node queries. Early prototypes are showing ~2x speedups on small data.
Spark’s architecture was designed for petabytes. Task scheduling, shuffle planning, and execution assume a cluster. On one machine with a few thousand rows, that overhead dominates. Project Feather is a new SPIP proposing to fix this.
The interesting thing isn’t that Spark can compete with DuckDB or Polars at small-data speed. It’s that a lot of developers want to start small and scale up without switching tools. Feather makes that a viable pattern for engineers who want to do things like develop locally with agents, or take advantage of the full Spark ecosystem without compromising on performance when prototyping.
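A minimal sketch of the "start small, scale up" pattern as it can be written with today's PySpark, not with anything Project Feather itself adds. The helper names (`session_conf`, `build_session`, `top_customers`), the mode labels, and the column names are hypothetical. The configuration keys are standard Spark settings, but the chosen values are only illustrative guesses for small data.

```python
def session_conf(mode: str) -> dict:
    """Return Spark settings for a hypothetical 'laptop' or 'cluster' mode."""
    if mode == "laptop":
        return {
            "spark.master": "local[*]",           # driver and executors share one JVM
            "spark.sql.shuffle.partitions": "4",  # default is 200, too many for a few thousand rows
            "spark.ui.enabled": "false",          # skip starting the web UI
        }
    if mode == "cluster":
        # Leave spark.master unset so spark-submit or the cluster manager supplies it.
        return {"spark.sql.shuffle.partitions": "200"}
    raise ValueError(f"unknown mode: {mode!r}")


def build_session(mode: str):
    # Imported lazily so the config helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("local-to-cluster-sketch")
    for key, value in session_conf(mode).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def top_customers(df, n: int = 3):
    """Job logic that stays identical in both modes (assumed columns: customer, amount)."""
    return (
        df.groupBy("customer")
        .sum("amount")
        .withColumnRenamed("sum(amount)", "total")
        .orderBy("total", ascending=False)
        .limit(n)
    )


if __name__ == "__main__":
    spark = build_session("laptop")
    rows = [("a", 10.0), ("b", 5.0), ("a", 2.5)]
    df = spark.createDataFrame(rows, ["customer", "amount"]).cache()
    top_customers(df).show()
    spark.stop()
```

Only `session_conf` changes between a laptop run and a cluster run, and `top_customers` is untouched. That is the property the paragraph above cares about: the overhead is tuned in configuration rather than by switching tools.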
The answer to Apache Spark Alternatives (?).
# Features
- Proposal for adding Measures: SPIP: Metrics & semantic modeling in Spark
# Spark Alternative
Origin:
