Apache Arrow

Last updated by Simon Späti

Apache Arrow is a language-independent, columnar in-memory format and a development platform for in-memory analytics. It comprises a set of technologies designed to speed up the processing and transfer of large datasets.

# Exploring the Arrow Libraries

Within the Arrow reference libraries, you’ll find an array of distinct software components:

  • Columnar vectors and table-like containers (akin to data frames), supporting both flat and nested types.
  • A rapid, language-neutral metadata messaging layer, leveraging Google’s Flatbuffers library.
  • Reference-counted off-heap buffer memory management, facilitating zero-copy memory sharing and adept handling of memory-mapped files.
  • IO interfaces for seamless interaction with both local and remote filesystems.
  • Self-describing binary wire formats, both streaming and batch/file-like, for efficient remote procedure calls (RPC) and interprocess communication (IPC).
  • Integration tests to ensure binary compatibility across implementations (e.g., transmitting data from Java to C++).
  • Capabilities for converting to and from other in-memory data formats.
  • Readers and writers compatible with various prevalent file formats, including Parquet and CSV.
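To make the first bullet more concrete, here is a minimal sketch of how a columnar vector of nullable strings can be laid out as three flat buffers: a validity bitmap, an offsets buffer, and a data buffer. It uses only the Python standard library and does not call the Arrow libraries themselves. The helper names `build_string_column` and `read_slot` are hypothetical, and the buffer padding and alignment that a real implementation applies are left out.

```python
import struct

def build_string_column(values):
    """Encode a list of str-or-None as validity, offsets and data buffers."""
    validity = bytearray((len(values) + 7) // 8)  # 1 bit per slot, least-significant bit first
    offsets = [0]                                 # n + 1 int32 offsets into the data buffer
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            validity[i // 8] |= 1 << (i % 8)      # set bit i: the slot holds a value
            data += value.encode("utf-8")
        offsets.append(len(data))                 # a null slot repeats the previous offset
    packed_offsets = struct.pack(f"<{len(offsets)}i", *offsets)
    return bytes(validity), packed_offsets, bytes(data)

def read_slot(validity, offsets_buf, data, i):
    """Read slot i straight from the buffers, without decoding the other slots."""
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * i)
    return data[start:end].decode("utf-8")

# Illustrative values only (assumption, not taken from the source).
validity, offsets_buf, data = build_string_column(["arrow", None, "flight"])
column = [read_slot(validity, offsets_buf, data, i) for i in range(3)]
```

Because every value of a column sits in one contiguous buffer, a reader can jump to any slot with simple offset arithmetic, which is what makes this layout friendly to memory mapping and zero-copy sharing.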

# Visual Comparisons: Before and After

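Arrow's goal parallels that of Substrait (Cross-Language Serialization for Relational Algebra): Substrait standardizes how query plans are exchanged between systems, while Arrow standardizes how the data itself is represented in memory.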

# Ecosystem

  • Arrow: Defines a language-agnostic, in-memory columnar format that lets systems share data without serialization (no costly conversions to formats like JSON or CSV).
  • Apache Arrow Flight (protocol): A high-performance RPC framework built on gRPC that transfers Arrow data between client and server at close to wire speed.
  • Arrow Flight SQL: A protocol that extends Flight to send and execute SQL queries.
  • ADBC (Arrow Database Connectivity): A standardized API for database interactions that makes it easier for developers to query and work with databases using Arrow-native data, with or without Flight SQL.

    Source
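The claim that data can be shared "without serialization" can be illustrated with a small sketch. It uses only the Python standard library (`array` and `memoryview`) as a stand-in for an Arrow buffer; it is not the Arrow API, and the values are made up. The in-place write exists only to show that both sides look at the same memory.

```python
import json
from array import array

# Producer: one int64 column stored as a single contiguous buffer.
column = array("q", [10, 20, 30, 40])
shared = memoryview(column)        # exposes the buffer without copying it

# Consumer: slicing the view selects rows 1..2 and still copies nothing.
window = shared[1:3]
column[1] = 99                     # producer-side write, visible through the view
zero_copy_rows = window.tolist()

# Contrast: a JSON round trip serializes to text, then parses a brand-new list.
json_rows = json.loads(json.dumps(column.tolist()))
column[2] = -1                     # later change to the source buffer
detached_value = json_rows[2]      # the JSON copy does not see that change
```

The view costs nothing per row to create, while the JSON path pays for encoding and decoding every value. Avoiding that per-value cost is the motivation behind the ecosystem components listed above.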

# History

Apache Arrow was announced by The Apache Software Foundation on February 17, 2016, with development led by a coalition of developers from other open source data analytics projects. The initial codebase and Java library were seeded by code from Apache Drill.

# Versioning Insights

# v6.0.0


Source: RW Apache Arrow High-Performance Columnar Data Framework

Read more on Data Lake File Formats.
