
Apache Arrow

Last updated by Simon Späti

-> Enhancing In-Memory Analytics. An In-Memory Format.

Apache Arrow serves as a robust development platform for in-memory analytics. It encompasses a suite of technologies designed to expedite the processing and transfer of big data.

# Exploring the Arrow Libraries

Within the Arrow reference libraries, you’ll find an array of distinct software components:

  • Columnar vectors and table-like containers (akin to data frames), supporting both flat and nested types.
  • A rapid, language-neutral metadata messaging layer, leveraging Google’s Flatbuffers library.
  • Reference-counted off-heap buffer memory management, facilitating zero-copy memory sharing and adept handling of memory-mapped files.
  • IO interfaces for seamless interaction with both local and remote filesystems.
  • Self-describing binary wire formats, both streaming and batch/file-like, for efficient remote procedure calls (RPC) and interprocess communication (IPC).
  • Integration tests to ensure binary compatibility across implementations (e.g., transmitting data from Java to C++).
  • Capabilities for converting to and from other in-memory data formats.
  • Readers and writers compatible with various prevalent file formats, including Parquet and CSV.

# Visual Comparisons: Before and After

This project shares objectives with Substrait, a cross-language serialization format for relational algebra.

# Versioning Insights

# v6.0.0


Source: RW Apache Arrow High-Performance Columnar Data Framework

Read more on Data Lake File Formats.

# History

Apache Arrow was announced by The Apache Software Foundation on February 17, 2016, with development led by a coalition of developers from other open source data analytics projects. The initial codebase and Java library were seeded by code from Apache Drill.
