Apache Arrow
-> Enhancing in-memory analytics; one of the In-Memory Formats.
Apache Arrow is a development platform for in-memory analytics. It provides a suite of technologies that enable big data systems to process and move data quickly.
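As a rough illustration of that shared in-memory format, here is a minimal sketch using the Python bindings (pyarrow) together with pandas; the column names and values are invented for this example, not taken from the note.

```python
import pyarrow as pa

# Build a table in Arrow's standard columnar in-memory format
table = pa.table({"id": [1, 2, 3], "score": [0.5, 0.7, 0.9]})

# Hand the data to pandas; Arrow is designed so conversions like this are
# cheap and, for some column types, zero-copy
df = table.to_pandas()

# And back again: the pandas DataFrame re-enters the Arrow format
roundtrip = pa.Table.from_pandas(df)
print(roundtrip.schema)
```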
# Exploring the Arrow Libraries
Within the Arrow reference libraries, you’ll find a number of distinct software components (see the sketch after this list):
- Columnar vectors and table-like containers (akin to data frames), supporting both flat and nested types.
- A rapid, language-neutral metadata messaging layer, leveraging Google’s Flatbuffers library.
- Reference-counted off-heap buffer memory management, facilitating zero-copy memory sharing and adept handling of memory-mapped files.
- IO interfaces for seamless interaction with both local and remote filesystems.
- Self-describing binary wire formats, both streaming and batch/file-like, for efficient remote procedure calls (RPC) and interprocess communication (IPC).
- Integration tests to ensure binary compatibility across implementations (e.g., transmitting data from Java to C++).
- Capabilities for converting to and from other in-memory data formats.
- Readers and writers compatible with various prevalent file formats, including Parquet and CSV.
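Several of these components can be touched from one short sketch. The following uses the Python bindings (pyarrow); the file names example.arrow and example.parquet, and the column data, are placeholders chosen for illustration.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A columnar, table-like container (flat types here; nested types work too)
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Self-describing IPC stream format: serialize record batches into an
# in-memory buffer, e.g. to hand them to another process or over a socket
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
received = pa.ipc.open_stream(sink.getvalue()).read_all()
assert received.equals(table)

# IPC file format plus memory mapping: the reader's buffers point into the
# memory-mapped file, illustrating zero-copy handling of memory-mapped data
with pa.OSFile("example.arrow", "wb") as f:
    with pa.ipc.new_file(f, table.schema) as writer:
        writer.write_table(table)
with pa.memory_map("example.arrow", "r") as source:
    mapped = pa.ipc.open_file(source).read_all()

# Readers and writers for common file formats such as Parquet
pq.write_table(table, "example.parquet")
restored = pq.read_table("example.parquet")
```

The IPC stream here is the same wire format used for RPC/interprocess communication, and the memory-mapped read leans on the reference-counted off-heap buffer management mentioned above.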
# Visual Comparisons: Before and After
This project mirrors the objectives of Substrait (Cross-Language Serialization for Relational Algebra).
# Versioning Insights
# v6.0.0
Source: RW Apache Arrow High-Performance Columnar Data Framework
Read more on Data Lake File Formats.
# History
Apache Arrow was announced by The Apache Software Foundation on February 17, 2016, with development led by a coalition of developers from other open source data analytics projects. The initial codebase and Java library were seeded by code from Apache Drill.
References:
- ROAPI
- VertiPaq, a similar but closed-source engine
Created: 2021-10-14