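Apache Arrow: a cross-platform data layer that speeds up big-data analytics projects
Apache Arrow is a cross-platform data layer that speeds up big-data analytics projects. It is a columnar in-memory analytics layer designed to accelerate big data. It includes a set of canonical in-memory representations of flat and hierarchical data, along with bindings in multiple languages for manipulating those structures. It also provides IPC and implementations of common algorithms.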
Java · Scientific computing and analysis
6174 stars in total
Details
Apache Arrow
Powering Columnar In-Memory Analytics
Arrow is a set of technologies that enable big-data systems to process and move data fast.
Initial implementations include:
Arrow is an Apache Software Foundation project. Learn more at arrow.apache.org.
What's in the Arrow libraries?
The reference Arrow implementations contain a number of distinct software components:
- Columnar vector and table-like containers (similar to data frames) supporting flat or nested types
- Fast, language agnostic metadata messaging layer (using Google's Flatbuffers library)
- Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files
- Low-overhead IO interfaces to files on disk, HDFS (C++ only)
- Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)
- Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)
- Conversions to and from other in-memory data structures (e.g. Python's pandas library)
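To make the first and third components above more concrete, here is a minimal sketch of a nullable columnar vector in plain Python. It is not the Arrow API: the `Int32Column` class and its methods are hypothetical names invented for illustration, and the standard-library `array` and `memoryview` types stand in for Arrow's off-heap buffers. It only illustrates the general idea of a contiguous values buffer paired with a validity bitmap (one bit per slot, least-significant bit first, 1 meaning valid), and of slicing a buffer without copying it.

```python
import array


class Int32Column:
    """Toy nullable integer column (hypothetical, not an Arrow class)."""

    def __init__(self, values):
        self.length = len(values)
        # Values buffer: contiguous and fixed-width; null slots hold a placeholder 0.
        # Assumption: the platform's C int ("i") is 32 bits wide, as is typical.
        self.values = array.array("i", [0 if v is None else v for v in values])
        # Validity bitmap: one bit per slot, least-significant bit first, 1 = valid.
        self.validity = bytearray((self.length + 7) // 8)
        for i, v in enumerate(values):
            if v is not None:
                self.validity[i // 8] |= 1 << (i % 8)

    def is_valid(self, i):
        return bool(self.validity[i // 8] & (1 << (i % 8)))

    def __getitem__(self, i):
        # Return None for null slots instead of the placeholder value.
        return self.values[i] if self.is_valid(i) else None

    def null_count(self):
        # Padding bits past the end are 0, so only real valid slots are counted.
        set_bits = sum(bin(byte).count("1") for byte in self.validity)
        return self.length - set_bits


col = Int32Column([1, None, 3, 4, None])

# A memoryview slice shares memory with the values buffer instead of copying it.
view = memoryview(col.values)[2:4]

print([col[i] for i in range(col.length)])
print(col.null_count(), view.tolist())
```

Keeping nulls in a separate bitmap leaves the values buffer dense and fixed-width, which is what lets a consumer hand out slices of it without copying.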
Getting involved
Right now the primary audience for Apache Arrow is the developers of data systems; most people will use Apache Arrow indirectly, through systems that use it for internal data handling and for interoperating with other Arrow-enabled systems.
Even if you do not plan to contribute to Apache Arrow itself or Arrow integrations in other projects, we'd be happy to have you involved:
- Join the mailing list: send an email to dev-subscribe@arrow.apache.org. Share your ideas and use cases for the project.
- Follow our activity on JIRA
- Learn the format
- Contribute code to one of the reference implementations