Charlas - Raúl Cumplido: Apache Arrow - El format columnar! Lo cualo?
Learn about Apache Arrow, a high-performance columnar memory format that enables fast analytics and efficient data transfer between systems and programming languages.
Apache Arrow is a columnar in-memory format designed to speed up analytical algorithms and to make data transfer between systems and languages more efficient.
Key benefits of the columnar format include:
- Better compression ratios, because values of the same type are stored next to each other
- Efficient SIMD operations, since contiguous values map directly onto CPU vector instructions
- No need to copy or convert data between systems that understand the Arrow format
- Improved performance for analytical operations
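To make the row-versus-column distinction concrete, here is a minimal sketch in plain Python. It does not use the Arrow library; the `rows` and `columns` names and the sample values are made up for illustration, and the standard-library `array` module stands in for a contiguous typed buffer.

```python
from array import array

# Row-oriented: each record is a separate object holding all of its fields.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 12.0},
    {"id": 3, "price": 7.25},
]

# Column-oriented: each field lives in its own contiguous, typed buffer.
columns = {
    "id": array("q", [1, 2, 3]),             # 64-bit signed integers
    "price": array("d", [9.5, 12.0, 7.25]),  # 64-bit floats
}

# An aggregate over one field has to visit every record in the row layout...
row_total = sum(row["price"] for row in rows)

# ...but reads a single contiguous buffer in the columnar layout.
col_total = sum(columns["price"])

# The price column is 3 values of 8 bytes each, stored back to back.
price_bytes = len(columns["price"]) * columns["price"].itemsize
```

Both totals are the same; the difference is in how much unrelated data has to be touched to compute them, which is what makes the columnar layout friendlier to caches, compression, and vector instructions.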
Arrow’s memory layout consists of:
- Validity bitmap buffers, with one bit per value, that mark which values are valid and which are null
- Offset buffers that record where each value starts and ends in variable-length data types
- Value buffers containing the actual data
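The three buffers listed above can be sketched by hand for a small string column. This is an illustration in plain Python, not the Arrow library itself: the helper names `encode_strings` and `decode_strings` are hypothetical, and details such as buffer padding and alignment are left out. It assumes the conventions of the Arrow columnar format: a set bit means "valid", bits are numbered from the least significant bit, and offsets are little-endian 32-bit integers with one more entry than there are values.

```python
import struct

def encode_strings(values):
    """Encode a list of str-or-None into (validity, offsets, data) buffers."""
    validity = bytearray((len(values) + 7) // 8)  # one bit per slot
    offsets = [0]                                 # n + 1 entries
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            validity[i // 8] |= 1 << (i % 8)      # set bit = slot is valid
            data += value.encode("utf-8")
        offsets.append(len(data))                 # a null repeats the previous offset
    packed_offsets = struct.pack(f"<{len(offsets)}i", *offsets)
    return bytes(validity), packed_offsets, bytes(data)

def decode_strings(validity, offsets, data):
    """Rebuild the Python list from the three buffers."""
    count = len(offsets) // 4 - 1
    offs = struct.unpack(f"<{count + 1}i", offsets)
    out = []
    for i in range(count):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        out.append(data[offs[i]:offs[i + 1]].decode("utf-8") if is_valid else None)
    return out

validity, offsets, data = encode_strings(["arrow", None, "py"])
```

For this input the value buffer is simply the bytes of `arrowpy`, the offsets are 0, 5, 5, 7, and the validity bitmap has bits 0 and 2 set. The null slot takes no space in the value buffer.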
Supported data types include:
- Fixed-size primitives (integers, floats)
- Variable-length types (strings, binary)
- Nested types (lists, structs)
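Nested types reuse the same building blocks. As a rough sketch, a column of integer lists can be stored as one offsets buffer plus one flat child buffer holding every element. The helpers `encode_int_lists` and `get_list` are hypothetical names for this illustration, written in plain Python rather than with the Arrow library, and null handling is omitted.

```python
from array import array

def encode_int_lists(lists):
    """Flatten a list of int lists into (offsets, child_values)."""
    offsets = array("i", [0])  # one more entry than there are lists
    child = array("i")         # every element of every list, back to back
    for items in lists:
        child.extend(items)
        offsets.append(len(child))
    return offsets, child

def get_list(offsets, child, i):
    """Return the i-th list by slicing the child values between two offsets."""
    return list(child[offsets[i]:offsets[i + 1]])

offsets, child = encode_int_lists([[1, 2], [], [3]])
```

An empty list costs only a repeated offset, and reading any one list is a slice of the child buffer rather than a pointer chase through separate objects.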
Major implementations and ecosystem:
- Official libraries in multiple languages (C++, Python, R, etc.)
- Integration with popular frameworks such as pandas, Dask, and Spark
- Used by projects such as OpenTelemetry for data transfer
- PyArrow is among the top 50 most downloaded PyPI packages
Arrow has become the de facto standard for:
- In-memory analytics
- Cross-language data sharing
- Data transfer between systems
- Working with columnar file formats such as Parquet
Performance improvements come from:
- Zero-copy data sharing
- Native CPU vectorization support
- Efficient compression of columnar data
- Direct memory access without serialization overhead
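Zero-copy sharing can be illustrated with Python's built-in `memoryview`, which hands out a view onto an existing buffer instead of duplicating it. This is only an analogy, within a single process, for what Arrow does across libraries and languages; the variable names are made up for the example.

```python
from array import array

# A producer owns a contiguous buffer of 64-bit integers.
column = array("q", range(1_000))

# A consumer receives a view onto the same memory; no bytes are copied.
view = memoryview(column)

# Slicing the view is also zero-copy: it only records a new start and length.
window = view[10:20]

# Slicing the array itself, by contrast, allocates a new array and copies.
copied = column[10:20]

# A write through the producer is visible through the view, not through the copy.
column[10] = -1
```

After the write, `window[0]` reflects the new value while `copied[0]` still holds the old one, which shows that the view and the producer share the same bytes. Avoiding that copy, along with any serialization step, is where the savings come from when the buffers are large.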