Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the official installation instructions.

595 questions
5 votes · 1 answer

How to use Apache Arrow IPC from multiple processes (possibly from different languages)?

I'm not sure where to begin, so looking for some guidance. I'm looking for a way to create some arrays/tables in one process, and have it accessible (read-only) from another. So I create a pyarrow.Table like this: a1 = pa.array(list(range(3))) a2 =…
suvayu
5 votes · 1 answer

R Arrow returns wrong column when multiple group_by / summarise

I have a query that has multiple group_by/summarise statements. When I ungroup the data in between, everything works fine, but if I don't, one of the columns is replaced by another. I would expect the columns not to be changed. For example in the…
David
5 votes · 0 answers

How to add a column with an index to an apache arrow dataset in R?

I'm trying to add an index to a dataset which is too large to fit in RAM. The tidyverse way of adding an index would be: library(tidyverse) df = mtcars df |> mutate(row_id = 1:nrow(cyl)) # any column name in the df The dplyr backend for Arrow doesn't…
David Budzynski
5 votes · 1 answer

Is there an Apache Arrow equivalent of the Spark Pandas UDF?

Spark provides a few different ways to implement UDFs that consume and return Pandas DataFrames. I am currently using the cogrouped version that takes two (co-grouped) Pandas DataFrames as input and returns a third. For efficient translation between…
5 votes · 1 answer

Is it possible to append rows to an existing Arrow (PyArrow) Table?

I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators it's said Table columns in Arrow C++ can be chunked, so that appending to a…
astrojuanlu
5 votes · 2 answers

PyArrow: How to copy files from local to remote using new filesystem interface?

Could somebody give me a hint on how I can copy a file from a local filesystem to an HDFS filesystem using PyArrow's new filesystem interface (i.e. upload, copyFromLocal)? I have read the documentation back and forth, and tried a few things out…
Andor
5 votes · 0 answers

SparkR code fails if Apache Arrow is enabled

I am running the gapply function on a SparkR DataFrame which looks like below df<-gapply(sp_Stack, function(key,e) { Sys.setlocale('LC_COLLATE','C') suppressPackageStartupMessages({ library(Rcpp) library(Matrix) …
5 votes · 2 answers

Create parquet file directory from CSV file in R

I'm running into more and more situations where I need out-of-memory (OOM) approaches to data analytics in R. I am familiar with other OOM approaches, like sparklyr and DBI but I recently came across arrow and would like to explore it more. The…
Trent
5 votes · 1 answer

Use Arrow schema with parquet StreamWriter

I am attempting to use the C++ StreamWriter class provided by Apache Arrow. The only example of using StreamWriter uses the low-level Parquet API i.e. parquet::schema::NodeVector fields; fields.push_back(parquet::schema::PrimitiveNode::Make( …
hyperdelia
5 votes · 0 answers

Apache-arrow JS implementation

I have a MEAN stack application that connects to customer databases and third-party data. From JS front end I need to be able to read parquet and big-data CSV files. In this regard please clarify my understanding : I cannot read parquet file using…
user14013917
5 votes · 1 answer

Do memory mapped files in Docker containers in Kubernetes work the same as in regular processes in Linux?

I have process A and process B. Process A opens a file, calls mmap and writes to it; process B does the same but reads the same mapped region when process A has finished writing. Using mmap, process B is supposed to read the file from memory instead…
rboc
5 votes · 1 answer

PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger-than-memory parquet files)

I'm trying to write a large parquet file onto disk (larger than memory). I naively thought I could be clever and use ParquetWriter and write_table to incrementally write a file, like this (POC): import pyarrow as pa import pyarrow.parquet as pq import…
Niklas B
5 votes · 2 answers

How to get the arrow package for R with lz4 support?

The R package arrow installed with install.packages('arrow') does not have lz4 support: codec_is_available('lz4') # [1] FALSE The package version is: packageVersion('arrow') # [1] ‘0.17.1’ This is on Ubuntu 20.04. How can I get an R arrow package…
James Hirschorn
5 votes · 2 answers

Arrow + Java: Populate VectorSchemaRoot (from stream / file) | Memory-Ownership | Usage patterns

I'm doing very basic experiments with Apache Arrow, mostly in regards to passing some data between Java, C++, Python using Arrow's IPC format (to file), Parquet format (to file) and IPC format (stream through JNI). C++ and Python look somewhat…
sascha
5 votes · 1 answer

How can we store a hash table in Apache Arrow?

I am pretty new to Apache Arrow so this question may be ignorant. Apache Arrow provides the capability to store data structures like primitive types/struct/array in standardised memory format, I wonder if it is possible to store more complex data…
nybon