Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the official installation instructions.

595 questions
5 votes · 1 answer

How to use Apache Arrow IPC from multiple processes (possibly from different languages)?

I'm not sure where to begin, so looking for some guidance. I'm looking for a way to create some arrays/tables in one process, and have it accessible (read-only) from another. So I create a pyarrow.Table like this: a1 = pa.array(list(range(3))) a2 =…
suvayu
5 votes · 1 answer

R Arrow returns wrong column when multiple group_by / summarise

I have a query that has multiple group_by/summarise statements. When I ungroup the data in between, everything works fine, but if I don't, one of the columns is replaced by another. I would expect the columns not to be changed. For example in the…
David
5 votes · 0 answers

How to add a column with an index to an apache arrow dataset in R?

I'm trying to add an index to a dataset which is too large to fit in RAM. The tidyverse way of adding an index would be: library(tidyverse) df = mtcars df |> mutate(row_id = 1:nrow(cyl)) # any column name in the df The dplyr backend for Arrow doesn't…
David Budzynski
5 votes · 1 answer

Is there an Apache Arrow equivalent of the Spark Pandas UDF?

Spark provides a few different ways to implement UDFs that consume and return Pandas DataFrames. I am currently using the cogrouped version that takes two (co-grouped) Pandas DataFrames as input and returns a third. For efficient translation between…
5 votes · 1 answer

Is it possible to append rows to an existing Arrow (PyArrow) Table?

I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators it's said Table columns in Arrow C++ can be chunked, so that appending to a…
astrojuanlu
5 votes · 2 answers

PyArrow: How to copy files from local to remote using new filesystem interface?

Could somebody give me a hint on how I can copy a file from a local filesystem to an HDFS filesystem using PyArrow's new filesystem interface (i.e. upload, copyFromLocal)? I have read the documentation back and forth, and tried a few things out…
Andor
5 votes · 0 answers

SparkR code fails if Apache Arrow is enabled

I am running the gapply function on a SparkR DataFrame which looks like below df<-gapply(sp_Stack, function(key,e) { Sys.setlocale('LC_COLLATE','C') suppressPackageStartupMessages({ library(Rcpp) library(Matrix) …
5 votes · 2 answers

Create parquet file directory from CSV file in R

I'm running into more and more situations where I need out-of-memory (OOM) approaches to data analytics in R. I am familiar with other OOM approaches, like sparklyr and DBI but I recently came across arrow and would like to explore it more. The…
Trent
5 votes · 1 answer

Use Arrow schema with parquet StreamWriter

I am attempting to use the C++ StreamWriter class provided by Apache Arrow. The only example of using StreamWriter uses the low-level Parquet API i.e. parquet::schema::NodeVector fields; fields.push_back(parquet::schema::PrimitiveNode::Make( …
hyperdelia
5 votes · 0 answers

Apache-arrow JS implementation

I have a MEAN stack application that connects to customer databases and third-party data. From JS front end I need to be able to read parquet and big-data CSV files. In this regard please clarify my understanding : I cannot read parquet file using…
user14013917
5 votes · 1 answer

Do memory mapped files in Docker containers in Kubernetes work the same as in regular processes in Linux?

I have process A and process B. Process A opens a file, calls mmap and writes to it; process B does the same but reads the same mapped region when process A has finished writing. Using mmap, process B is supposed to read the file from memory instead…
rboc
5 votes · 1 answer

PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger-than-memory parquet files)

I'm trying to write a large parquet file onto disk (larger than memory). I naively thought I could be clever and use ParquetWriter and write_table to incrementally write a file, like this (POC): import pyarrow as pa import pyarrow.parquet as pq import…
Niklas B
5 votes · 2 answers

How to get the arrow package for R with lz4 support?

The R package arrow installed with install.packages('arrow') does not have lz4 support: codec_is_available('lz4') # [1] FALSE The package version is: packageVersion('arrow') # [1] ‘0.17.1’ This is on Ubuntu 20.04. How can I get an R arrow package…
James Hirschorn
5 votes · 2 answers

Arrow + Java: Populate VectorSchemaRoot (from stream / file) | Memory-Ownership | Usage patterns

I'm doing very basic experiments with Apache Arrow, mostly in regards to passing some data between Java, C++, Python using Arrow's IPC format (to file), Parquet format (to file) and IPC format (stream through JNI). C++ and Python look somewhat…
sascha
5 votes · 1 answer

How can we store a hash table in Apache Arrow?

I am pretty new to Apache Arrow so this question may be ignorant. Apache Arrow provides the capability to store data structures like primitive types/struct/array in standardised memory format, I wonder if it is possible to store more complex data…
nybon