Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.


595 questions
1
vote
1 answer

Arrow Julia to Python - Read Record Batch Stream

I am trying to read an Arrow file that I wrote as a sequence of record batches in Python. For some reason I am only getting the first struct entry. I have verified the files are bigger than one item and of expected size. with…
BAR
  • 15,909
  • 27
  • 97
  • 185
1
vote
1 answer

Questions on arrow support in pandas

Currently pandas supports 3 dtype backends: NumPy, native nullable (extended) dtypes, and pyarrow dtypes. My understanding is that Arrow will eventually replace NumPy as the pandas backend, even if that is most likely a long-term goal. Considering to…
user3022222
  • 857
  • 9
  • 9
1
vote
0 answers

Unable to enable SparkR arrow

This is the code that I used to enable Arrow: library(SparkR) Sys.setenv(SPARK_HOME = "/home/spark/spark-3.3.2-bin-hadoop3") install.packages("arrow", repos="https://cloud.r-project.org/") library(arrow) library(sparklyr) sparkR.session(master =…
Vortenzie
  • 75
  • 9
1
vote
1 answer

Difficulty with unifying schemas when trying to open arrow dataset with two different file formats in R

I'm trying to open a FileSystemDataset using arrow::open_dataset() from a directory that contains two different file formats (csv & parquet). The single parquet file also has an additional field (age_group). The approach needs to be generalisable as…
Anna Krystalli
  • 413
  • 3
  • 13
1
vote
0 answers

C++ write a parquet file with parquet::StreamWriter

I am trying to write a parquet file with some fake data. The documentation is not very clear on the types to be used with the out stream: it iterates over an unspecified data structure and accesses its methods. The following code gets as an…
roschach
  • 8,390
  • 14
  • 74
  • 124
1
vote
0 answers

How to export arrow-rs array to Java?

In the C Data Interface page of the Arrow Java documentation, the Java code allocates some memory and the C++ code fills the allocated memory. arrow-rs, however, seems to work in a different way: it fills the data first and then passes the pointer to a foreign…
Renkai
  • 1,991
  • 2
  • 13
  • 18
1
vote
1 answer

how to declare and then initialize a parquet arrow::RandomAccessFile in C++

I am trying to adapt this piece of code from the parquet C++ documentation to open a parquet file with a FileReader and get the minimum value of a column. arrow::MemoryPool* pool =…
roschach
  • 8,390
  • 14
  • 74
  • 124
1
vote
1 answer

What is the use of PyArrow Tensor class?

In the Arrow documentation there is a class named Tensor that is created from numpy ndarrays. However, the documentation is pretty sparse, and after playing with it a bit I haven't found a use case for it. For example, you can't construct a table with…
Adrian
  • 755
  • 9
  • 17
1
vote
0 answers

R: loop with unlisting `data.table` operation before saving as parquet file with `arrow` package

Because my data is massive, I am working with the data.table & arrow packages. I learned how to use the arrow package from this post: "R (data.table): computing mean after join in most efficient way". I need to do some operations before saving my data as…
PaulaSpinola
  • 531
  • 2
  • 10
1
vote
0 answers

How do I convert a Polars DataFrame to Vec?

edit: To hopefully be more concise, how do I do this? use polars::prelude::{DataFrame, NamedFrom, df}; use arrow::record_batch::RecordBatch; fn main() { let polars_df: DataFrame = df!("cat_data" => &[1.0, 2.0, 3.0, 4.0], …
1
vote
1 answer

How can I use golang apache arrow library to read repeated field for parquet?

I am using the Apache Arrow Go library to read parquet. Non-repeated columns seem straightforward, but how can I read a repeated field?
flexwang
  • 625
  • 6
  • 16
1
vote
0 answers

Apache Arrow Flight: Releasing the flight stream that was created by GetFlightInfo

According to the Arrow Flight protocol definition, a client (consumer) can let the server generate a flight stream through a specified descriptor in GetFlightInfo. And the flight stream will be available for the duration defined by the server (a…
zeodtr
  • 10,645
  • 14
  • 43
  • 60
1
vote
1 answer

How to create Apache Arrow vectors in Java, pass them to C++ code through JNI, read/write them in C++

I've been reading the Apache Arrow docs, and I've figured out how to use it in Java and C++. But what I'd like to do is offload some work to JNI (C/C++) code from Java, and the documentation (e.g. https://arrow.apache.org/docs/java/cdata.html) just…
1
vote
2 answers

What's the purpose of using pointer to std::shared_ptr in C++ library Gandiva

I'm learning the Gandiva module in Apache Arrow. I found that many APIs require parameters in the form of std::shared_ptr*, e.g. here is a typical API: static inline Status Make(SchemaPtr schema, ConditionPtr condition, std::shared_ptr
Hua
  • 184
  • 8
1
vote
1 answer

Converting characters to timestamp in an arrow table in R

I want to convert a character string to a timestamp in an arrow table. I am using arrow because I am handling a large number of sizeable CSVs. I succeeded in converting the string to a datetime object in a data frame, but the same operation produces…
Robert Hawlik
  • 442
  • 3
  • 15