Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

595 questions
1
vote
0 answers

unable to cast refs returned from `ChunkedArray::chunks` to concrete arrow type

I'm trying to extract the raw buffers from a ChunkedArray. The arrow2 documentation suggests doing this by casting a `&dyn arrow2::array::Array` to its concrete type, see here. This seems to work fine when I create an arrow buffer directly, however,…
1
vote
2 answers

Trying to use arrow-dataset java library but got missing arrow_dataset_jni.dll error

I followed the Maven instructions to include arrow-dataset in pom.xml. However, when running the code, it complained that arrow-dataset-jni.dll was not found. How do I create or install the DLL? Thank you, J
Jac
  • 29
  • 2
1
vote
2 answers

PyArrow issue with timestamp data

I am trying to load data from a CSV file into a Parquet file using pyarrow. I am using the convert options to set the columns to their proper data types and then using the timestamp_parsers option to dictate how the timestamp data should be interpreted:…
1
vote
2 answers

Selecting deep columns in pyarrow.dataset parquet

Let's say I have a deeply nested arrow table like: pyarrow.Table arr: struct not null, b: list not null> not null> child 0, arr: struct not null, b:…
mdurant
  • 27,272
  • 5
  • 45
  • 74
1
vote
2 answers

Java Apache Arrow Copy data from one VectorSchemaRoot to another

If I have a VectorSchemaRoot that already contains data using the Java Apache Arrow library, how would I go about copying that data to another VectorSchemaRoot?
jjbskir
  • 8,474
  • 9
  • 40
  • 53
1
vote
0 answers

can't import pyarrow on macOS because symbol __Py_FatalErrorFunc is not found in lib.cpython-37m-darwin.so

I'm on macOS Monterey 12.5 with an M1 chip, using Python 3.7.13 in a virtualenv created as follows: `pyenv install 3.7.13`, `pyenv virtualenv 3.7.13 qtrainer`, `pyenv activate qtrainer`. OpenSSL version is 1.1.1q; apache-arrow version is 9.0.0. My .zshrc file…
Borbag
  • 597
  • 4
  • 21
1
vote
0 answers

reading multiple parquet files in java takes an unreasonable amount of memory

Reading 20 uncompressed parquet files with a total size of 3.2GB takes more than 12GB of RAM when reading them "concurrently". "Concurrently" means that I need to read the second file before closing the first file, not multithreading. The data is time…
driedplum
  • 11
  • 1
1
vote
2 answers

How to avoid getting a memory leak while copying a VectorSchemaRoot

I need to copy all of the contents of a stream of VectorSchemaRoots into a single object: Stream data = fetchStream(); VectorSchemaRoot finalResult = VectorSchemaRoot.create(schema, allocator); VectorLoader = new…
Pablo
  • 1,302
  • 1
  • 16
  • 35
1
vote
0 answers

memory use for reading the same .csv file using base::read.csv(), readr::read_csv(), data.table::fread(), and arrow::read_csv_arrow() in R

I tried to read the same .csv file using different functions in R (base::read.csv(), readr::read_csv(), data.table::fread(), and arrow::read_csv_arrow()), but this same file leads to very different sizes in memory. See an example…
Miao Cai
  • 902
  • 9
  • 25
1
vote
1 answer

What is the difference between StringType and LargeStringType in Apache Arrow?

According to documentation: class arrow::StringType : public arrow::BinaryType #include Concrete type class for variable-size string data, utf8-encoded. class arrow::LargeStringType : public arrow::LargeBinaryType #include…
1
vote
1 answer

list not supported in join non-key field?

I am trying to join 2 Arrow tables where some columns are of list data type. Note that my join columns/keys are primitive data types and some of my non-join columns are of list type. However, PyArrow join() cannot join such a table, although…
1
vote
0 answers

Authenticating R Arrow With Temporary AWS Credentials in a Profile?

I am trying to use the arrow R package to read a parquet file from S3. The documentation only describes how to specify AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY when authenticating for access to a private S3 bucket. However, I have to generate…
Ramón J Romero y Vigil
  • 17,373
  • 7
  • 77
  • 125
1
vote
1 answer

How to query pyarrow table struct field

I have a table, let's say 2 columns A (list), B (list) and 2 rows: A: ["X", "Y"], ["Y", "Z"] B: [1, 3], [5, 6] I'd like to achieve something like SELECT * FROM table WHERE A.Y = 5 and it'd return a single (second) row. How do I achieve this using…
alippai
  • 168
  • 1
  • 6
1
vote
0 answers

how to vectorize arrow::compute::Take?

I have a large array input_array and an array of offsets take_array. I want to return the elements at those offsets very fast. Can I vectorize it for the arrow array? If so, how? arrow::compute::Take(input_array, take_array) Use Case: I…
cpchung
  • 774
  • 3
  • 8
  • 23
1
vote
1 answer

Is there an established means of using AzureStor and arrow together in R?

In the arrow R guide there's info about using S3 buckets but nothing about using Azure cloud storage. There's an unrelated package AzureStor which connects to Azure Storage but uses different syntax so they don't (seemingly) work together. Is there…
Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72