Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details see this

595 questions
2
votes
1 answer

Apache Arrow OutOfMemoryException when PySpark reads Hive table to pandas

I searched for this kind of error, and I couldn't find any information on how to solve it. This is what I get when I execute the two scripts below: org.apache.arrow.memory.OutOfMemoryException: Failure while allocating memory. write.py import…
pgmank
  • 5,303
  • 5
  • 36
  • 52
2
votes
0 answers

How to read and update an object in python plasma?

I just discovered plasma https://arrow.apache.org/docs/python/plasma.html#putting-and-getting-python-objects and would like to store an object, get it, update it, and write it back to the store. Is it possible? My failing trial looks right now as…
mojovski
  • 581
  • 7
  • 21
2
votes
1 answer

How to add/change column names with pyarrow.read_csv?

I am currently trying to import a big CSV file (50GB+) without any headers into a pyarrow table, with the overall goal of exporting this file to the Parquet format and then processing it in a Pandas or Dask DataFrame. How can I specify the column…
azo91
  • 209
  • 1
  • 6
  • 15
2
votes
0 answers

How to build a golang apache arrow record that has a struct with a boolean field

When I build an arrow record that has a struct, and that struct has a field with the type arrow.FixedWidthTypes.Boolean, it later panics when trying to access the boolean value with the following error: runtime error: invalid memory address or…
Colton Morris
  • 95
  • 1
  • 5
2
votes
1 answer

CMake fails when attempting to compile a simple test program

I am attempting to follow the documentation for building pyarrow locally. Specifically, using the conda instructions: conda create -y -n pyarrow-dev -c conda-forge \ --file arrow/ci/conda_env_unix.yml \ --file arrow/ci/conda_env_cpp.yml \ …
Aleksey Bilogur
  • 3,686
  • 3
  • 30
  • 57
2
votes
2 answers

What is the difference between Apache Drill's ValueVectors and Apache Arrow?

Apache Drill has its own columnar representation like Apache Arrow. But Apache Arrow has support for more programming languages. I would like to use Apache Drill, but I also want the programming language support of Apache Arrow. Some sources…
JavaTechnical
  • 8,846
  • 8
  • 61
  • 97
2
votes
1 answer

Apache Arrow Plasma Client - Can't connect to memory store (UnsatisfiedLinkError)

I'm trying to use the Java API for Apache Arrow to connect to a memory store. I've done this in Python, successfully, using the Python API by following the guide here. I've also looked at the C++ API documentation, but it didn't help much. The Java…
SSS
  • 21
  • 2
2
votes
1 answer

How to filter records from a Parquet file using Python pyarrow

I'm trying to filter specific records from a parquet file. I'm using Python pyarrow. I managed to do it with pandas (see code below). The problem is that it takes a lot of memory for a large parquet file. I'm looking for other options - any…
Ori N
  • 555
  • 10
  • 22
2
votes
1 answer

Apache arrow, alignment of numpy array with zero copy

I convert an arrow object with "zero copy" to pandas, but the resulting object is not aligned. #create a pyarrow.table.Table from parquet file pq_file=pq.ParquetFile(parquet_file_name) arrow_table=pq_file.read() #convert pyarrow.table.Table to pandas…
1
vote
1 answer

Filter based on a list column using arrow and duckdb

I'm using the R arrow package to interact with a duckdb table that contains a list column. My goal is to filter on the list column before collecting the results into memory. Can this be accomplished on a virtual duckdb table? Example library(arrow,…
davechilders
  • 8,693
  • 2
  • 18
  • 18
1
vote
1 answer

arrow::to_duckdb coerces int64 columns to doubles

arrow::to_duckdb() converts int64 columns to a double in the duckdb table. This happens if the .data being converted is an R data frame or a parquet file. How can I maintain the int64 data type? Example library(arrow, warn.conflicts =…
davechilders
  • 8,693
  • 2
  • 18
  • 18
1
vote
1 answer

What is actually meant when referring to parquet row-group size?

I am starting to work with the parquet file format. The official Apache site recommends large row groups of 512MB to 1GB (here). Several online sources (e.g. this one) suggest that the default row group size is 128MB. I have a large number of parquet…
teejay
  • 103
  • 8
1
vote
2 answers

Arrow filter() with string expressions doesn't work with Shiny. Is there a workaround?

In contrast to this older question, filtering with string expressions does work with Arrow datasets now, but it doesn't work in the reactive Shiny environment. Is there a workaround? Here is the arrow-ified Shiny demo app. I've added a…
Art
  • 1,165
  • 6
  • 18
1
vote
1 answer

R arrow read_parquet: Call to R (seek() on R connection) from a non-R thread from an unsupported context

I am using the R arrow package read_parquet() function to read a parquet file. This function runs every night on several dozen files but as of today it has been failing for some of the files with the error message: Call to R (seek() on R…
pmac0451
  • 13
  • 4
1
vote
0 answers

ArrowInvalid: Cannot locate timezone 'UTC': Timezone database not found

I'm starting to experiment with pyarrow, and I'm hitting a strange error when writing a CSV file. Say I have this CSV input as dates.csv: dates 2022-10-04T15:52:25.000Z 2022-03-29T08:08:13.000Z 2023-01-05T19:24:13.000Z 2020-12-04T18:56:30.000Z Now,…
mrgou
  • 1,576
  • 2
  • 21
  • 45