Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details see this

595 questions
2
votes
1 answer

Apache Arrow OutOfMemoryException when PySpark reads Hive table to pandas

I searched for this kind of error, and I couldn't find any information on how to solve it. This is what I get when I execute the two scripts below: org.apache.arrow.memory.OutOfMemoryException: Failure while allocating memory. write.py import…
pgmank
  • 5,303
  • 5
  • 36
  • 52
2
votes
0 answers

How to read and update an object in python plasma?

I just discovered plasma https://arrow.apache.org/docs/python/plasma.html#putting-and-getting-python-objects and would like to store an object, get it, update it, and write it back to the store. Is it possible? My failing trial looks right now as…
mojovski
  • 581
  • 7
  • 21
2
votes
1 answer

How to add/change column names with pyarrow.read_csv?

I am currently trying to import a big CSV file (50GB+) without any headers into a pyarrow table, with the overall goal of exporting this file to the Parquet format and then processing it in a Pandas or Dask DataFrame. How can I specify the column…
azo91
  • 209
  • 1
  • 6
  • 15
2
votes
0 answers

How to build a golang apache arrow record that has a struct with a boolean field

When I build an arrow record that has a struct, and that struct has a field with the type arrow.FixedWidthTypes.Boolean, it later panics when trying to access the boolean value with the following error: runtime error: invalid memory address or…
Colton Morris
  • 95
  • 1
  • 5
2
votes
1 answer

CMake fails when attempting to compile a simple test program

I am attempting to follow the documentation for building pyarrow locally. Specifically, using the conda instructions: conda create -y -n pyarrow-dev -c conda-forge \ --file arrow/ci/conda_env_unix.yml \ --file arrow/ci/conda_env_cpp.yml \ …
Aleksey Bilogur
  • 3,686
  • 3
  • 30
  • 57
2
votes
2 answers

What is the difference between Apache Drill's ValueVectors and Apache Arrow?

Apache Drill has its own columnar representation like Apache Arrow. But Apache Arrow has support for more programming languages. I would like to use Apache Drill, but I also want the programming language support of Apache Arrow. Some sources…
JavaTechnical
  • 8,846
  • 8
  • 61
  • 97
2
votes
1 answer

Apache Arrow Plasma Client - Can't connect to memory store (UnsatisfiedLinkError)

I'm trying to use the Java API for Apache Arrow to connect to a memory store. I've done this in Python, successfully, using the Python API by following the guide here. I've also looked at the C++ API documentation, but it didn't help much. The Java…
SSS
  • 21
  • 2
2
votes
1 answer

How to filter records from a Parquet file using Python pyarrow

I'm trying to filter specific records from a parquet file. I'm using Python pyarrow. I managed to do it with pandas (see code below). The problem is that it takes a lot of memory for a large parquet file. I'm looking for other options - any…
Ori N
  • 555
  • 10
  • 22
2
votes
1 answer

Apache arrow, alignment of numpy array with zero copy

I convert an arrow object with "zero copy" to pandas, but the resulting object is not aligned. #create a pyarrow.table.Table from parquet file pq_file=pq.ParquetFile(parquet_file_name) arrow_table=pq_file.read() #convert pyarrow.table.Table to pandas…
1
vote
1 answer

Filter based on a list column using arrow and duckdb

I'm using the R arrow package to interact with a duckdb table that contains a list column. My goal is to filter on the list column before collecting the results into memory. Can this be accomplished on a virtual duckdb table? Example library(arrow,…
davechilders
  • 8,693
  • 2
  • 18
  • 18
1
vote
1 answer

arrow::to_duckdb coerces int64 columns to doubles

arrow::to_duckdb() converts int64 columns to a double in the duckdb table. This happens if the .data being converted is an R data frame or a parquet file. How can I maintain the int64 data type? Example library(arrow, warn.conflicts =…
davechilders
  • 8,693
  • 2
  • 18
  • 18
1
vote
1 answer

What is actually meant when referring to parquet row-group size?

I am starting to work with the parquet file format. The official Apache site recommends large row groups of 512MB to 1GB (here). Several online sources (e.g. this one) suggest that the default row group size is 128MB. I have a large number of parquet…
teejay
  • 103
  • 8
1
vote
2 answers

Arrow filter() with string expressions doesn't work with Shiny. Is there a workaround?

In contrast to this older question, filtering with string expressions does work with Arrow datasets now, but it doesn't work in the reactive Shiny environment. Is there a workaround? Here is the arrow-ified Shiny demo app. I've added a…
Art
  • 1,165
  • 6
  • 18
1
vote
1 answer

R arrow read_parquet: Call to R (seek() on R connection) from a non-R thread from an unsupported context

I am using the R arrow package read_parquet() function to read a parquet file. This function runs every night on several dozen files but as of today it has been failing for some of the files with the error message: Call to R (seek() on R…
pmac0451
  • 13
  • 4
1
vote
0 answers

ArrowInvalid: Cannot locate timezone 'UTC': Timezone database not found

I'm starting to experiment with pyarrow, and I'm hitting a strange error when writing a CSV file. Say I have this CSV input as dates.csv: dates 2022-10-04T15:52:25.000Z 2022-03-29T08:08:13.000Z 2023-01-05T19:24:13.000Z 2020-12-04T18:56:30.000Z Now,…
mrgou
  • 1,576
  • 2
  • 21
  • 45