Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details see https://arrow.apache.org/install/

595 questions
5 votes, 1 answer

Converted Apache Arrow file from a dataframe gives null when reading with arrow.js

I converted a sample dataframe to a .arrow file using pyarrow: import numpy as np import pandas as pd import pyarrow as pa df = pd.DataFrame({"a": [10, 2, 3]}) df['a'] = pd.to_numeric(df['a'], errors='coerce') table = pa.Table.from_pandas(df) writer…
Sarath • 9,030 • 11 • 51 • 84
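
A frequent cause of nulls on the arrow.js side is a format mismatch: pyarrow can emit either the IPC file or the IPC stream format, and compressed files need a JS build that can decode them. A minimal sketch of writing an uncompressed IPC file (file name illustrative) that arrow.js can parse:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [10, 2, 3]})
df["a"] = pd.to_numeric(df["a"], errors="coerce")
table = pa.Table.from_pandas(df, preserve_index=False)

# IPC *file* format; pa.ipc.new_stream() would give the streaming
# variant, which arrow.js also understands.
with pa.OSFile("sample.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)
```
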
5 votes, 4 answers

How to solve the pyspark `org.apache.arrow.vector.util.OversizedAllocationException` error by increasing Spark's memory?

I'm running a job in pyspark where at one point I use a grouped aggregate Pandas UDF. This results in the following (abbreviated here) error: org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand the buffer. I'm fairly sure this…
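
Arrow caps a single buffer at 2 GB, and a grouped-aggregate Pandas UDF materializes each whole group as one Arrow batch, so one oversized group can trigger this regardless of total executor memory. A hedged sketch of the usual configuration levers (values are illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Caps Arrow batch size for ordinary Arrow-backed exchanges.
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)

# Grouped UDFs still receive each group as a single batch, so the robust
# fix is smaller groups, e.g. salting the grouping key before groupBy().
```
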
5 votes, 1 answer

Converting Arrow to Parquet and vice versa in Java

I have been looking at ways to convert Arrow to Parquet and vice versa in Java. Even though the Python library for Arrow fully supports this conversion, I can hardly find any documentation for it in Java. Has anyone come across…
Optimus • 697 • 2 • 8 • 22
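
For reference, this is the round trip the asker mentions from the Python side; the whole conversion is two calls in pyarrow, while Java goes through separate, less-documented modules:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

pq.write_table(table, "data.parquet")          # Arrow table -> Parquet
round_tripped = pq.read_table("data.parquet")  # Parquet -> Arrow table
assert round_tripped.equals(table)
```
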
5 votes, 1 answer

Is there a Python module to read Avro files with pyarrow?

I know there is pyarrow.parquet for reading Parquet files as an Arrow table, but I'm looking for the equivalent for Avro.
djohon • 705 • 2 • 10 • 25
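
pyarrow ships no Avro reader; a common workaround (assuming the third-party fastavro package) is to decode the records in Python and assemble an Arrow table from them:

```python
import fastavro
import pyarrow as pa

with open("data.avro", "rb") as f:      # illustrative file name
    records = list(fastavro.reader(f))  # list of dicts, one per row

table = pa.Table.from_pylist(records)   # schema inferred (pyarrow >= 7)
```
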
5 votes, 1 answer

Can I [de]serialize a dictionary of dataframes in the arrow/js implementation?

I want to use Apache Arrow to send data from a Django backend to an Angular frontend. I want to use a dictionary of dataframes/tables as the payload in messages. It's possible with pyarrow to share data this way between Python microservices, but I…
gabomgp • 769 • 1 • 10 • 23
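
The IPC format serializes one table per stream, so one workable pattern is a stream per dictionary entry, with the keys carried by the enclosing message (JSON, multipart, etc.). A sketch of the Python side; names and transport are illustrative:

```python
import pyarrow as pa

tables = {
    "orders": pa.table({"id": [1, 2]}),
    "users": pa.table({"name": ["ann", "bob"]}),
}

payload = {}
for key, table in tables.items():
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    # arrow.js can rebuild each entry from these bytes on the frontend.
    payload[key] = sink.getvalue().to_pybytes()
```
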
4 votes, 0 answers

A factor column used for Hive Partitioning in write_dataset() becomes a chr column in open_dataset()

In my R code, my data has an Id column stored as a factor (its values are a combination of characters and digits), and this works well. I am storing the data in an arrow dataset that is partitioned by the Id column, which works naturally…
James • 226 • 4 • 14
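
Hive partition values live as plain text in directory names, so the factor's type is lost unless the reader is told the partition schema explicitly. Shown here in Python (the R package exposes the same idea through its partitioning argument; the path is illustrative):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Declare the partition field as dictionary-encoded, Arrow's analogue
# of an R factor, instead of letting it default to string.
part = ds.partitioning(
    pa.schema([("Id", pa.dictionary(pa.int32(), pa.string()))]),
    flavor="hive",
)
dataset = ds.dataset("my_dataset/", partitioning=part)
```
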
4 votes, 1 answer

arrow_binary data type in an R data frame

I'm trying to create an R dataframe using arrow's read_parquet function. The parquet file is stored in S3. When I read in the file, many of the columns are of type arrow_binary. How can I read in these columns as strings?
Stuart • 41 • 1
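
Those columns are plain binary in Arrow terms, which R surfaces as arrow_binary; if the bytes are known to be UTF-8 text, casting the schema recovers strings. A Python sketch of the same cast (path illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")

# Swap utf8 in for binary field by field, then cast the whole table;
# this assumes the binary columns really do hold UTF-8 text.
fields = [
    pa.field(f.name, pa.string()) if pa.types.is_binary(f.type) else f
    for f in table.schema
]
table = table.cast(pa.schema(fields))
```
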
4 votes, 1 answer

Random sampling of parquet prior to collect

I want to randomly sample a dataset. If I already have that dataset loaded, I can do something like this: library(dplyr) set.seed(-1) mtcars %>% slice_sample(n = 3) # mpg cyl disp hp drat wt qsec vs am gear carb # AMC Javelin …
Dan • 11,370 • 4 • 43 • 68
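
With the dataset API the row count is cheap to obtain without materializing the data, so one approach is to draw random indices first and take only those rows. A sketch in Python with pyarrow.dataset (path illustrative):

```python
import numpy as np
import pyarrow.dataset as ds

dataset = ds.dataset("cars/", format="parquet")
n = dataset.count_rows()        # metadata only, no rows loaded

rng = np.random.default_rng(1)  # fixed seed for reproducibility
idx = np.sort(rng.choice(n, size=3, replace=False))
sample = dataset.take(idx)      # materializes just the sampled rows
```
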
4 votes, 2 answers

How to sort a Pyarrow table?

How do I sort an Arrow table in PyArrow? There does not appear to be a single function that will do this; the closest is sort_indices.
Contango • 76,540 • 58 • 260 • 305
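
Newer pyarrow (7.0+) added Table.sort_by, which wraps exactly that two-step dance; on older versions sort_indices plus take does the same thing:

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"a": [3, 1, 2]})

# One call on pyarrow >= 7.0:
by_a = table.sort_by([("a", "ascending")])

# Equivalent two-step form for older versions:
idx = pc.sort_indices(table, sort_keys=[("a", "ascending")])
by_a_old = table.take(idx)
```
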
4 votes, 2 answers

How to connect to parquet files in Azure Blob Storage with arrow::open_dataset?

I am open to other ways of doing this. Here are my constraints: I have parquet files in a container in Azure Blob Storage; these parquet files will be partitioned by a product id, as well as the date (year/month/day); I am doing this in R, and want…
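
For comparison, the Python bindings reach Blob Storage through any fsspec filesystem, e.g. the third-party adlfs package; account, key, and paths below are placeholders:

```python
import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem  # fsspec implementation for Azure

fs = AzureBlobFileSystem(account_name="myaccount", account_key="<key>")

dataset = ds.dataset(
    "mycontainer/products/",  # container/prefix are placeholders
    filesystem=fs,
    format="parquet",
    partitioning="hive",      # product_id=.../year=.../month=.../day=...
)
table = dataset.to_table()
```
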
4 votes, 3 answers

How would I go about converting a .csv to an .arrow file without loading it all into memory?

I found a similar question here: Read CSV with PyArrow. In this answer it references sys.stdin.buffer and sys.stdout.buffer, but I am not exactly sure how that would be used to write the .arrow file, or name it. I can't seem to find the exact…
kasbah512 • 43 • 4
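
pyarrow's streaming CSV reader holds only one batch in memory at a time, so the conversion can be a straight batch-by-batch copy (file names illustrative):

```python
import pyarrow as pa
from pyarrow import csv

reader = csv.open_csv("big.csv")  # streaming reader, not read_csv

with pa.OSFile("big.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, reader.schema) as writer:
        for batch in reader:      # one record batch at a time
            writer.write_batch(batch)
```
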
4 votes, 2 answers

How to partition a large Julia DataFrame to an Arrow file and process each partition sequentially when reading the data

I am working with very large DataFrames in Julia, resulting in out-of-memory errors when I do joins and other manipulations on the data. Fortunately, the data can be partitioned on an identifier column. I want to persist the partitioned DataFrame…
Kobus Herbst • 415 • 2 • 12
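
The IPC file format supports this in any implementation: write one record batch per partition, then fetch batches individually when reading (Arrow.jl's Arrow.Stream gives the sequential read on the Julia side). A Python sketch of the pattern with made-up columns:

```python
import pyarrow as pa

schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

with pa.OSFile("partitioned.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        for part in range(3):  # one record batch per partition
            batch = pa.record_batch(
                [pa.array([part] * 4, pa.int64()),
                 pa.array([0.5] * 4, pa.float64())],
                schema=schema,
            )
            writer.write_batch(batch)

with pa.OSFile("partitioned.arrow", "rb") as f:
    reader = pa.ipc.open_file(f)
    for i in range(reader.num_record_batches):  # nothing else materialized
        batch = reader.get_batch(i)
```
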
4 votes, 1 answer

How to concatenate Apache Arrow files with identical structure in Julia

How can I concatenate several Arrow files with identical structure into a single Arrow file without reading each file into memory? I am using Arrow.jl and the Arrow files represent dataframes with identical structure and the combined dataframes are…
Kobus Herbst • 415 • 2 • 12
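
Because the inputs share a schema, their record batches can be streamed straight into a single writer without ever holding a full table in memory; a Python sketch (file names are placeholders):

```python
import pyarrow as pa

inputs = ["part1.arrow", "part2.arrow"]

with pa.OSFile("combined.arrow", "wb") as sink:
    writer = None
    for path in inputs:
        with pa.memory_map(path, "r") as source:
            reader = pa.ipc.open_file(source)
            if writer is None:  # take the schema from the first file
                writer = pa.ipc.new_file(sink, reader.schema)
            for i in range(reader.num_record_batches):
                writer.write_batch(reader.get_batch(i))
    if writer is not None:
        writer.close()
```
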
4 votes, 1 answer

Unable to filter DataFrame created from Arrow table

I have the following function in Julia that uses Arrow.jl to read data from disk and process it: function getmembershipsdays(fromId, toId) memberships = Arrow.Table("HouseholdMemberships.arrow") |> DataFrame …
Kobus Herbst • 415 • 2 • 12
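
One plausible culprit: tables read straight from an Arrow file sit on immutable, often memory-mapped buffers, so in-place mutation of the wrapping dataframe can fail; filtering on the Arrow side (or copying the columns first) avoids that. The Python equivalent of such a filter, with a hypothetical column name:

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.ipc.open_file("HouseholdMemberships.arrow").read_all()

# "HouseholdId" is a hypothetical column; build a mask, then filter.
mask = pc.and_(
    pc.greater_equal(table["HouseholdId"], pa.scalar(100)),
    pc.less_equal(table["HouseholdId"], pa.scalar(200)),
)
filtered = table.filter(mask)
```
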
4 votes, 3 answers

How to write a pandas dataframe to .arrow file

How can I write a pandas dataframe to disk in .arrow format? I'd like to be able to read the arrow file into Arquero as demonstrated here.
RobinL • 11,009 • 8 • 48 • 68
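
Feather v2 is the Arrow IPC file format under another name, so pyarrow's write_feather with a .arrow extension produces a file that Arquero's loadArrow can ingest; compression is disabled here because JS readers may lack LZ4/ZSTD support:

```python
import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Feather v2 == Arrow IPC file format, so .arrow is a valid extension.
feather.write_feather(df, "df.arrow", compression="uncompressed")
```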