Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details see https://arrow.apache.org/install/

595 questions
5 votes, 1 answer

Converted Apache Arrow file from a dataframe gives null when reading with arrow.js

I converted a sample dataframe to a .arrow file using pyarrow: import numpy as np import pandas as pd import pyarrow as pa df = pd.DataFrame({"a": [10, 2, 3]}) df['a'] = pd.to_numeric(df['a'], errors='coerce') table = pa.Table.from_pandas(df) writer…
Sarath • 9,030 • 11 • 51 • 84
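
A frequent cause of nulls on the arrow.js side is a format mismatch: pyarrow can emit either the IPC file or the IPC stream format, and compressed files need a JS build that can decode them. A minimal sketch of writing an uncompressed IPC file (file name illustrative) that arrow.js can parse:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [10, 2, 3]})
df["a"] = pd.to_numeric(df["a"], errors="coerce")
table = pa.Table.from_pandas(df, preserve_index=False)

# IPC *file* format; pa.ipc.new_stream() would give the streaming
# variant, which arrow.js also understands.
with pa.OSFile("sample.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)
```
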
5 votes, 4 answers

How to solve the pyspark `org.apache.arrow.vector.util.OversizedAllocationException` error by increasing Spark's memory?

I'm running a job in pyspark where at one point I use a grouped aggregate Pandas UDF. This results in the following (abbreviated here) error: org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand the buffer. I'm fairly sure this…
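
Arrow caps a single buffer at 2 GB, and a grouped-aggregate Pandas UDF materializes each whole group as one Arrow batch, so one oversized group can trigger this regardless of total executor memory. A hedged sketch of the usual configuration levers (values are illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Caps Arrow batch size for ordinary Arrow-backed exchanges.
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)

# Grouped UDFs still receive each group as a single batch, so the robust
# fix is smaller groups, e.g. salting the grouping key before groupBy().
```
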
5 votes, 1 answer

Converting Arrow to Parquet and vice versa in Java

I have been looking at ways to convert Arrow to Parquet and vice versa in Java. Even though the Python library for Arrow fully supports this conversion, I can hardly find any documentation for it in Java. Has anyone come across…
Optimus • 697 • 2 • 8 • 22
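
For reference, this is the round trip the asker mentions from the Python side; the whole conversion is two calls in pyarrow, while Java goes through separate, less-documented modules:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

pq.write_table(table, "data.parquet")          # Arrow table -> Parquet
round_tripped = pq.read_table("data.parquet")  # Parquet -> Arrow table
assert round_tripped.equals(table)
```
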
5 votes, 1 answer

Is there a Python module to read Avro files with pyarrow?

I know there is pyarrow.parquet for reading Parquet files as an Arrow table, but I'm looking for the equivalent for Avro.
djohon • 705 • 2 • 10 • 25
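
pyarrow ships no Avro reader; a common workaround (assuming the third-party fastavro package) is to decode the records in Python and assemble an Arrow table from them:

```python
import fastavro
import pyarrow as pa

with open("data.avro", "rb") as f:      # illustrative file name
    records = list(fastavro.reader(f))  # list of dicts, one per row

table = pa.Table.from_pylist(records)   # schema inferred (pyarrow >= 7)
```
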
5 votes, 1 answer

Can I [de]serialize a dictionary of dataframes in the arrow/js implementation?

I want to use Apache Arrow to send data from a Django backend to an Angular frontend. I want to use a dictionary of dataframes/tables as the payload in messages. It's possible with pyarrow to share data this way between Python microservices, but I…
gabomgp • 769 • 1 • 10 • 23
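
The IPC format serializes one table per stream, so one workable pattern is a stream per dictionary entry, with the keys carried by the enclosing message (JSON, multipart, etc.). A sketch of the Python side; names and transport are illustrative:

```python
import pyarrow as pa

tables = {
    "orders": pa.table({"id": [1, 2]}),
    "users": pa.table({"name": ["ann", "bob"]}),
}

payload = {}
for key, table in tables.items():
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    # arrow.js can rebuild each entry from these bytes on the frontend.
    payload[key] = sink.getvalue().to_pybytes()
```
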
4 votes, 0 answers

A factor column used for Hive Partitioning in write_dataset() becomes a chr column in open_dataset()

In my R code, my data has an Id column stored as a factor (its values are a combination of characters and digits), and this works well. I am storing the data in an arrow dataset that is partitioned by the Id column, which works naturally…
James • 226 • 4 • 14
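
Hive partition values live as plain text in directory names, so the factor's type is lost unless the reader is told the partition schema explicitly. Shown here in Python (the R package exposes the same idea through its partitioning argument; the path is illustrative):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Declare the partition field as dictionary-encoded, Arrow's analogue
# of an R factor, instead of letting it default to string.
part = ds.partitioning(
    pa.schema([("Id", pa.dictionary(pa.int32(), pa.string()))]),
    flavor="hive",
)
dataset = ds.dataset("my_dataset/", partitioning=part)
```
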
4 votes, 1 answer

arrow_binary data type in an R data frame

I'm trying to create an R dataframe using arrow's read_parquet function. The parquet file is stored in S3. When I read in the file, many of the columns are of type arrow_binary. How can I read in these columns as strings?
Stuart • 41 • 1
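
Those columns are plain binary in Arrow terms, which R surfaces as arrow_binary; if the bytes are known to be UTF-8 text, casting the schema recovers strings. A Python sketch of the same cast (path illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")

# Swap utf8 in for binary field by field, then cast the whole table;
# this assumes the binary columns really do hold UTF-8 text.
fields = [
    pa.field(f.name, pa.string()) if pa.types.is_binary(f.type) else f
    for f in table.schema
]
table = table.cast(pa.schema(fields))
```
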
4 votes, 1 answer

Random sampling of parquet prior to collect

I want to randomly sample a dataset. If I already have that dataset loaded, I can do something like this: library(dplyr) set.seed(-1) mtcars %>% slice_sample(n = 3) # mpg cyl disp hp drat wt qsec vs am gear carb # AMC Javelin …
Dan • 11,370 • 4 • 43 • 68
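
With the dataset API the row count is cheap to obtain without materializing the data, so one approach is to draw random indices first and take only those rows. A sketch in Python with pyarrow.dataset (path illustrative):

```python
import numpy as np
import pyarrow.dataset as ds

dataset = ds.dataset("cars/", format="parquet")
n = dataset.count_rows()        # metadata only, no rows loaded

rng = np.random.default_rng(1)  # fixed seed for reproducibility
idx = np.sort(rng.choice(n, size=3, replace=False))
sample = dataset.take(idx)      # materializes just the sampled rows
```
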
4 votes, 2 answers

How to sort a Pyarrow table?

How do I sort an Arrow table in PyArrow? There does not appear to be a single function that will do this; the closest is sort_indices.
Contango • 76,540 • 58 • 260 • 305
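
Newer pyarrow (7.0+) added Table.sort_by, which wraps exactly that two-step dance; on older versions sort_indices plus take does the same thing:

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"a": [3, 1, 2]})

# One call on pyarrow >= 7.0:
by_a = table.sort_by([("a", "ascending")])

# Equivalent two-step form for older versions:
idx = pc.sort_indices(table, sort_keys=[("a", "ascending")])
by_a_old = table.take(idx)
```
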
4 votes, 2 answers

How to connect to parquet files in Azure Blob Storage with arrow::open_dataset?

I am open to other ways of doing this. Here are my constraints: I have parquet files in a container in Azure Blob Storage; these parquet files will be partitioned by a product id, as well as the date (year/month/day); I am doing this in R, and want…
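
For comparison, the Python bindings reach Blob Storage through any fsspec filesystem, e.g. the third-party adlfs package; account, key, and paths below are placeholders:

```python
import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem  # fsspec implementation for Azure

fs = AzureBlobFileSystem(account_name="myaccount", account_key="<key>")

dataset = ds.dataset(
    "mycontainer/products/",  # container/prefix are placeholders
    filesystem=fs,
    format="parquet",
    partitioning="hive",      # product_id=.../year=.../month=.../day=...
)
table = dataset.to_table()
```
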
4 votes, 3 answers

How would I go about converting a .csv to an .arrow file without loading it all into memory?

I found a similar question here: Read CSV with PyArrow. In this answer it references sys.stdin.buffer and sys.stdout.buffer, but I am not exactly sure how that would be used to write the .arrow file, or name it. I can't seem to find the exact…
kasbah512 • 43 • 4
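
pyarrow's streaming CSV reader holds only one batch in memory at a time, so the conversion can be a straight batch-by-batch copy (file names illustrative):

```python
import pyarrow as pa
from pyarrow import csv

reader = csv.open_csv("big.csv")  # streaming reader, not read_csv

with pa.OSFile("big.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, reader.schema) as writer:
        for batch in reader:      # one record batch at a time
            writer.write_batch(batch)
```
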
4 votes, 2 answers

How to partition a large Julia DataFrame to an Arrow file and process each partition sequentially when reading the data

I am working with very large DataFrames in Julia, resulting in out-of-memory errors when I do joins and other manipulations on the data. Fortunately, the data can be partitioned on an identifier column. I want to persist the partitioned DataFrame…
Kobus Herbst • 415 • 2 • 12
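
The IPC file format supports this in any implementation: write one record batch per partition, then fetch batches individually when reading (Arrow.jl's Arrow.Stream gives the sequential read on the Julia side). A Python sketch of the pattern with made-up columns:

```python
import pyarrow as pa

schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

with pa.OSFile("partitioned.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        for part in range(3):  # one record batch per partition
            batch = pa.record_batch(
                [pa.array([part] * 4, pa.int64()),
                 pa.array([0.5] * 4, pa.float64())],
                schema=schema,
            )
            writer.write_batch(batch)

with pa.OSFile("partitioned.arrow", "rb") as f:
    reader = pa.ipc.open_file(f)
    for i in range(reader.num_record_batches):  # nothing else materialized
        batch = reader.get_batch(i)
```
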
4 votes, 1 answer

How to concatenate Apache Arrow files with identical structure in Julia

How can I concatenate several Arrow files with identical structure into a single Arrow file without reading each file into memory? I am using Arrow.jl and the Arrow files represent dataframes with identical structure and the combined dataframes are…
Kobus Herbst • 415 • 2 • 12
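
Because the inputs share a schema, their record batches can be streamed straight into a single writer without ever holding a full table in memory; a Python sketch (file names are placeholders):

```python
import pyarrow as pa

inputs = ["part1.arrow", "part2.arrow"]

with pa.OSFile("combined.arrow", "wb") as sink:
    writer = None
    for path in inputs:
        with pa.memory_map(path, "r") as source:
            reader = pa.ipc.open_file(source)
            if writer is None:  # take the schema from the first file
                writer = pa.ipc.new_file(sink, reader.schema)
            for i in range(reader.num_record_batches):
                writer.write_batch(reader.get_batch(i))
    if writer is not None:
        writer.close()
```
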
4 votes, 1 answer

Unable to filter DataFrame created from Arrow table

I have the following function in Julia that uses Arrow.jl to read data from disk and process it: function getmembershipsdays(fromId, toId) memberships = Arrow.Table("HouseholdMemberships.arrow") |> DataFrame …
Kobus Herbst • 415 • 2 • 12
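
One plausible culprit: tables read straight from an Arrow file sit on immutable, often memory-mapped buffers, so in-place mutation of the wrapping dataframe can fail; filtering on the Arrow side (or copying the columns first) avoids that. The Python equivalent of such a filter, with a hypothetical column name:

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.ipc.open_file("HouseholdMemberships.arrow").read_all()

# "HouseholdId" is a hypothetical column; build a mask, then filter.
mask = pc.and_(
    pc.greater_equal(table["HouseholdId"], pa.scalar(100)),
    pc.less_equal(table["HouseholdId"], pa.scalar(200)),
)
filtered = table.filter(mask)
```
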
4 votes, 3 answers

How to write a pandas dataframe to .arrow file

How can I write a pandas dataframe to disk in .arrow format? I'd like to be able to read the arrow file into Arquero as demonstrated here.
RobinL • 11,009 • 8 • 48 • 68
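
Feather v2 is the Arrow IPC file format under another name, so pyarrow's write_feather with a .arrow extension produces a file that Arquero's loadArrow can ingest; compression is disabled here because JS readers may lack LZ4/ZSTD support:

```python
import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Feather v2 == Arrow IPC file format, so .arrow is a valid extension.
feather.write_feather(df, "df.arrow", compression="uncompressed")
```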