Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the official Apache Arrow installation instructions.

595 questions
7
votes
1 answer

How reproducible / deterministic is Parquet format?

I'm seeking advice from people deeply familiar with the binary layout of Apache Parquet: Given a data transformation F(a) = b, where F is fully deterministic, and the exact same versions of the entire software stack (framework, arrow & parquet…
sergey
  • 73
  • 4
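
A minimal way to probe byte-level reproducibility is to write the same table twice with pinned writer options and compare digests; the file names and option values below are illustrative, and settings such as compression, row-group size, and writer metadata all influence the byte layout.

    import hashlib
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

    def sha256_of(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    # Write the identical table twice with identical, pinned options.
    for path in ("run_a.parquet", "run_b.parquet"):
        pq.write_table(table, path, compression="snappy", row_group_size=1024)

    # Equal digests mean the layout was reproducible for this particular stack.
    print(sha256_of("run_a.parquet") == sha256_of("run_b.parquet"))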
7
votes
1 answer

How to read/write partitioned Apache Arrow or Parquet files into/out of Julia

I am trying to read and write a trivial dataset in Julia. The dataset is mtcars, taken from R, with an arbitrarily added column bt of random Boolean values. The file/folder structure (below) was written out using the R arrow package. The files…
tinker
  • 96
  • 2
7
votes
2 answers

Proper Syntax for Filtering Expressions for Arrow Datasets in R

I am attempting to use the arrow package's (relatively recently implemented) Dataset API to read a directory of files into memory and leverage the C++ back-end to filter rows and columns. I would like to use the arrow package functions directly,…
Matt Summersgill
  • 4,054
  • 18
  • 47
7
votes
2 answers

AWS EMR - ModuleNotFoundError: No module named 'pyarrow'

I am running into this problem with the Apache Arrow Spark integration, using AWS EMR with Spark 2.4.3. I tested this on both a local single-machine Spark instance and a Cloudera cluster, and everything works fine there. I set these in spark-env.sh: export…
thePurplePython
  • 2,621
  • 1
  • 13
  • 34
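
The usual cause on EMR is that pyarrow is installed on the driver but not on the worker nodes, so a quick sanity check is to force every partition to import it. This is a sketch assuming a running Spark 2.4 session; it uses the spark.sql.execution.arrow.enabled key from that release line.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # Spark 2.4.x key

    def arrow_version(_):
        # Raises ModuleNotFoundError on any executor missing pyarrow.
        import pyarrow
        return pyarrow.__version__

    versions = (
        spark.sparkContext
        .parallelize(range(100), 100)
        .map(arrow_version)
        .distinct()
        .collect()
    )
    print(versions)  # ideally a single version, matching the driver

If this check fails, pyarrow needs to be installed on every node, e.g. via an EMR bootstrap action, rather than only on the master.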
7
votes
2 answers

Datatype issues when converting parquet data to a pandas dataframe

I have a problem with data types when converting a parquet file to a dataframe. I do bucket = 's3://some_bucket/test/usages' import pyarrow.parquet as pq import s3fs s3 = s3fs.S3FileSystem() read_pq = pq.ParquetDataset(bucket,…
clog14
  • 1,549
  • 1
  • 16
  • 32
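
One way to see where the types change is to inspect the Arrow schema before calling to_pandas(): integer columns containing nulls are widened to float64 and string columns come back as object dtype during the conversion. A small sketch; the bucket path is the placeholder from the question.

    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()
    dataset = pq.ParquetDataset("s3://some_bucket/test/usages", filesystem=fs)
    table = dataset.read()

    print(table.schema)      # types as stored in Parquet
    df = table.to_pandas()   # pandas may widen types, e.g. int64 -> float64 with nulls
    print(df.dtypes)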
7
votes
1 answer

Convert Pandas DataFrame to & from In-Memory Feather

Using the IO tools in pandas it is possible to convert a DataFrame to an in-memory feather buffer: import pandas as pd from io import BytesIO df = pd.DataFrame({'a': [1,2], 'b': [3.0,4.0]}) buf = BytesIO() df.to_feather(buf) However, using…
Ramón J Romero y Vigil
  • 17,373
  • 7
  • 77
  • 125
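
A round trip through an in-memory buffer can be completed with pyarrow.feather.read_feather after rewinding the buffer; this is a sketch of one workable pairing, not the only one.

    from io import BytesIO

    import pandas as pd
    import pyarrow.feather as feather

    df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

    buf = BytesIO()
    df.to_feather(buf)   # write Feather bytes into the in-memory buffer
    buf.seek(0)          # rewind before reading back

    df_back = feather.read_feather(buf)
    print(df_back.equals(df))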
7
votes
1 answer

PySpark: Invalid returnType with scalar Pandas UDFs

I'm trying to return a specific structure from a pandas_udf. It worked on one cluster but fails on another. I run the UDF on groups, which requires the return type to be a data frame. from pyspark.sql.functions import pandas_udf import pandas…
Omri374
  • 2,555
  • 3
  • 26
  • 40
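
A UDF that is applied per group and returns a data frame must be declared as a grouped-map pandas UDF with a StructType schema; declaring it as a scalar UDF is one common way to hit the "Invalid returnType" error. A minimal Spark 2.x-style sketch with made-up data (Spark 3 offers applyInPandas instead):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()
    sdf = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v"])

    @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
    def demean(pdf):
        # Receives one group as a pandas DataFrame and must return a DataFrame
        # matching the declared schema.
        return pdf.assign(v=pdf.v - pdf.v.mean())

    sdf.groupby("id").apply(demean).show()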
6
votes
1 answer

How can I chunk through a CSV using Arrow?

What I am trying to do I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have plenty of columns and have a high memory footprint (enough to crash the machine running the job). I am trying to chunk through the…
alt-f4
  • 2,112
  • 17
  • 49
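
One approach is to stream the CSV in record batches and append each batch to a ParquetWriter, so peak memory stays near the size of a single batch. A sketch with placeholder file names:

    import pyarrow as pa
    import pyarrow.csv as csv
    import pyarrow.parquet as pq

    reader = csv.open_csv("big_input.csv")   # streaming reader, yields RecordBatches
    writer = None
    try:
        for batch in reader:
            if writer is None:
                # Create the Parquet file lazily, once the schema is known.
                writer = pq.ParquetWriter("output.parquet", batch.schema)
            writer.write_table(pa.Table.from_batches([batch]))
    finally:
        if writer is not None:
            writer.close()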
6
votes
1 answer

How to pass array column as argument in VectorUdf in .Net Spark?

I'm trying to implement a vector UDF in C# Spark. I have created the .NET Spark environment by following Spark .NET. Vector UDFs (both Apache Arrow and Microsoft.Data.Analysis) worked for me for an IntegerType column. Now I'm trying to send the integer array…
6
votes
1 answer

How does apache arrow facilitate "No overhead for cross-system communication"?

I've been very interested in Apache Arrow for a bit now due to the promises of "zero copy reads", "zero serde", and "No overhead for cross-system communication". My understanding of the project (through the lens of pyarrow) is that it describes the…
kemri
  • 149
  • 12
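
One concrete illustration of the zero-copy claim is reading an Arrow IPC file through a memory map: the resulting table's buffers reference the mapped file rather than freshly allocated memory. A small sketch with illustrative file names:

    import pyarrow as pa
    import pyarrow.ipc as ipc

    table = pa.table({"x": list(range(1000))})

    # Write the table once in the Arrow IPC file format.
    with pa.OSFile("data.arrow", "wb") as sink:
        with ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Memory-map and read back: column buffers point into the mapped file.
    source = pa.memory_map("data.arrow", "r")
    loaded = ipc.open_file(source).read_all()
    print(loaded.num_rows, pa.total_allocated_bytes())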
6
votes
1 answer

How to convert arrow::Array to std::vector?

I have an Apache Arrow array that is created by reading a file. std::shared_ptr<arrow::Array> array; PARQUET_THROW_NOT_OK(reader->ReadColumn(0, &array)); Is there a way to convert it to std::vector or any other native array type in C++?
motam79
  • 3,542
  • 5
  • 34
  • 60
6
votes
0 answers

How to write a simple, unwrapped, byte array to an Apache-Arrow ListWriter

I'm currently writing some code to convert an arbitrary data structure to Apache Arrow vectors and got stuck on something relatively simple, namely, how to write a byte[] to a ListVector. When writing data to a ListVector through a…
Shastick
  • 1,218
  • 1
  • 12
  • 29
6
votes
1 answer

How to load a CSV file into Apache Arrow vectors and save an arrow file to disk

I'm currently playing with Apache Arrow's java API (though I use it from Scala for the code samples) to get some familiarity with this tool. As an exercise, I chose to load a CSV file into arrow vectors and then to save these to an arrow file. The…
Shastick
  • 1,218
  • 1
  • 12
  • 29
5
votes
1 answer

Deterministic random number generation in duckdb with dplyr syntax

How can I use duckdb's setseed() function (see reference doc) with dplyr syntax to make sure the analysis below is reproducible? # dplyr version 1.1.1 # arrow version 11.0.0.3 # duckdb 0.7.1.1 out_dir <- tempfile() arrow::write_dataset(mtcars,…
Ashirwad
  • 1,890
  • 1
  • 12
  • 14
5
votes
1 answer

Pass Arrow data from Node.js to Rust without copy

What is the best way to pass data in the Apache Arrow format from Node.js to Rust? Storing the data in each language is easy enough, but it's sharing the memory that is giving me challenges. I'm using Napi-rs to generate the Node.js API…
lostAstronaut
  • 1,331
  • 5
  • 20
  • 34