Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the official Apache Arrow installation instructions.

595 questions
7
votes
1 answer

How reproducible / deterministic is Parquet format?

I'm seeking advice from people deeply familiar with the binary layout of Apache Parquet: Given a data transformation F(a) = b, where F is fully deterministic, and the exact same versions of the entire software stack (framework, arrow & parquet…
sergey
  • 73
  • 4
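
A minimal way to probe byte-level reproducibility is to write the same table twice with pinned writer options and compare digests; the file names and option values below are illustrative, and settings such as compression, row-group size, and writer metadata all influence the byte layout.

    import hashlib
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

    def sha256_of(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    # Write the identical table twice with identical, pinned options.
    for path in ("run_a.parquet", "run_b.parquet"):
        pq.write_table(table, path, compression="snappy", row_group_size=1024)

    # Equal digests mean the layout was reproducible for this particular stack.
    print(sha256_of("run_a.parquet") == sha256_of("run_b.parquet"))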
7
votes
1 answer

How to read/write partitioned Apache Arrow or Parquet files into/out of Julia

I am trying to read and write a trivial dataset in Julia. The dataset is mtcars, taken from R, with an arbitrarily added column bt of random Boolean values. The file/folder structure (below) was written out using the R arrow package. The files…
tinker
  • 96
  • 2
7
votes
2 answers

Proper Syntax for Filtering Expressions for Arrow Datasets in R

I am attempting to use the arrow package's (relatively recently implemented) Dataset API to read a directory of files into memory and leverage the C++ back-end to filter rows and columns. I would like to use the arrow package functions directly,…
Matt Summersgill
  • 4,054
  • 18
  • 47
7
votes
2 answers

AWS EMR - ModuleNotFoundError: No module named 'pyarrow'

I am running into this problem with the Apache Arrow Spark integration, using AWS EMR with Spark 2.4.3. I tested this on both a local single-machine Spark instance and a Cloudera cluster, and everything works fine there. I set these in spark-env.sh: export…
thePurplePython
  • 2,621
  • 1
  • 13
  • 34
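
The usual cause on EMR is that pyarrow is installed on the driver but not on the worker nodes, so a quick sanity check is to force every partition to import it. This is a sketch assuming a running Spark 2.4 session; it uses the spark.sql.execution.arrow.enabled key from that release line.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # Spark 2.4.x key

    def arrow_version(_):
        # Raises ModuleNotFoundError on any executor missing pyarrow.
        import pyarrow
        return pyarrow.__version__

    versions = (
        spark.sparkContext
        .parallelize(range(100), 100)
        .map(arrow_version)
        .distinct()
        .collect()
    )
    print(versions)  # ideally a single version, matching the driver

If this check fails, pyarrow needs to be installed on every node, e.g. via an EMR bootstrap action, rather than only on the master.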
7
votes
2 answers

Datatype issues when converting parquet data to a pandas dataframe

I have a problem with data types when converting a parquet file to a dataframe. I do bucket = 's3://some_bucket/test/usages' import pyarrow.parquet as pq import s3fs s3 = s3fs.S3FileSystem() read_pq = pq.ParquetDataset(bucket,…
clog14
  • 1,549
  • 1
  • 16
  • 32
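
One way to see where the types change is to inspect the Arrow schema before calling to_pandas(): integer columns containing nulls are widened to float64 and string columns come back as object dtype during the conversion. A small sketch; the bucket path is the placeholder from the question.

    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()
    dataset = pq.ParquetDataset("s3://some_bucket/test/usages", filesystem=fs)
    table = dataset.read()

    print(table.schema)      # types as stored in Parquet
    df = table.to_pandas()   # pandas may widen types, e.g. int64 -> float64 with nulls
    print(df.dtypes)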
7
votes
1 answer

Convert Pandas DataFrame to & from In-Memory Feather

Using the IO tools in pandas it is possible to convert a DataFrame to an in-memory feather buffer: import pandas as pd from io import BytesIO df = pd.DataFrame({'a': [1,2], 'b': [3.0,4.0]}) buf = BytesIO() df.to_feather(buf) However, using…
Ramón J Romero y Vigil
  • 17,373
  • 7
  • 77
  • 125
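
A round trip through an in-memory buffer can be completed with pyarrow.feather.read_feather after rewinding the buffer; this is a sketch of one workable pairing, not the only one.

    from io import BytesIO

    import pandas as pd
    import pyarrow.feather as feather

    df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

    buf = BytesIO()
    df.to_feather(buf)   # write Feather bytes into the in-memory buffer
    buf.seek(0)          # rewind before reading back

    df_back = feather.read_feather(buf)
    print(df_back.equals(df))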
7
votes
1 answer

PySpark: Invalid returnType with scalar Pandas UDFs

I'm trying to return a specific structure from a pandas_udf. It worked on one cluster but fails on another. I run the UDF on groups, which requires the return type to be a data frame. from pyspark.sql.functions import pandas_udf import pandas…
Omri374
  • 2,555
  • 3
  • 26
  • 40
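
A UDF that is applied per group and returns a data frame must be declared as a grouped-map pandas UDF with a StructType schema; declaring it as a scalar UDF is one common way to hit the "Invalid returnType" error. A minimal Spark 2.x-style sketch with made-up data (Spark 3 offers applyInPandas instead):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()
    sdf = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v"])

    @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
    def demean(pdf):
        # Receives one group as a pandas DataFrame and must return a DataFrame
        # matching the declared schema.
        return pdf.assign(v=pdf.v - pdf.v.mean())

    sdf.groupby("id").apply(demean).show()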
6
votes
1 answer

How can I chunk through a CSV using Arrow?

What I am trying to do I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have plenty of columns and have a high memory footprint (enough to crash the machine running the job). I am trying to chunk through the…
alt-f4
  • 2,112
  • 17
  • 49
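
One approach is to stream the CSV in record batches and append each batch to a ParquetWriter, so peak memory stays near the size of a single batch. A sketch with placeholder file names:

    import pyarrow as pa
    import pyarrow.csv as csv
    import pyarrow.parquet as pq

    reader = csv.open_csv("big_input.csv")   # streaming reader, yields RecordBatches
    writer = None
    try:
        for batch in reader:
            if writer is None:
                # Create the Parquet file lazily, once the schema is known.
                writer = pq.ParquetWriter("output.parquet", batch.schema)
            writer.write_table(pa.Table.from_batches([batch]))
    finally:
        if writer is not None:
            writer.close()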
6
votes
1 answer

How to pass array column as argument in VectorUdf in .Net Spark?

I'm trying to implement a vector UDF in C# Spark. I have created the .NET Spark environment by following Spark .NET. Vector UDFs (both Apache Arrow and Microsoft.Data.Analysis) worked for me for an IntegerType column. Now I'm trying to send the integer array…
6
votes
1 answer

How does apache arrow facilitate "No overhead for cross-system communication"?

I've been very interested in Apache Arrow for a bit now due to the promises of "zero copy reads", "zero serde", and "No overhead for cross-system communication". My understanding of the project (through the lens of pyarrow) is that it describes the…
kemri
  • 149
  • 12
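
One concrete illustration of the zero-copy claim is reading an Arrow IPC file through a memory map: the resulting table's buffers reference the mapped file rather than freshly allocated memory. A small sketch with illustrative file names:

    import pyarrow as pa
    import pyarrow.ipc as ipc

    table = pa.table({"x": list(range(1000))})

    # Write the table once in the Arrow IPC file format.
    with pa.OSFile("data.arrow", "wb") as sink:
        with ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Memory-map and read back: column buffers point into the mapped file.
    source = pa.memory_map("data.arrow", "r")
    loaded = ipc.open_file(source).read_all()
    print(loaded.num_rows, pa.total_allocated_bytes())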
6
votes
1 answer

How to convert arrow::Array to std::vector?

I have an Apache Arrow array that is created by reading a file. std::shared_ptr<arrow::Array> array; PARQUET_THROW_NOT_OK(reader->ReadColumn(0, &array)); Is there a way to convert it to std::vector or any other native array type in C++?
motam79
  • 3,542
  • 5
  • 34
  • 60
6
votes
0 answers

How to write a simple, unwrapped, byte array to an Apache-Arrow ListWriter

I'm currently writing some code to convert an arbitrary data structure to Apache Arrow vectors and got stuck on something relatively simple, namely, how to write a byte[] to a ListVector. When writing data to a ListVector through a…
Shastick
  • 1,218
  • 1
  • 12
  • 29
6
votes
1 answer

How to load a CSV file into Apache Arrow vectors and save an arrow file to disk

I'm currently playing with Apache Arrow's java API (though I use it from Scala for the code samples) to get some familiarity with this tool. As an exercise, I chose to load a CSV file into arrow vectors and then to save these to an arrow file. The…
Shastick
  • 1,218
  • 1
  • 12
  • 29
5
votes
1 answer

Deterministic random number generation in duckdb with dplyr syntax

How can I use duckdb's setseed() function (see reference doc) with dplyr syntax to make sure the analysis below is reproducible? # dplyr version 1.1.1 # arrow version 11.0.0.3 # duckdb 0.7.1.1 out_dir <- tempfile() arrow::write_dataset(mtcars,…
Ashirwad
  • 1,890
  • 1
  • 12
  • 14
5
votes
1 answer

Pass Arrow data from Node.js to Rust without copy

What is the best way to pass data in the Apache Arrow format from Node.js to Rust? Storing the data in each language is easy enough, but it's sharing the memory that is giving me challenges. I'm using Napi-rs to generate the Node.js API…
lostAstronaut
  • 1,331
  • 5
  • 20
  • 34