Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (single instruction, multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the official Apache Arrow installation guide.

595 questions
1
vote
2 answers

How to implement a modulo operation with the PyArrow Expression API so that I can use it in a filter?

I want to shard an Arrow Dataset. To achieve that, I'd like to use a monotonically increasing field and implement the sharding operation as the following filter, which I can use in a pyarrow Scanner: pc.field('id') % num_shards == shard_id Any ideas on how…
qwertz1123
  • 1,173
  • 10
  • 27
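Depending on the pyarrow version there is no dedicated modulo kernel, but the operation can be rebuilt from integer division, since pc.divide truncates for integer inputs. A minimal sketch, assuming pyarrow.compute kernels accept Expression arguments (pyarrow >= 7) and a hypothetical dataset path:

```python
import pyarrow.compute as pc
import pyarrow.dataset as ds

num_shards, shard_id = 4, 1
id_field = pc.field("id")

# id % num_shards rebuilt as id - (id // num_shards) * num_shards;
# for integer columns pc.divide performs integer division.
quotient = pc.divide(id_field, num_shards)
remainder = pc.subtract(id_field, pc.multiply(quotient, num_shards))

dataset = ds.dataset("data/", format="parquet")  # hypothetical path
scanner = dataset.scanner(filter=pc.equal(remainder, shard_id))
```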
1
vote
0 answers

How to read dictionary_page values in a parquet file?

I saw dictionary_page_offset and has_dictionary_page in the documentation but couldn't find a way to read the dictionary values. I'm not sure my current code is the right way: for group in range(pq_file.metadata.num_row_groups): for col in…
naikordian
  • 11
  • 1
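pyarrow has no public API for raw dictionary pages; one indirect route is to keep the column dictionary-encoded on read and inspect the resulting DictionaryArray. A sketch, with the file and column names as placeholders:

```python
import pyarrow.parquet as pq

# Keep "my_col" dictionary-encoded instead of decoding it to plain values.
table = pq.read_table("data.parquet", read_dictionary=["my_col"])  # hypothetical names

for chunk in table.column("my_col").chunks:
    # Each chunk is a DictionaryArray; .dictionary holds the distinct
    # values that were stored in the column's dictionary page(s).
    print(chunk.dictionary)
```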
1
vote
1 answer

Check if an arrow Array created from an atomic vector makes a copy of that vector

I am trying to check whether creating an arrow Array from an R object makes a copy or not. I create an array and then create an atomic vector back from that array, but the memory address doesn't seem to be the same… Is there something I am doing…
daniellga
  • 1,142
  • 6
  • 16
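The question concerns the R bindings, but the underlying check is the same in any binding: compare the address of the Arrow array's data buffer with the address of the source vector. A Python illustration of that idea (whether a copy actually happens depends on type, nulls, and version):

```python
import numpy as np
import pyarrow as pa

v = np.arange(10, dtype="int64")
arr = pa.array(v)  # may or may not copy, depending on dtype/nulls/version

# buffers()[0] is the validity bitmap (None here); buffers()[1] is the data.
print(hex(v.ctypes.data))             # address of the numpy buffer
print(hex(arr.buffers()[1].address))  # address of the Arrow data buffer
```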
1
vote
0 answers

Polars: how to create a dataframe from arrow RecordBatch?

I am using the Rust arrow library and have a RecordBatch struct. I want to create a Polars dataframe out of it, do some operations in Polars land, and move the result back to a RecordBatch. Since both are Arrow based, I suppose there might be an…
Nikhil Garg
  • 3,944
  • 9
  • 30
  • 37
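The question targets the Rust crates, where the answer depends on the arrow versions the two libraries were compiled against; for illustration, the Python bindings make the round trip explicit via polars.from_arrow:

```python
import pyarrow as pa
import polars as pl

batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3], "b": ["x", "y", "z"]})

df = pl.from_arrow(batch)            # Arrow RecordBatch -> Polars DataFrame
result = df.filter(pl.col("a") > 1)  # operate in Polars land
table = result.to_arrow()           # back to an Arrow Table
```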
1
vote
0 answers

Reading Apache Arrow files in Spark

I am using PySpark and would like to read Apache Arrow files, which have ".arrow" as their extension. Unfortunately, I couldn't find any way to do this; I would be grateful for any help.
Johnas
  • 296
  • 2
  • 5
  • 15
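Spark has no built-in reader for Arrow IPC files; one workaround, viable when a single file fits on the driver, is to read it with pyarrow and hand the result to Spark. A sketch with a hypothetical path:

```python
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Arrow IPC file format with pyarrow on the driver...
with pa.memory_map("data.arrow", "r") as source:  # hypothetical path
    table = pa.ipc.open_file(source).read_all()

# ...then convert via pandas into a Spark DataFrame.
df = spark.createDataFrame(table.to_pandas())
```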
1
vote
1 answer

PyArrow: How to batch data from mongo into partitioned parquet in S3

I want to be able to archive my data from Mongo into S3. Currently, what I do is: read data from Mongo, convert this into a pyarrow Table, and write to S3. It works for now, but steps 1 and 2 are a bulk operation, so if the result set is huge it…
Jiew Meng
  • 84,767
  • 185
  • 495
  • 805
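One way to avoid the bulk read in the first two steps is to stream the cursor as fixed-size RecordBatches and let pyarrow.dataset.write_dataset consume the iterator (it accepts an iterable of batches when a schema is given). A sketch; the cursor, schema, bucket, and partition column are all assumptions:

```python
import pyarrow as pa
import pyarrow.dataset as ds

schema = pa.schema([("id", pa.int64()), ("day", pa.string())])  # assumed columns

def record_batches(cursor, batch_size=10_000):
    # Accumulate Mongo documents into fixed-size RecordBatches instead
    # of materializing the whole result set in memory.
    rows = []
    for doc in cursor:
        rows.append(doc)
        if len(rows) == batch_size:
            yield pa.RecordBatch.from_pylist(rows, schema=schema)
            rows = []
    if rows:
        yield pa.RecordBatch.from_pylist(rows, schema=schema)

ds.write_dataset(
    record_batches(my_cursor),   # my_cursor: hypothetical pymongo cursor
    "s3://my-bucket/archive",    # hypothetical bucket
    format="parquet",
    schema=schema,
    partitioning=["day"],
)
```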
1
vote
1 answer

How can I write an .arrow/.arrows file with several batches?

As part of my current task, I need to write several batches to a .arrow/.arrows file and then read the data back from it. How can I do that? Right now I am doing something like this: private static VectorSchemaRoot addData(int count) { try…
Don_Quijote
  • 936
  • 3
  • 18
  • 27
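The question is about the Java API (ArrowFileWriter plus VectorSchemaRoot); the general pattern, one writer and one write call per batch, is the same across bindings. A compact Python illustration of the file layout:

```python
import pyarrow as pa

schema = pa.schema([("x", pa.int64())])

# The IPC file format stores any number of record batches after the
# schema; call write_batch once per batch before closing the writer.
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        for start in (0, 100, 200):
            batch = pa.record_batch(
                [pa.array(range(start, start + 100), type=pa.int64())],
                schema=schema,
            )
            writer.write_batch(batch)

# Reading back: the file knows how many batches it holds.
with pa.memory_map("data.arrow", "r") as source:
    print(pa.ipc.open_file(source).num_record_batches)  # 3
```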
1
vote
1 answer

Different results of a full_join in arrow and dplyr

I get different results when using full_join on a tibble and on an arrow_table. Maybe somebody can shed light on what is going on? library(arrow) library(dplyr) xa1 <- arrow_table(x = 1L) xa2 <- arrow_table(x = 2L) x1 <- tibble(x = 1L) x2 <- tibble(x…
Vitalijs
  • 938
  • 7
  • 18
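For reference, the expected full-join semantics can be checked directly against the Arrow compute layer; a pyarrow sketch of a full outer join on the same two one-row tables (Table.join needs pyarrow >= 7):

```python
import pyarrow as pa

t1 = pa.table({"x": [1]})
t2 = pa.table({"x": [2]})

# A full outer join keeps keys from both sides: rows for x = 1 and x = 2.
joined = t1.join(t2, keys="x", join_type="full outer")
print(joined.sort_by("x"))
```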
1
vote
0 answers

Apache Arrow for Trie Data

I am looking into using Arrow for read-heavy operations on trie data structures. I'm slightly hesitant about using Arrow since I can't really see a natural representation of the data in terms of columns. Specifically, the data I work with can be viewed…
Ian_L
  • 93
  • 4
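A trie can still be flattened into columns by storing one row per node; the layout below (an assumption for illustration, not an established Arrow pattern) keeps a parent index, the edge label, and a terminal flag:

```python
import pyarrow as pa

# One row per trie node; the root is row 0 with parent = -1.
# This table encodes the words "to" and "tea".
nodes = pa.table({
    "parent":   pa.array([-1, 0, 1, 1, 3], type=pa.int32()),
    "label":    pa.array([None, "t", "o", "e", "a"]),
    "terminal": pa.array([False, False, True, False, True]),
})
print(nodes)
```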
1
vote
0 answers

Apache Arrow Java: Parquet to CSV conversion and vice versa

I am looking for an example of converting a Parquet file to CSV and vice versa. I am missing things like how to set the delimiter (comma or tab), how to specify date and timestamp formats (when reading from CSV), and things like that. I looked at arrow…
YaOg
  • 1,748
  • 5
  • 24
  • 43
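The question asks about the Java libraries; for comparison, these are the delimiter and timestamp knobs in pyarrow (ParseOptions/ConvertOptions on read, WriteOptions on write; the write-side delimiter requires a recent pyarrow):

```python
import pyarrow.csv as pcsv
import pyarrow.parquet as pq

# CSV -> Parquet: set the delimiter and how timestamps are parsed.
table = pcsv.read_csv(
    "in.tsv",  # hypothetical file
    parse_options=pcsv.ParseOptions(delimiter="\t"),
    convert_options=pcsv.ConvertOptions(timestamp_parsers=["%Y-%m-%d %H:%M:%S"]),
)
pq.write_table(table, "out.parquet")

# Parquet -> CSV: write back out, tab-separated.
pcsv.write_csv(
    pq.read_table("out.parquet"),
    "out.csv",
    write_options=pcsv.WriteOptions(delimiter="\t"),  # recent pyarrow only
)
```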
1
vote
1 answer

pyarrow pq.ParquetFile and related functions throw OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit error

As part of an analysis pipeline I am using around 60,000 parquet files, each containing one line of data, which must be concatenated. Each file can contain a different set of columns, and I need to make them uniform before concatenating them with Ray…
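Recent pyarrow releases expose the thrift limits behind this error; raising them when opening the file is one workaround (parameter names are version-dependent, roughly pyarrow >= 10):

```python
import pyarrow.parquet as pq

# Raise the thrift deserialization limits that trigger
# "Couldn't deserialize thrift: TProtocolException: Exceeded size limit".
pf = pq.ParquetFile(
    "part-00000.parquet",                # hypothetical file
    thrift_string_size_limit=1 << 30,    # bytes; defaults are far lower
    thrift_container_size_limit=1 << 30,
)
table = pf.read()
```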
1
vote
0 answers

R arrow, open_dataset %>% select(myvars) %>% collect causing memory leak

UPDATE: cross-posted as an arrow bug on JIRA (fingers crossed for arrow developer help soon). I am having trouble using arrow in R. First, I saved some data.tables (d) that were about 50-60 GB in memory to a parquet file using: d %>% write_dataset(f,…
LucasMation
  • 2,408
  • 2
  • 22
  • 45
1
vote
1 answer

R+arrow: Error when using the dataset API

Please have a look at the snippet at the end of the file. I am taking my first baby steps with arrow and R to deal with files which are too large to be loaded into memory. I am trying to reproduce the steps…
larry77
  • 1,309
  • 14
  • 29
1
vote
1 answer

Very strange error when writing the arrow table to disk

I am learning Apache Arrow for R and ran into the following issue. My dataset has 85+ million rows, which makes Arrow really useful. I do the following very simple steps: open the existing dataset as an arrow table Sales_data <-…
grislepak
  • 31
  • 3
1
vote
1 answer

R arrow schema update

I have multiple .csv files that I am trying to read with arrow::open_dataset(), but it is throwing an error due to column type inconsistency. I found an existing question mostly related to my problem, but I am trying a slightly different approach. I want…
Matthew Son
  • 1,109
  • 8
  • 27
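The question is about the R bindings (arrow::open_dataset() with an explicit schema); as an illustration of the same idea, pinning one schema in pyarrow stops the CSV column types from being inferred per file. Column names and path are assumptions:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Declare the column types once instead of letting each CSV be inferred.
schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])  # assumed

dataset = ds.dataset("csv_dir/", format="csv", schema=schema)  # hypothetical
table = dataset.to_table()
```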