Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (single instruction, multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the official Apache Arrow installation guide.

595 questions
1
vote
2 answers

How to implement a modulo operation with the PyArrow Expression API so that I can use it in a filter?

I want to shard an Arrow Dataset. To achieve that, I'd like to use a monotonically increasing field and implement the sharding operation as the following filter, which I can use in a pyarrow Scanner: pc.field('id') % num_shards == shard_id Any ideas on how…
qwertz1123
  • 1,173
  • 10
  • 27
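Depending on the pyarrow version there is no dedicated modulo kernel, but the operation can be rebuilt from integer division, since pc.divide truncates for integer inputs. A minimal sketch, assuming pyarrow.compute kernels accept Expression arguments (pyarrow >= 7) and a hypothetical dataset path:

```python
import pyarrow.compute as pc
import pyarrow.dataset as ds

num_shards, shard_id = 4, 1
id_field = pc.field("id")

# id % num_shards rebuilt as id - (id // num_shards) * num_shards;
# for integer columns pc.divide performs integer division.
quotient = pc.divide(id_field, num_shards)
remainder = pc.subtract(id_field, pc.multiply(quotient, num_shards))

dataset = ds.dataset("data/", format="parquet")  # hypothetical path
scanner = dataset.scanner(filter=pc.equal(remainder, shard_id))
```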
1
vote
0 answers

How to read dictionary_page values in a parquet file?

I saw dictionary_page_offset and has_dictionary_page in the documentation but couldn't find a way to read the dictionary values. I'm not sure my current code is the right way: for group in range(pq_file.metadata.num_row_groups): for col in…
naikordian
  • 11
  • 1
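pyarrow has no public API for raw dictionary pages; one indirect route is to keep the column dictionary-encoded on read and inspect the resulting DictionaryArray. A sketch, with the file and column names as placeholders:

```python
import pyarrow.parquet as pq

# Keep "my_col" dictionary-encoded instead of decoding it to plain values.
table = pq.read_table("data.parquet", read_dictionary=["my_col"])  # hypothetical names

for chunk in table.column("my_col").chunks:
    # Each chunk is a DictionaryArray; .dictionary holds the distinct
    # values that were stored in the column's dictionary page(s).
    print(chunk.dictionary)
```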
1
vote
1 answer

Check if an arrow Array created from an atomic vector makes a copy of that vector

I am trying to check whether creating an arrow Array from an R object makes a copy or not. I create an array and then create an atomic vector back from that array, but the memory address doesn't seem to be the same… Is there something I am doing…
daniellga
  • 1,142
  • 6
  • 16
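The question concerns the R bindings, but the underlying check is the same in any binding: compare the address of the Arrow array's data buffer with the address of the source vector. A Python illustration of that idea (whether a copy actually happens depends on type, nulls, and version):

```python
import numpy as np
import pyarrow as pa

v = np.arange(10, dtype="int64")
arr = pa.array(v)  # may or may not copy, depending on dtype/nulls/version

# buffers()[0] is the validity bitmap (None here); buffers()[1] is the data.
print(hex(v.ctypes.data))             # address of the numpy buffer
print(hex(arr.buffers()[1].address))  # address of the Arrow data buffer
```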
1
vote
0 answers

Polars: how to create a dataframe from arrow RecordBatch?

I am using the Rust arrow library and have a RecordBatch struct. I want to create a Polars dataframe out of it, do some operations in Polars land, and move the result back to a RecordBatch. Since both are Arrow based, I suppose there might be an…
Nikhil Garg
  • 3,944
  • 9
  • 30
  • 37
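The question targets the Rust crates, where the answer depends on the arrow versions the two libraries were compiled against; for illustration, the Python bindings make the round trip explicit via polars.from_arrow:

```python
import pyarrow as pa
import polars as pl

batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3], "b": ["x", "y", "z"]})

df = pl.from_arrow(batch)            # Arrow RecordBatch -> Polars DataFrame
result = df.filter(pl.col("a") > 1)  # operate in Polars land
table = result.to_arrow()           # back to an Arrow Table
```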
1
vote
0 answers

Reading Apache Arrow files in Spark

I am using PySpark and would like to read Apache Arrow files, which have ".arrow" as their extension. Unfortunately, I couldn't find any way to do this; I would be grateful for any help.
Johnas
  • 296
  • 2
  • 5
  • 15
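Spark has no built-in reader for Arrow IPC files; one workaround, viable when a single file fits on the driver, is to read it with pyarrow and hand the result to Spark. A sketch with a hypothetical path:

```python
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Arrow IPC file format with pyarrow on the driver...
with pa.memory_map("data.arrow", "r") as source:  # hypothetical path
    table = pa.ipc.open_file(source).read_all()

# ...then convert via pandas into a Spark DataFrame.
df = spark.createDataFrame(table.to_pandas())
```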
1
vote
1 answer

PyArrow: How to batch data from mongo into partitioned parquet in S3

I want to be able to archive my data from Mongo into S3. Currently, what I do is: read data from Mongo, convert this into a pyarrow Table, and write to S3. It works for now, but steps 1 and 2 are a bulk operation, so if the result set is huge it…
Jiew Meng
  • 84,767
  • 185
  • 495
  • 805
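One way to avoid the bulk read in the first two steps is to stream the cursor as fixed-size RecordBatches and let pyarrow.dataset.write_dataset consume the iterator (it accepts an iterable of batches when a schema is given). A sketch; the cursor, schema, bucket, and partition column are all assumptions:

```python
import pyarrow as pa
import pyarrow.dataset as ds

schema = pa.schema([("id", pa.int64()), ("day", pa.string())])  # assumed columns

def record_batches(cursor, batch_size=10_000):
    # Accumulate Mongo documents into fixed-size RecordBatches instead
    # of materializing the whole result set in memory.
    rows = []
    for doc in cursor:
        rows.append(doc)
        if len(rows) == batch_size:
            yield pa.RecordBatch.from_pylist(rows, schema=schema)
            rows = []
    if rows:
        yield pa.RecordBatch.from_pylist(rows, schema=schema)

ds.write_dataset(
    record_batches(my_cursor),   # my_cursor: hypothetical pymongo cursor
    "s3://my-bucket/archive",    # hypothetical bucket
    format="parquet",
    schema=schema,
    partitioning=["day"],
)
```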
1
vote
1 answer

How can I write an .arrow/.arrows file with several batches?

As part of my current task, I need to write several batches to a .arrow/.arrows file and then read the data back from it. How can I do that? Right now I am doing something like this: private static VectorSchemaRoot addData(int count) { try…
Don_Quijote
  • 936
  • 3
  • 18
  • 27
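The question is about the Java API (ArrowFileWriter plus VectorSchemaRoot); the general pattern, one writer and one write call per batch, is the same across bindings. A compact Python illustration of the file layout:

```python
import pyarrow as pa

schema = pa.schema([("x", pa.int64())])

# The IPC file format stores any number of record batches after the
# schema; call write_batch once per batch before closing the writer.
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        for start in (0, 100, 200):
            batch = pa.record_batch(
                [pa.array(range(start, start + 100), type=pa.int64())],
                schema=schema,
            )
            writer.write_batch(batch)

# Reading back: the file knows how many batches it holds.
with pa.memory_map("data.arrow", "r") as source:
    print(pa.ipc.open_file(source).num_record_batches)  # 3
```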
1
vote
1 answer

Different results of a full_join in arrow and dplyr

I get different results when using full_join on a tibble and on an arrow_table. Maybe somebody can shed light on what is going on? library(arrow) library(dplyr) xa1 <- arrow_table(x = 1L) xa2 <- arrow_table(x = 2L) x1 <- tibble(x = 1L) x2 <- tibble(x…
Vitalijs
  • 938
  • 7
  • 18
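For reference, the expected full-join semantics can be checked directly against the Arrow compute layer; a pyarrow sketch of a full outer join on the same two one-row tables (Table.join needs pyarrow >= 7):

```python
import pyarrow as pa

t1 = pa.table({"x": [1]})
t2 = pa.table({"x": [2]})

# A full outer join keeps keys from both sides: rows for x = 1 and x = 2.
joined = t1.join(t2, keys="x", join_type="full outer")
print(joined.sort_by("x"))
```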
1
vote
0 answers

Apache Arrow for Trie Data

I am looking into using Arrow for read-heavy operations on trie data structures. I'm slightly hesitant about using Arrow since I can't really see a natural representation of the data in terms of columns. Specifically, the data I work with can be viewed…
Ian_L
  • 93
  • 4
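A trie can still be flattened into columns by storing one row per node; the layout below (an assumption for illustration, not an established Arrow pattern) keeps a parent index, the edge label, and a terminal flag:

```python
import pyarrow as pa

# One row per trie node; the root is row 0 with parent = -1.
# This table encodes the words "to" and "tea".
nodes = pa.table({
    "parent":   pa.array([-1, 0, 1, 1, 3], type=pa.int32()),
    "label":    pa.array([None, "t", "o", "e", "a"]),
    "terminal": pa.array([False, False, True, False, True]),
})
print(nodes)
```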
1
vote
0 answers

Apache Arrow Java: Parquet to CSV conversion and vice versa

I am looking for an example of converting a Parquet file to CSV and vice versa. I am missing things like how to set the delimiter (comma or tab), how to specify date and timestamp formats (when reading from CSV), and things like that. I looked at arrow…
YaOg
  • 1,748
  • 5
  • 24
  • 43
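The question asks about the Java libraries; for comparison, these are the delimiter and timestamp knobs in pyarrow (ParseOptions/ConvertOptions on read, WriteOptions on write; the write-side delimiter requires a recent pyarrow):

```python
import pyarrow.csv as pcsv
import pyarrow.parquet as pq

# CSV -> Parquet: set the delimiter and how timestamps are parsed.
table = pcsv.read_csv(
    "in.tsv",  # hypothetical file
    parse_options=pcsv.ParseOptions(delimiter="\t"),
    convert_options=pcsv.ConvertOptions(timestamp_parsers=["%Y-%m-%d %H:%M:%S"]),
)
pq.write_table(table, "out.parquet")

# Parquet -> CSV: write back out, tab-separated.
pcsv.write_csv(
    pq.read_table("out.parquet"),
    "out.csv",
    write_options=pcsv.WriteOptions(delimiter="\t"),  # recent pyarrow only
)
```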
1
vote
1 answer

pyarrow pq.ParquetFile and related functions throw OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit error

As part of an analysis pipeline I am using around 60,000 parquet files, each containing one line of data, which must be concatenated. Each file can contain a different set of columns, and I need to make them uniform before concatenating them with Ray…
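Recent pyarrow releases expose the thrift limits behind this error; raising them when opening the file is one workaround (parameter names are version-dependent, roughly pyarrow >= 10):

```python
import pyarrow.parquet as pq

# Raise the thrift deserialization limits that trigger
# "Couldn't deserialize thrift: TProtocolException: Exceeded size limit".
pf = pq.ParquetFile(
    "part-00000.parquet",                # hypothetical file
    thrift_string_size_limit=1 << 30,    # bytes; defaults are far lower
    thrift_container_size_limit=1 << 30,
)
table = pf.read()
```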
1
vote
0 answers

R arrow, open_dataset %>% select(myvars) %>% collect causing memory leak

UPDATE: cross-posted as an arrow bug on JIRA (fingers crossed for arrow developer help soon). I am having trouble using arrow in R. First, I saved some data.tables (d) that were about 50-60 GB in memory to a parquet file using: d %>% write_dataset(f,…
LucasMation
  • 2,408
  • 2
  • 22
  • 45
1
vote
1 answer

R+arrow: Error when using the dataset API

Please have a look at the snippet at the end of the file. I am taking my first baby steps with arrow and R to deal with files which are too large to be loaded into memory. I am trying to reproduce the steps…
larry77
  • 1,309
  • 14
  • 29
1
vote
1 answer

Very strange error when writing the arrow table to disk

I am learning Apache Arrow for R and ran into the following issue. My dataset has 85+ million rows, which makes Arrow really useful. I do the following very simple steps: open the existing dataset as an arrow table Sales_data <-…
grislepak
  • 31
  • 3
1
vote
1 answer

R arrow schema update

I have multiple .csv files that I am trying to read with arrow::open_dataset(), but it is throwing an error due to column type inconsistency. I found an existing question mostly related to my problem, but I am trying a slightly different approach. I want…
Matthew Son
  • 1,109
  • 8
  • 27
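The question is about the R bindings (arrow::open_dataset() with an explicit schema); as an illustration of the same idea, pinning one schema in pyarrow stops the CSV column types from being inferred per file. Column names and path are assumptions:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Declare the column types once instead of letting each CSV be inferred.
schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])  # assumed

dataset = ds.dataset("csv_dir/", format="csv", schema=schema)  # hypothetical
table = dataset.to_table()
```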