Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the official Apache Arrow installation documentation.

595 questions
3
votes
1 answer

Is there a way to read a nested column?

I have a bunch of Newline-delimited JSON files that I want to read into R using the arrow package. One of the parameters in the file is nested. The potential nested values are quite big and messy and I would prefer to only select the nested…
Mike.Gahan
  • 4,565
  • 23
  • 39
3
votes
1 answer

Can I stream data into a partitioned parquet (arrow) dataset from a database or another file?

I work with tables that are tens or hundreds of gigabytes in size. They are in a postgres database. I have also dumped them to CSV files. I would like to build a partitioned parquet dataset. I wrote a script that does exactly what I want for a small…
abalter
  • 9,663
  • 17
  • 90
  • 145
3
votes
1 answer

Export a SQLite table to Apache parquet without creating a dataframe

I have multiple huge CSV files that I have to export in Apache Parquet format and split into smaller files based on multiple criteria/keys (= column values). As I understand, Apache Arrow is the R package that allows working with Apache…
user17911
  • 1,073
  • 1
  • 8
  • 18
3
votes
1 answer

What is the best way to send Arrow data to the browser?

I have Apache Arrow data on the server (Python) and need to use it in the browser. It appears that Arrow Flight isn't implemented in JS. What are the best options for sending the data to the browser and using it there? I don't even need it…
Brian
  • 563
  • 1
  • 5
  • 15
3
votes
2 answers

rlang::hash cannot differentiate between arrow queries

I use the memoise package to cache queries to an arrow dataset but I sometimes get mismatches/"collisions" in hashes and therefore the wrong values are returned. I have isolated the problem and replicated it in the MWE below. The issue is that the…
David
  • 9,216
  • 4
  • 45
  • 78
3
votes
1 answer

Proper way to update an arrow dataset in R

I would like to know if there is a good practice to update an arrow dataset. Imagine I have data that I first write as follows: suppressMessages(library(dplyr)) suppressMessages(library(arrow)) td <- tempdir() head(mtcars) #> mpg…
Philippe Massicotte
  • 1,321
  • 12
  • 24
3
votes
0 answers

How to write a uint64 to a parquet file with logical type DECIMAL

How do I write a uint64_t value with a logical type of DECIMAL(30, 0) and physical type of FIXED_LEN_BYTE_ARRAY to a parquet file? I describe my attempt below: Because parquet::StreamWriter requires any FIXED_LEN_BYTE_ARRAY columns to have a…
1step1leap
  • 535
  • 1
  • 6
  • 11
3
votes
1 answer

Write Delta Encoded Parquet Files

I know that Apache Arrow Parquet can read spec-compliant Delta encoded files, but cannot write them out. I am wondering if there is any commonly used open source C++/Python library that can write out Parquet spec-compliant delta encoding.
cogle
  • 997
  • 1
  • 12
  • 25
3
votes
1 answer

apache arrow c++ ParquetFileWriter problem with footer and close

I tried to have my program write out a stream of data in parquet format via apache arrow's StreamWriter. But the output file does not have the metadata footer. When trying to read in the parquet using python pandas, I get the following error: Invalid:…
michaelgbj
  • 290
  • 1
  • 10
3
votes
2 answers

Trying to save a DataFrame using Arrow.jl gives: ArgumentError: type does not have a definite number of fields. Tuples of tuples of ints

I have a dataframe that I'd like to save using Arrow.write(). I can save a subframe of it by omitting one column. But if I leave the column in, I get this error: ArgumentError: type does not have a definite number of fields The objects in this…
3
votes
1 answer

How to join 2 Arrow tables?

I want to Join two Arrow tables on a common attribute. Does Arrow have some C++ API to achieve the same? I did find something called HashJoin but I am not sure if that can be used to join 2 tables. Any pointers on this would be immensely helpful.
3
votes
1 answer

Creating ArrayBuilders in a Loop

Is there any way to create a dynamic container of arrow::ArrayBuilder objects? Here is an example int main(int argc, char** argv) { std::size_t rowCount = 5; arrow::MemoryPool* pool = arrow::default_memory_pool(); …
Will Ayd
  • 6,767
  • 2
  • 36
  • 39
3
votes
1 answer

Readback KeyValueMetadata from Field and Schema in pyarrow from file written in C++

If I write a simple Parquet file using the script simple-write-parquet.cpp, I expect to have a simple Parquet file with a single column MyInt. The script simple-write-parquet.cpp attempts to add KeyValueMetadata to the field MyInt with some dummy…
dantrim
  • 31
  • 1
3
votes
2 answers

Indexing in datafusion

Context: I am using datafusion to build a data validator for a csv file input. Requirement: I want to add the row number where the error occurred in the output report. In pandas, I have the ability to add a row index which can be used for this purpose. Is there a…
praveent
  • 562
  • 3
  • 10
3
votes
1 answer

How do I debug OverflowError: value too large to convert to int32_t?

What I am trying to do I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have plenty of columns and have a high memory footprint (enough to crash the machine running the job) so I am chunk-reading through the…
alt-f4
  • 2,112
  • 17
  • 49