Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the official Apache Arrow installation documentation.

595 questions
3
votes
1 answer

Is there a way to read a nested column?

I have a bunch of Newline-delimited JSON files that I want to read into R using the arrow package. One of the parameters in the file is nested. The potential nested values are quite big and messy and I would prefer to only select the nested…
Mike.Gahan
  • 4,565
  • 23
  • 39
3
votes
1 answer

Can I stream data into a partitioned parquet (arrow) dataset from a database or another file?

I work with tables that are tens or hundreds of gigabytes in size. They are in a postgres database. I have also dumped them to CSV files. I would like to build a partitioned parquet dataset. I wrote a script that does exactly what I want for a small…
abalter
  • 9,663
  • 17
  • 90
  • 145
3
votes
1 answer

Export a SQLite table to Apache parquet without creating a dataframe

I have multiple huge CSV files that I have to export in Apache Parquet format and split into smaller files based on multiple criteria/keys (= column values). As I understand, Apache Arrow is the R package that allows working with Apache…
user17911
  • 1,073
  • 1
  • 8
  • 18
3
votes
1 answer

What is the best way to send Arrow data to the browser?

I have Apache Arrow data on the server (Python) and need to use it in the browser. It appears that Arrow Flight isn't implemented in JS. What are the best options for sending the data to the browser and using it there? I don't even need it…
Brian
  • 563
  • 1
  • 5
  • 15
3
votes
2 answers

rlang::hash cannot differentiate between arrow queries

I use the memoise package to cache queries to an arrow dataset but I sometimes get mismatches/"collisions" in hashes and therefore the wrong values are returned. I have isolated the problem and replicated it in the MWE below. The issue is that the…
David
  • 9,216
  • 4
  • 45
  • 78
3
votes
1 answer

Proper way to update an arrow dataset in R

I would like to know if there is a good practice to update an arrow dataset. Imagine I have data that I first write as follows: suppressMessages(library(dplyr)) suppressMessages(library(arrow)) td <- tempdir() head(mtcars) #> mpg…
Philippe Massicotte
  • 1,321
  • 12
  • 24
3
votes
0 answers

How to write a uint64 to a parquet file with logical type DECIMAL

How do I write a uint64_t value with a logical type of DECIMAL(30, 0) and physical type of FIXED_LEN_BYTE_ARRAY to a parquet file? I describe my attempt below: Because parquet::StreamWriter requires any FIXED_LEN_BYTE_ARRAY columns to have a…
1step1leap
  • 535
  • 1
  • 6
  • 11
3
votes
1 answer

Write Delta Encoded Parquet Files

I know that Apache Arrow Parquet can read spec-compliant Delta encoded files, but cannot write them out. I am wondering if there is any commonly used open source C++/Python library that can write out Parquet spec-compliant delta encoding.
cogle
  • 997
  • 1
  • 12
  • 25
3
votes
1 answer

apache arrow c++ ParquetFileWriter problem with footer and close

I tried to have my program write out a stream of data in parquet format via apache arrow's StreamWriter. But the output file does not have the metadata footer. When trying to read in the parquet using python pandas, I get the following error: Invalid:…
michaelgbj
  • 290
  • 1
  • 10
3
votes
2 answers

Trying to save a DataFrame using Arrow.jl gives: ArgumentError: type does not have a definite number of fields. Tuples of tuples of ints

I have a dataframe that I'd like to save using Arrow.write(). I can save a subframe of it by omitting one column. But if I leave the column in, I get this error: ArgumentError: type does not have a definite number of fields The objects in this…
3
votes
1 answer

How to join 2 Arrow tables?

I want to Join two Arrow tables on a common attribute. Does Arrow have some C++ API to achieve the same? I did find something called HashJoin but I am not sure if that can be used to join 2 tables. Any pointers on this would be immensely helpful.
3
votes
1 answer

Creating ArrayBuilders in a Loop

Is there any way to create a dynamic container of arrow::ArrayBuilder objects? Here is an example int main(int argc, char** argv) { std::size_t rowCount = 5; arrow::MemoryPool* pool = arrow::default_memory_pool(); …
Will Ayd
  • 6,767
  • 2
  • 36
  • 39
3
votes
1 answer

Readback KeyValueMetadata from Field and Schema in pyarrow from file written in C++

If I write a simple Parquet file using the script simple-write-parquet.cpp, I expect to have a simple Parquet file with a single column MyInt. The script simple-write-parquet.cpp attempts to add KeyValueMetadata to the field MyInt with some dummy…
dantrim
  • 31
  • 1
3
votes
2 answers

Indexing in datafusion

Context: I am using datafusion to build a data validator for a csv file input. Requirement: I want to add the row number where the error occurred in the output report. In pandas, I have the ability to add a row index which can be used for this purpose. Is there a…
praveent
  • 562
  • 3
  • 10
3
votes
1 answer

How do I debug OverflowError: value too large to convert to int32_t?

What I am trying to do I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have plenty of columns and have a high memory footprint (enough to crash the machine running the job) so I am chunk-reading through the…
alt-f4
  • 2,112
  • 17
  • 49