Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the official Apache Arrow installation documentation.

595 questions
0 votes · 1 answer

Parse CSV with far future dates to Parquet

I’m trying to read a CSV into Pandas, and then write it to Parquet. The challenge is that the CSV has a date column with a value of 3000-12-31, and apparently Pandas has no way to store that value as an actual date. Because of that, PyArrow fails to…
Kris Harper · 5,672 · 8 · 51 · 96
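One workaround, sketched here under the assumption of a hypothetical data.csv with a date column: bypass pandas' datetime64[ns] range limit by keeping the values as Python datetime.date objects, which PyArrow stores as date32.

```python
import datetime

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical file and column names.
df = pd.read_csv("data.csv", dtype={"date": str})

# pandas' datetime64[ns] tops out at 2262-04-11, so 3000-12-31 cannot be a
# native pandas datetime; keep the values as Python datetime.date objects.
df["date"] = df["date"].map(lambda s: datetime.datetime.strptime(s, "%Y-%m-%d").date())

# Arrow infers date32 for datetime.date objects, which handles far-future dates.
pq.write_table(pa.Table.from_pandas(df), "data.parquet")
```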
0 votes · 2 answers

Partition by column value in Rust using Arrow/DataFusion/Polars (like Python pandas' groupby)?

I am looking for an equivalent of the convenient Python pandas syntax: # df is a pandas dataframe for fruit, sub_df in df.groupby('fruits'): # Do some stuff with sub_df and fruit It is basically a groupby, where each group can be accessed as a…
Jeremy Cochoy · 2,480 · 2 · 24 · 39
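For reference, a runnable version of the pandas idiom quoted in the question; the DataFrame contents here are made up.

```python
import pandas as pd

# df is a pandas DataFrame, as in the question; the data is invented.
df = pd.DataFrame({"fruits": ["apple", "pear", "apple"], "qty": [1, 2, 3]})

# groupby yields (key, sub-DataFrame) pairs, one per distinct 'fruits' value.
for fruit, sub_df in df.groupby("fruits"):
    print(fruit, sub_df["qty"].sum())
```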
0 votes · 1 answer

Reading a Parquet file in Node.js

I am trying the following code (from a parquetjs-lite sample and Stack Overflow) to read a Parquet file in Node.js: const readParquetFile = async () => { try { // create new ParquetReader that reads from test.parquet let reader = await…
user14013917 · 149 · 1 · 10
0 votes · 1 answer

Is Apache Arrow's Vector.toArray() zero-copy in JS?

Same as the title: is toArray() a zero-copy memory cast, in effect? Is there a way to find out this sort of thing without asking on forums? Thanks.
0 votes · 1 answer

Combining TSV files to create a new TSV for Apache Arrow table

I have two TSV files (header.tsv & data.tsv). header.tsv holds 1000+ column names, and data.tsv holds ~50K records (with NULL column values too). I would like to create a new TSV file (let's say combined.tsv) by appending the data.tsv file to header.tsv.…
Gou7haM · 1 · 2
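A minimal sketch, assuming header.tsv holds a single tab-separated header row: concatenate the files with plain Python, then load the result into an Arrow table using pyarrow's CSV reader with a tab delimiter.

```python
from pyarrow import csv

# Write the header row, then append every data row (file names from the question).
with open("combined.tsv", "w") as out:
    with open("header.tsv") as header:
        out.write(header.read().rstrip("\n") + "\n")
    with open("data.tsv") as data:
        for line in data:
            out.write(line)

# The CSV reader handles TSV once the delimiter is set; empty and NULL
# fields are treated as nulls by default.
table = csv.read_csv("combined.tsv", parse_options=csv.ParseOptions(delimiter="\t"))
```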
0 votes · 0 answers

Need an explanation of the internal workings of the read_table method in pyarrow.parquet

I stored all the required Parquet tables in a Hadoop filesystem, and all these files have a unique path for identification. These paths are pushed into a RabbitMQ queue as JSON and are consumed by the consumer (in CherryPy) for processing. After…
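For context, a sketch of the read path the question describes, using the legacy pyarrow.hdfs API that existed at the time; the host, port, and path are placeholders.

```python
import pyarrow
import pyarrow.parquet as pq

# Placeholder HDFS connection details (legacy pyarrow.hdfs API assumed).
fs = pyarrow.hdfs.connect("namenode", port=8020)

# Internally, read_table opens the path on the given filesystem, parses the
# Parquet footer metadata, then decodes the row groups into an Arrow table.
table = pq.read_table("/warehouse/example.parquet", filesystem=fs)
```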
0 votes · 0 answers

pyarrow 1.0 throws an out-of-memory exception while reading a large number of files using ParquetDataset (works fine with version 0.13)

I have a dataframe split and stored in more than 5000 files. I use ParquetDataset(fnames).read() to load all the files. I updated pyarrow from 0.13.0 to the latest version, 1.0.1, and it started throwing "OSError: Out of memory: malloc of size 131072…
ashish · 1 · 1
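The failing call from the question, plus one common mitigation sketched with made-up file names: read the files one at a time and concatenate, so the dataset does not hold every file's buffers at once.

```python
import pyarrow as pa
import pyarrow.parquet as pq

fnames = ["part-0.parquet", "part-1.parquet"]  # stands in for the 5000+ paths

# The pattern from the question:
# table = pq.ParquetDataset(fnames).read()

# Mitigation sketch: read and concatenate incrementally.
table = pa.concat_tables(pq.read_table(f) for f in fnames)
```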
0 votes · 1 answer

PyArrow not Writing to Feather or Parquet

Looking at the docs for write_feather, I should be able to write an Arrow table as follows: import pyarrow as pa import pyarrow.feather as fe fe.write_feather( pa.Table.from_arrays([ pa.array([1,2,3]) ], names=['value']), 'file.feather' ) But…
fny · 31,255 · 16 · 96 · 127
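The call from the question, laid out readably; pyarrow.feather.write_feather does accept a pyarrow.Table in recent releases, so a version mismatch is worth checking.

```python
import pyarrow as pa
import pyarrow.feather as fe

# Build a one-column table and write it to Feather.
table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=["value"])
fe.write_feather(table, "file.feather")

# Round-trip check.
print(fe.read_table("file.feather"))
```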
0 votes · 1 answer

coerce_timestamps does not work in write_parquet in the Apache Arrow R package

I am trying to truncate timestamps to milliseconds when writing a Parquet file. With: tutu <- as.POSIXct("2020/06/03 18:00:00", tz = "UTC") if I do: write_parquet(data.frame(tutu), "~/Downloads/tutu.test.parquet") I get 1591207200000000. If I…
slim · 3 · 2
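For comparison, the same knob on the Python side; this is a sketch assuming pyarrow, where write_table pairs coerce_timestamps with allow_truncated_timestamps (truncating precision raises an error otherwise).

```python
import datetime

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"tutu": [datetime.datetime(2020, 6, 3, 18, 0, 0)]})

# Coerce timestamps to millisecond resolution on write; without
# allow_truncated_timestamps, losing sub-millisecond digits is an error.
pq.write_table(table, "tutu.test.parquet",
               coerce_timestamps="ms",
               allow_truncated_timestamps=True)
```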
0 votes · 1 answer

Convert Parquet to JSON in C++ with Apache Arrow?

I can read a Parquet file into an arrow::Table in C++ now, but I have no idea how to convert it to a JSON file. Is there any example of doing that with Apache Arrow, or something else that is convenient? Thanks~
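I am not aware of a single-call Parquet-to-JSON writer in Arrow's C++ API; if the language requirement is flexible, here is a Python sketch of the same conversion via pandas, with placeholder file names.

```python
import pyarrow.parquet as pq

# Placeholder file names; one JSON object per row via pandas.
table = pq.read_table("input.parquet")
table.to_pandas().to_json("output.json", orient="records")
```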
0 votes · 1 answer

How to read a Parquet "group"/list field in C++?

I am new to C++. I have some Parquet with part of its schema like this: optional binary one (UTF8); optional group two (LIST) { repeated group list { optional binary element (UTF8); } } optional binary three (UTF8); I'm using…
Alex Moore-Niemi · 2,913 · 2 · 24 · 22
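For comparison, a Python sketch that round-trips the schema shape in the question (a string column, a LIST of strings, another string column); the column names mirror the question's schema.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Mirror the question's schema: string, list<string>, string.
table = pa.table({
    "one": ["a"],
    "two": [["x", "y"]],   # becomes the LIST group in the Parquet schema
    "three": ["b"],
})
pq.write_table(table, "nested.parquet")

# Column "two" comes back as a ListArray of strings.
print(pq.read_table("nested.parquet").column("two"))
```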
0 votes · 1 answer

Apache Arrow array builder UnsafeAppend

I am working with the array builder UnsafeAppend API. According to the code in the documentation: arrow::Int64Builder builder; // Make place for 8 values in…
0 votes · 1 answer

How to append to an existing Apache Arrow array

I can create an Arrow array with a builder: extern crate arrow; use arrow::array::Int16Array; // Create a new builder with a capacity of 100 let mut builder = Int16Array::builder(100); // Append a slice of primitive…
ritchie46 · 10,405 · 1 · 24 · 43
0 votes · 1 answer

How can I get the last value of a repeated field in each row of a Parquet file in Apache Arrow?

Assume that I am doing something with each row of a Parquet file, and each row has a field named myList, which is a repeated string field. How can I get the last value in myList for each row? This example uses a vector to store all the values. Is there…
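The question likely concerns the C++ reader, but the idea is easy to sketch in Python with a pyarrow list column; the myList contents here are invented.

```python
import pyarrow as pa

# A list<string> column standing in for the repeated myList field.
col = pa.array([["a", "b"], ["c"], ["d", "e", "f"]])

# Last element of each row's list, guarding against empty lists.
last_values = [row[-1] if row else None for row in col.to_pylist()]
print(last_values)  # ['b', 'c', 'f']
```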
0 votes · 1 answer

Apache Arrow with TensorFlow: Type error: Arrow type mismatch: expected dtype=2, but got dtype=9

I am learning Arrow combined with TensorFlow. Following this blog, I wrote an MNIST example. My question is why it is necessary to preprocess the data; otherwise, it reports an error:…
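A hint at what the message encodes: in TensorFlow's DataType enum, 2 is DT_DOUBLE and 9 is DT_INT64, so the dataset was declared with float64 output while the Arrow data carried int64. A sketch of the usual preprocessing fix, with a made-up frame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the MNIST frame in the blog example.
df = pd.DataFrame({"label": np.arange(10, dtype=np.int64)})

# Cast int64 -> float64 so the Arrow column type matches the output_types
# declared for the TensorFlow dataset.
df = df.astype({"label": np.float64})
print(df.dtypes)
```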