Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the official Apache Arrow installation documentation.

595 questions
0 votes · 1 answer

Parse CSV with far future dates to Parquet

I’m trying to read a CSV into Pandas, and then write it to Parquet. The challenge is that the CSV has a date column with a value of 3000-12-31, and apparently Pandas has no way to store that value as an actual date. Because of that, PyArrow fails to…
Kris Harper · 5,672 · 8 · 51 · 96
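One workaround, sketched here under the assumption of a hypothetical data.csv with a date column: bypass pandas' datetime64[ns] range limit by keeping the values as Python datetime.date objects, which PyArrow stores as date32.

```python
import datetime

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical file and column names.
df = pd.read_csv("data.csv", dtype={"date": str})

# pandas' datetime64[ns] tops out at 2262-04-11, so 3000-12-31 cannot be a
# native pandas datetime; keep the values as Python datetime.date objects.
df["date"] = df["date"].map(lambda s: datetime.datetime.strptime(s, "%Y-%m-%d").date())

# Arrow infers date32 for datetime.date objects, which handles far-future dates.
pq.write_table(pa.Table.from_pandas(df), "data.parquet")
```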
0 votes · 2 answers

Partition by column value in Rust using Arrow/DataFusion/Polars (like Python pandas' groupby)?

I am looking for an equivalent of the convenient Python pandas syntax: # df is a pandas dataframe for fruit, sub_df in df.groupby('fruits'): # Do some stuff with sub_df and fruit It is basically a groupby, where each group can be accessed as a…
Jeremy Cochoy · 2,480 · 2 · 24 · 39
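For reference, a runnable version of the pandas idiom quoted in the question; the DataFrame contents here are made up.

```python
import pandas as pd

# df is a pandas DataFrame, as in the question; the data is invented.
df = pd.DataFrame({"fruits": ["apple", "pear", "apple"], "qty": [1, 2, 3]})

# groupby yields (key, sub-DataFrame) pairs, one per distinct 'fruits' value.
for fruit, sub_df in df.groupby("fruits"):
    print(fruit, sub_df["qty"].sum())
```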
0 votes · 1 answer

Reading a Parquet file in Node.js

I am trying the following code (from a parquetjs-lite sample and Stack Overflow) to read a Parquet file in Node.js: const readParquetFile = async () => { try { // create new ParquetReader that reads from test.parquet let reader = await…
user14013917 · 149 · 1 · 10
0 votes · 1 answer

Is Apache Arrow's Vector.toArray() zero-copy in JS?

Same as the title: is toArray() a zero-copy memory cast, in effect? Is there a way to find out this sort of thing without asking on forums? Thanks.
0 votes · 1 answer

Combining TSV files to create a new TSV for Apache Arrow table

I have two TSV files (header.tsv & data.tsv). header.tsv holds 1000+ column names, and data.tsv holds ~50K records (with NULL column values too). I would like to create a new TSV file (let's say combined.tsv) by appending the data.tsv file to header.tsv.…
Gou7haM · 1 · 2
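A minimal sketch, assuming header.tsv holds a single tab-separated header row: concatenate the files with plain Python, then load the result into an Arrow table using pyarrow's CSV reader with a tab delimiter.

```python
from pyarrow import csv

# Write the header row, then append every data row (file names from the question).
with open("combined.tsv", "w") as out:
    with open("header.tsv") as header:
        out.write(header.read().rstrip("\n") + "\n")
    with open("data.tsv") as data:
        for line in data:
            out.write(line)

# The CSV reader handles TSV once the delimiter is set; empty and NULL
# fields are treated as nulls by default.
table = csv.read_csv("combined.tsv", parse_options=csv.ParseOptions(delimiter="\t"))
```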
0 votes · 0 answers

Need an explanation of the internal workings of the read_table method in pyarrow.parquet

I stored all the required Parquet tables in a Hadoop filesystem, and all these files have a unique path for identification. These paths are pushed into a RabbitMQ queue as JSON and are consumed by the consumer (in CherryPy) for processing. After…
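For context, a sketch of the read path the question describes, using the legacy pyarrow.hdfs API that existed at the time; the host, port, and path are placeholders.

```python
import pyarrow
import pyarrow.parquet as pq

# Placeholder HDFS connection details (legacy pyarrow.hdfs API assumed).
fs = pyarrow.hdfs.connect("namenode", port=8020)

# Internally, read_table opens the path on the given filesystem, parses the
# Parquet footer metadata, then decodes the row groups into an Arrow table.
table = pq.read_table("/warehouse/example.parquet", filesystem=fs)
```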
0 votes · 0 answers

pyarrow 1.0 throws an out-of-memory exception while reading a large number of files using ParquetDataset (works fine with version 0.13)

I have a dataframe split and stored in more than 5000 files. I use ParquetDataset(fnames).read() to load all the files. I updated pyarrow from 0.13.0 to the latest version, 1.0.1, and it started throwing "OSError: Out of memory: malloc of size 131072…
ashish · 1 · 1
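The failing call from the question, plus one common mitigation sketched with made-up file names: read the files one at a time and concatenate, so the dataset does not hold every file's buffers at once.

```python
import pyarrow as pa
import pyarrow.parquet as pq

fnames = ["part-0.parquet", "part-1.parquet"]  # stands in for the 5000+ paths

# The pattern from the question:
# table = pq.ParquetDataset(fnames).read()

# Mitigation sketch: read and concatenate incrementally.
table = pa.concat_tables(pq.read_table(f) for f in fnames)
```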
0 votes · 1 answer

PyArrow not Writing to Feather or Parquet

Looking at the docs for write_feather, I should be able to write an Arrow table as follows: import pyarrow as pa import pyarrow.feather as fe fe.write_feather( pa.Table.from_arrays([ pa.array([1,2,3]) ], names=['value']), 'file.feather' ) But…
fny · 31,255 · 16 · 96 · 127
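The call from the question, laid out readably; pyarrow.feather.write_feather does accept a pyarrow.Table in recent releases, so a version mismatch is worth checking.

```python
import pyarrow as pa
import pyarrow.feather as fe

# Build a one-column table and write it to Feather.
table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=["value"])
fe.write_feather(table, "file.feather")

# Round-trip check.
print(fe.read_table("file.feather"))
```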
0 votes · 1 answer

coerce_timestamps does not work in write_parquet in the Apache Arrow R package

I am trying to truncate timestamps to milliseconds when writing a Parquet file. With: tutu <- as.POSIXct("2020/06/03 18:00:00", tz = "UTC") if I do: write_parquet(data.frame(tutu), "~/Downloads/tutu.test.parquet") I get 1591207200000000. If I…
slim · 3 · 2
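For comparison, the same knob on the Python side; this is a sketch assuming pyarrow, where write_table pairs coerce_timestamps with allow_truncated_timestamps (truncating precision raises an error otherwise).

```python
import datetime

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"tutu": [datetime.datetime(2020, 6, 3, 18, 0, 0)]})

# Coerce timestamps to millisecond resolution on write; without
# allow_truncated_timestamps, losing sub-millisecond digits is an error.
pq.write_table(table, "tutu.test.parquet",
               coerce_timestamps="ms",
               allow_truncated_timestamps=True)
```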
0 votes · 1 answer

Convert Parquet to JSON in C++ with Apache Arrow?

I can read a Parquet file into an arrow::Table in C++ now, but I have no idea how to convert it to a JSON file. Is there any example of doing that with Apache Arrow, or something else that is convenient? Thanks~
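I am not aware of a single-call Parquet-to-JSON writer in Arrow's C++ API; if the language requirement is flexible, here is a Python sketch of the same conversion via pandas, with placeholder file names.

```python
import pyarrow.parquet as pq

# Placeholder file names; one JSON object per row via pandas.
table = pq.read_table("input.parquet")
table.to_pandas().to_json("output.json", orient="records")
```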
0 votes · 1 answer

How to read a Parquet "group"/list field in C++?

I am new to C++. I have some Parquet with part of its schema like this: optional binary one (UTF8); optional group two (LIST) { repeated group list { optional binary element (UTF8); } } optional binary three (UTF8); I'm using…
Alex Moore-Niemi · 2,913 · 2 · 24 · 22
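For comparison, a Python sketch that round-trips the schema shape in the question (a string column, a LIST of strings, another string column); the column names mirror the question's schema.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Mirror the question's schema: string, list<string>, string.
table = pa.table({
    "one": ["a"],
    "two": [["x", "y"]],   # becomes the LIST group in the Parquet schema
    "three": ["b"],
})
pq.write_table(table, "nested.parquet")

# Column "two" comes back as a ListArray of strings.
print(pq.read_table("nested.parquet").column("two"))
```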
0 votes · 1 answer

Apache Arrow array builder UnsafeAppend

I am working with the array builder UnsafeAppend API. According to the code in the documentation: arrow::Int64Builder builder; // Make place for 8 values in…
0 votes · 1 answer

How to append to an existing Apache Arrow array

I can create an Arrow array with a builder: extern crate arrow; use arrow::array::Int16Array; // Create a new builder with a capacity of 100 let mut builder = Int16Array::builder(100); // Append a slice of primitive…
ritchie46 · 10,405 · 1 · 24 · 43
0 votes · 1 answer

How can I get the last value of a repeated field in each row of a Parquet file in Apache Arrow?

Assume that I am doing something with each row of a Parquet file, and each row has a field named myList, which is a repeated string field. How can I get the last value in myList for each row? This example uses a vector to store all the values. Is there…
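The question likely concerns the C++ reader, but the idea is easy to sketch in Python with a pyarrow list column; the myList contents here are invented.

```python
import pyarrow as pa

# A list<string> column standing in for the repeated myList field.
col = pa.array([["a", "b"], ["c"], ["d", "e", "f"]])

# Last element of each row's list, guarding against empty lists.
last_values = [row[-1] if row else None for row in col.to_pylist()]
print(last_values)  # ['b', 'c', 'f']
```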
0 votes · 1 answer

Apache Arrow with TensorFlow: Type error: Arrow type mismatch: expected dtype=2, but got dtype=9

I am learning Arrow combined with TensorFlow. Following this blog, I wrote an MNIST example. My question is why it is necessary to preprocess the data; otherwise, it reports an error:…
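A hint at what the message encodes: in TensorFlow's DataType enum, 2 is DT_DOUBLE and 9 is DT_INT64, so the dataset was declared with float64 output while the Arrow data carried int64. A sketch of the usual preprocessing fix, with a made-up frame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the MNIST frame in the blog example.
df = pd.DataFrame({"label": np.arange(10, dtype=np.int64)})

# Cast int64 -> float64 so the Arrow column type matches the output_types
# declared for the TensorFlow dataset.
df = df.astype({"label": np.float64})
print(df.dtypes)
```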