Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (single instruction, multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details see this

595 questions
0 votes · 0 answers

How to keep trailing zero after reading from arrow files

Using apache-arrow js (https://github.com/apache/arrow/tree/master/js), I can read an Arrow file (or even a Feather file) in just a few lines: const arrow = fs.readFileSync("test.feather"); const table = apArrow.Table.from([arrow]); However, I found…
appletabo · 239
0 votes · 2 answers

Reading Arrow Feather files in GoLang or Javascript

I am looking for a way to read Feather files via GoLang or JavaScript, or some other language that does not require users to do any extra installation. My goal is to provide a user interface to read a Feather csv file and convert it back…
appletabo · 239
0 votes · 1 answer

Apache Arrow writing nested types in parquet with C++

Unfortunately I found no C++ example that writes nested types such as maps into Parquet with Apache Arrow. The creation of the schema is clear, but not the Arrow table creation part. Does anybody have a hint or a link to an example? Many thanks in advance!
0 votes · 1 answer

How can I achieve predicate pushdown when using PyArrow + Parquet + Google Cloud Storage?

What I'm really trying to do is this (in Python): import pyarrow.parquet as pq # Note the 'columns' predicate... table = pq.read_table('gs://my_bucket/my_blob.parquet', columns=['a', 'b', 'c']) First, I don't think that gs:// is supported in…
user5406764 · 1,627
0 votes · 0 answers

PyArrow: mmap-backed pass-through array?

In pyarrow, what is a proper way to construct an mmap-backed pass-through array, meaning: to have a fixed-size, fixed-schema pyarrow.Array backed by a buffer, which is based on a pyarrow.MemoryMappedFile, such that changes to the array are directly…
Andrei Pozolotin · 897
0 votes · 1 answer

How to specify which columns to load in pyarrow.dataset

I am trying to get only the columns that I want, like we do in pandas. use_cols = ["ArrDelay", "DepDelay"] df = pd.read_csv(path, usecols=use_cols) df Is there a similar option in arrow? dataset = ds.dataset(path, format="csv")
Xion · 319
0 votes · 2 answers

reading partitioned datasets stored as csv with pyarrow.dataset

Is there a way in pyarrow to read in a partitioned dataset consisting of CSV files that do not have column names stored in the first row? What I am trying to do is essentially: from pyarrow import dataset as ds from pyarrow import fs filesystem =…
ira · 2,542
0 votes · 1 answer

Poor performance Arrow Parquet multiple files

After watching the mind-blowing webinar at the RStudio conference here, I was pumped enough to dump an entire SQL Server table to Parquet files. The result was 2886 files (78 entities over 37 months) with around 700 million rows in total. Doing a…
0 votes · 1 answer

Filter expression not supported for Arrow Datasets

I'm using the arrow package in R. I need to filter strings; for example, out of 700 million rows I need to get only those that contain "Walmart", but I get the error below. FileSystemDataset with 2886 Parquet files DatoID: int32 BanktransaksjonID:…
0 votes · 1 answer

Is there a way to get filter out of Arrow::Array and some predicate?

Assume that I have an Arrow::Array (or a DataFrame or ChunkedArray, not important) and some predicate. I want to compute a new Arrow::BooleanArray that stores the result of this predicate applied to each array element. My case is that…
Kirill Lykov · 1,293
0 votes · 0 answers

Could Redis overwrite direct memory used by Java?

I am using the Apache Arrow Java API, which accesses direct memory. I am also using Redis; while the Java API is accessing direct memory, the Redis xstream continues to grow in memory. I found that occasionally Arrow would calculate the wrong result of the following…
Litchy · 623
0 votes · 1 answer

Apache Arrow getting vectors from Java in Python with zero copy

I use the Apache Arrow libraries in Java (arrow-vector, arrow-memory-unsafe) and Python (pyarrow) in different processes. I am trying to implement an in-memory zero-copy DataFrame, but I can't find an appropriate API in the Java libraries to get the memory address of the arrow…
EshtIO · 231
0 votes · 2 answers

How to convert Pandas dataframe to PyArrow table with a union type in the schema?

I have a Pandas dataframe with a column that contains a list of dict/structs. One of the keys (thing in the example below) can have a value that is either an int or a string. Is there a way to define a PyArrow type that will allow this dataframe to…
King Chung Huang · 5,026
0 votes · 1 answer

http request with parquet and pyarrow

I would like to use pyarrow to read/query parquet data from a rest server. At the moment I'm chunking the data, converting to pandas, dumping to json, and streaming the chunks. Like: p = pq.ParquetDataset('/path/to/data.parquet', filters=filter,…
postelrich · 3,274
0 votes · 1 answer

Cannot initialize pyarrow with cpp libraries

How do I install pyarrow so that it uses the C++ or Cython code? I cannot use conda in my project (as the documentation suggests). pa.get_include() '/usr/local/lib/python3.7/site-packages/pyarrow/include' pa.get_libraries() ['arrow', 'arrow_python']
maremare · 414