Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (single instruction, multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details see this

595 questions
0 votes · 0 answers

How to keep trailing zero after reading from arrow files

Using apache-arrow js (https://github.com/apache/arrow/tree/master/js), I can read an Arrow file (or even a Feather file) in just a few lines: const arrow = fs.readFileSync("test.feather"); const table = apArrow.Table.from([arrow]); However, I found…
appletabo · 239
0 votes · 2 answers

Reading Arrow Feather files in GoLang or Javascript

I am looking for a way to read Feather files via GoLang or JavaScript, or some other language that does not require users to do any extra installation. My goal is to provide a user interface to read a Feather csv file and convert it back…
appletabo · 239
0 votes · 1 answer

Apache Arrow writing nested types in parquet with C++

Unfortunately I found no C++ example that writes nested types such as maps into Parquet with Apache Arrow. The creation of the schema is clear, but not the Arrow table creation part. Does anybody have a hint or a link to an example? Many thanks in advance!
0 votes · 1 answer

How can I achieve predicate pushdown when using PyArrow + Parquet + Google Cloud Storage?

What I'm really trying to do is this (in Python): import pyarrow.parquet as pq # Note the 'columns' predicate... table = pq.read_table('gs://my_bucket/my_blob.parquet', columns=['a', 'b', 'c']) First, I don't think that gs:// is supported in…
user5406764 · 1,627
0 votes · 0 answers

PyArrow: mmap-backed pass-through array?

In pyarrow, what is a proper way to construct an mmap-backed pass-through array, meaning: to have a fixed-size, fixed-schema pyarrow.Array backed by a buffer, which is based on a pyarrow.MemoryMappedFile, such that changes to the array are directly…
Andrei Pozolotin · 897
0 votes · 1 answer

How to specify which columns to load in pyarrow.dataset

I am trying to get only the columns that I want, like we do in pandas. use_cols = ["ArrDelay", "DepDelay"] df = pd.read_csv(path, usecols=use_cols) df Is there a similar option in arrow? dataset = ds.dataset(path, format="csv")
Xion · 319
0 votes · 2 answers

reading partitioned datasets stored as csv with pyarrow.dataset

Is there a way in pyarrow to read in a partitioned dataset consisting of CSV files that do not have column names stored in the first row? What I am trying to do is essentially: from pyarrow import dataset as ds from pyarrow import fs filesystem =…
ira · 2,542
0 votes · 1 answer

Poor performance Arrow Parquet multiple files

After watching the mind-blowing webinar at the RStudio conference here, I was pumped enough to dump an entire SQL Server table to Parquet files. The result was 2886 files (78 entities over 37 months) with around 700 million rows in total. Doing a…
0 votes · 1 answer

Filter expression not supported for Arrow Datasets

I'm using the arrow package in R. I need to filter strings; for example, out of 700 million rows I need to get only those that contain "Walmart", but I get the error below. FileSystemDataset with 2886 Parquet files DatoID: int32 BanktransaksjonID:…
0 votes · 1 answer

Is there a way to get filter out of Arrow::Array and some predicate?

Assume that I have an Arrow::Array (or a DataFrame or ChunkedArray, not important) and some predicate. I want to compute a new Arrow::BooleanArray that stores the result of this predicate applied to each array element. My case is that…
Kirill Lykov · 1,293
0 votes · 0 answers

Could Redis overwrite direct memory used by Java?

I am using the Apache Arrow Java API, which accesses direct memory. I am also using Redis; while the Java API is accessing direct memory, the Redis xstream continues to grow in memory. I found that occasionally Arrow would calculate the wrong result of the following…
Litchy · 623
0 votes · 1 answer

Apache Arrow getting vectors from Java in Python with zero copy

I use the Apache Arrow libraries in Java (arrow-vector, arrow-memory-unsafe) and Python (pyarrow) in different processes. I am trying to implement an in-memory zero-copy DataFrame, but I can't find an appropriate API in the Java libraries to get the memory address of the arrow…
EshtIO · 231
0 votes · 2 answers

How to convert Pandas dataframe to PyArrow table with a union type in the schema?

I have a Pandas dataframe with a column that contains a list of dict/structs. One of the keys (thing in the example below) can have a value that is either an int or a string. Is there a way to define a PyArrow type that will allow this dataframe to…
King Chung Huang · 5,026
0 votes · 1 answer

http request with parquet and pyarrow

I would like to use pyarrow to read/query parquet data from a rest server. At the moment I'm chunking the data, converting to pandas, dumping to json, and streaming the chunks. Like: p = pq.ParquetDataset('/path/to/data.parquet', filters=filter,…
postelrich · 3,274
0 votes · 1 answer

Cannot initialize pyarrow with cpp libraries

How do I install pyarrow so that it uses the C++ or Cython code? I cannot use conda in my project (as the documentation suggests). pa.get_include() '/usr/local/lib/python3.7/site-packages/pyarrow/include' pa.get_libraries() ['arrow', 'arrow_python']
maremare · 414