Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the Apache Arrow installation guide: https://arrow.apache.org/install/

595 questions
4 votes, 1 answer

When should I use dictionary encoding in parquet?

I see that parquet supports dictionary encoding on a per-column basis, and that dictionary encoding is described in the GitHub documentation: Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8) The dictionary encoding builds a…
Jason S • 184,598 • 164 • 608 • 970
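A minimal sketch of the per-column control pyarrow exposes here: write_table takes a use_dictionary argument that is either a bool or a list of column names (the file names and data below are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Dictionary encoding pays off on low-cardinality columns and can
# hurt on high-cardinality ones, so pyarrow lets you pick per column.
table = pa.table({
    "country": ["US", "US", "DE", "US"],  # few distinct values
    "uuid": ["a1", "b2", "c3", "d4"],     # mostly unique values
})

pq.write_table(table, "dict_all.parquet", use_dictionary=True)
pq.write_table(table, "dict_some.parquet", use_dictionary=["country"])
```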
4 votes, 1 answer

How to properly create an apache plasma store from within a python program?

If I run the following as a program: import subprocess subprocess.run(['plasma_store -m 10000000000 -s /tmp/plasma'], shell=True, capture_output=True) and then run the program that uses the plasma store in a separate terminal, everything works…
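The usual fix for this pattern is to launch the store as a background process with subprocess.Popen rather than subprocess.run, which blocks until the store process exits. A sketch, assuming a pyarrow version that still ships Plasma (it was removed in pyarrow 12.0):

```python
import subprocess
import time

import pyarrow.plasma as plasma  # Plasma was removed in pyarrow 12.0

# Popen returns immediately and keeps the store running in the background;
# subprocess.run would block forever waiting for the store to exit.
store = subprocess.Popen(
    ["plasma_store", "-m", "1000000000", "-s", "/tmp/plasma"]
)
time.sleep(1)  # crude wait for the socket file to appear

client = plasma.connect("/tmp/plasma")
# ... put/get objects here ...
client.disconnect()
store.terminate()
```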
4 votes, 1 answer

apache arrow - adequacy for parallel processing

I have a huge dataset and am using Apache Spark for data processing. Using Apache Arrow, we can convert Spark-compatible data-frame to Pandas-compatible data-frame and run operations on it. By converting the data-frame, will it achieve the…
aysh • 493 • 1 • 11 • 23
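For context: Arrow here only accelerates the columnar transfer between the JVM and Python; the converted frame is a single pandas DataFrame on the driver, so the pandas operations themselves are not distributed. A sketch of enabling the Arrow path in Spark 3.x:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")  # Spark 3.x key
    .getOrCreate()
)

sdf = spark.range(1_000_000)
# Fast columnar transfer, but the result lives entirely on the driver.
pdf = sdf.toPandas()
```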
4 votes, 3 answers

unable to install.packages('arrow') to read parquet file (read_parquet). Any other way to read parquet file or use any different library?

I'm very new to R or even bash. I'm trying to read a Parquet file from my local machine using the read_parquet function, but it requires installing the arrow library: install.packages('arrow'), which is taking forever (read: stuck/hung at the installation step) on…
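If R's arrow build keeps hanging, one workaround is to read the same file from Python, where pyarrow installs as a prebuilt wheel. A sketch (the path is a placeholder):

```python
import pandas as pd

# pandas delegates Parquet reading to a pyarrow (or fastparquet) engine
df = pd.read_parquet("data.parquet", engine="pyarrow")
print(df.head())
```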
4 votes, 2 answers

How does Pyarrow read_csv handle different file encodings?

I have a .dat file that I had been reading with pd.read_csv and always needed to use encoding="latin" for it to read properly / without error. When I use pyarrow.csv.read_csv I don't see a parameter to select the encoding of the file but it still…
matthewmturner • 566 • 7 • 21
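pyarrow.csv.read_csv does accept an encoding, just not as a direct keyword: it lives on ReadOptions and defaults to UTF-8. A sketch with the file name as a placeholder:

```python
import pyarrow.csv as pv

# the encoding is configured on ReadOptions, not on read_csv itself
opts = pv.ReadOptions(encoding="latin1")
table = pv.read_csv("data.dat", read_options=opts)
```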
4 votes, 3 answers

Generating large DataFrame in a distributed way in pyspark efficiently (without pyspark.sql.Row)

The problem boils down to the following: I want to generate a DataFrame in pyspark using existing parallelized collection of inputs and a function which given one input can generate a relatively large batch of rows. In the example below I want to…
Alexander Pivovarov • 4,850 • 1 • 11 • 34
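One common shape for this problem: parallelize the inputs, flatMap the batch-generating function so it emits plain tuples, and hand the result plus a DDL schema string to createDataFrame, skipping pyspark.sql.Row entirely. A sketch with a made-up generator:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def expand(seed):
    # hypothetical generator: one input yields a batch of rows as tuples
    return [(seed, i, seed * i) for i in range(1000)]

rdd = spark.sparkContext.parallelize(range(100)).flatMap(expand)
df = spark.createDataFrame(rdd, schema="seed INT, i INT, value INT")
```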
4 votes, 1 answer

How to read the arrow parquet key value metadata?

When I save a parquet file in R and Python (using pyarrow) I get an Arrow schema string saved in the metadata. How do I read the metadata? Is it Flatbuffer encoded data? Where is the definition for the schema? It's not listed on the arrow…
xiaodai • 14,889 • 18 • 76 • 140
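For reference, the ARROW:schema value is a base64-encoded Arrow IPC schema message, which is Flatbuffer-encoded internally; pyarrow can surface both the raw bytes and the decoded schema. A sketch (the file name is a placeholder):

```python
import base64

import pyarrow.parquet as pq

meta = pq.ParquetFile("data.parquet").metadata
kv = meta.metadata  # key/value metadata as a bytes -> bytes dict

# the value is a base64-encoded Arrow IPC message (Flatbuffer inside)
raw = base64.b64decode(kv[b"ARROW:schema"])
print(len(raw), "bytes of serialized schema")

# pyarrow decodes it for you when reading the schema directly
print(pq.read_schema("data.parquet"))
```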
4 votes, 1 answer

Storing Parquet file partitioning columns in different files

I'd like to store a tabular dataset in parquet format, using different files for different column groups. Is it possible to partition the parquet file column-wise? If so, is it possible to do it using python (pyarrow)? I have a large dataset that…
user2304916 • 7,882 • 5 • 39 • 53
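Parquet's built-in (hive-style) partitioning splits by row values, not by columns, so column groups have to be written as separate files sharing a join key. A sketch of that workaround (names and data are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "a": [0.1, 0.2, 0.3], "b": ["x", "y", "z"]})

# one file per column group, each keeping the "id" key for later joins
pq.write_table(table.select(["id", "a"]), "cols_a.parquet")
pq.write_table(table.select(["id", "b"]), "cols_b.parquet")
```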
4 votes, 0 answers

Trying to understand how Apache Arrow's in-memory store works

Apache Arrow is an in-memory serialization format. Part of Arrow is Plasma, an in-memory object store designed to share data efficiently among processes on the same machine. I want to get a better understanding of what this is. My first-order…
SWV • 195 • 2 • 6
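A minimal illustration of what Plasma provides: objects are placed in shared memory and addressed by ID, so any local process connected to the same socket can read them without copying through a pipe. The sketch assumes a store is already running at the given (placeholder) socket path and a pre-12.0 pyarrow:

```python
import pyarrow.plasma as plasma  # Plasma was removed in pyarrow 12.0

client = plasma.connect("/tmp/plasma")  # socket path is a placeholder

object_id = client.put([1, 2, 3])  # serialize into shared memory
print(client.get(object_id))       # readable by any process on this store
```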
4 votes, 1 answer

Is there a way to set a minimum batch size for a pandas_udf in PySpark?

I am using a pandas_udf to apply a machine learning model on my spark cluster and am interested in predefining the minimum number of records sent via arrow to the UDF. I followed the databricks tutorial for the bulk of the UDF…
Jlanday • 112 • 5
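As far as I know Spark only exposes an upper bound on Arrow batch size (spark.sql.execution.arrow.maxRecordsPerBatch); a minimum can only be approached indirectly by giving each partition more rows. A sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# upper bound on records per Arrow batch; there is no minimum setting
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")

# fewer partitions -> more rows per partition -> larger batches on average
df = spark.range(1_000_000).coalesce(8)
```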
4 votes, 2 answers

Mysterious 'pyarrow.lib.ArrowInvalid: Floating point value truncated' ERROR when using toPandas() on a DataFrame in pyspark

I use toPandas() on a DataFrame which is not very large, but I get the following exception: 18/10/31 19:13:19 ERROR Executor: Exception in task 127.2 in stage 13.0 (TID 2264) org.apache.spark.api.python.PythonException: Traceback (most recent call…
Hao • 43 • 1 • 5
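This error usually means the Arrow conversion would have to silently truncate values (for example float to int when schemas disagree). Two common workarounds, sketched on a stand-in frame:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1.5,), (2.0,)], ["value"])  # stand-in frame

# make the lossy cast explicit so Arrow never has to truncate
pdf = sdf.withColumn("value", F.col("value").cast("double")).toPandas()

# or, the blunt fallback: disable the Arrow path entirely
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
```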
4 votes, 1 answer

What is the difference between Apache Spark and Apache Arrow?

What are the differences between Apache Arrow and Apache Spark? Will Apache Arrow replace Hadoop?
Wanderer • 447 • 3 • 11 • 20
3 votes, 0 answers

pyarrow memory consumption difference between Dataset.to_batches and ParquetFile.iter_batches

I am using pyarrow and am struggling to understand the big difference in memory usage between the Dataset.to_batches method compared to ParquetFile.iter_batches. Using pyarrow.dataset >>> import pyarrow as pa >>> import pyarrow.dataset as ds >>> >>>…
teejay • 103 • 8
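The two APIs differ in how eagerly they read: Dataset scanning can read ahead across row groups and fragments in background threads, so more batches may be resident at once, while ParquetFile streams batches one at a time. A side-by-side sketch (the file name is a placeholder):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# scanner-based: background readahead can hold several batches in memory
for batch in ds.dataset("data.parquet").to_batches(batch_size=65_536):
    pass

# file-based: row groups are streamed one batch at a time
for batch in pq.ParquetFile("data.parquet").iter_batches(batch_size=65_536):
    pass
```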
3 votes, 2 answers

How to migrate pandas code to pandas arrow?

Recently pandas 2.0 added support for arrow datatypes, which seem to have many advantages over the standard datatypes, both in speed and in NaN support. I need to migrate a large code base to pandas arrow and I was wondering which kind of problems I may…
Ziur Olpa • 1,839 • 1 • 12 • 27
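The two usual entry points for such a migration: ask readers for Arrow-backed dtypes up front, or convert existing frames in place. A sketch on pandas >= 2.0 (the file name is a placeholder):

```python
import pandas as pd

# new frames: request arrow-backed dtypes directly from the reader
df = pd.read_csv("data.csv", dtype_backend="pyarrow")

# existing frames: convert after the fact
df2 = df.convert_dtypes(dtype_backend="pyarrow")
print(df2.dtypes)  # e.g. int64[pyarrow], string[pyarrow]
```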
3 votes, 1 answer

Very slow aggregate on Pandas 2.0 dataframe with pyarrow as dtype_backend

Let's say I have the following dataframe: Code Price AA1 10 AA1 20 BB2 30 And I want to perform the following operation on it: df.groupby("code").aggregate({ "price": "sum" }) I have tried playing with the new pyarrow dtypes…
ADEL NAMANI • 171 • 1 • 2 • 12
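A stand-in reproduction of the setup, useful for timing the two backends against each other (column names follow the sample in the question; the slowdown reported here affected early pandas 2.0 releases):

```python
import pandas as pd

df = pd.DataFrame({"Code": ["AA1", "AA1", "BB2"], "Price": [10, 20, 30]})
df_arrow = df.convert_dtypes(dtype_backend="pyarrow")

# time this against plain df.groupby(...) to compare backends
out = df_arrow.groupby("Code").aggregate({"Price": "sum"})
print(out)
```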