Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the Apache Arrow installation guide: https://arrow.apache.org/install/

595 questions
4 votes, 1 answer

When should I use dictionary encoding in parquet?

I see that parquet supports dictionary encoding on a per-column basis, and that dictionary encoding is described in the GitHub documentation: Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8) The dictionary encoding builds a…
Jason S • 184,598 • 164 • 608 • 970
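A minimal sketch of the per-column control pyarrow exposes here: write_table takes a use_dictionary argument that is either a bool or a list of column names (the file names and data below are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Dictionary encoding pays off on low-cardinality columns and can
# hurt on high-cardinality ones, so pyarrow lets you pick per column.
table = pa.table({
    "country": ["US", "US", "DE", "US"],  # few distinct values
    "uuid": ["a1", "b2", "c3", "d4"],     # mostly unique values
})

pq.write_table(table, "dict_all.parquet", use_dictionary=True)
pq.write_table(table, "dict_some.parquet", use_dictionary=["country"])
```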
4 votes, 1 answer

How to properly create an apache plasma store from within a python program?

If I run the following as a program: import subprocess subprocess.run(['plasma_store -m 10000000000 -s /tmp/plasma'], shell=True, capture_output=True) and then run the program that uses the plasma store in a separate terminal, everything works…
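The usual fix for this pattern is to launch the store as a background process with subprocess.Popen rather than subprocess.run, which blocks until the store process exits. A sketch, assuming a pyarrow version that still ships Plasma (it was removed in pyarrow 12.0):

```python
import subprocess
import time

import pyarrow.plasma as plasma  # Plasma was removed in pyarrow 12.0

# Popen returns immediately and keeps the store running in the background;
# subprocess.run would block forever waiting for the store to exit.
store = subprocess.Popen(
    ["plasma_store", "-m", "1000000000", "-s", "/tmp/plasma"]
)
time.sleep(1)  # crude wait for the socket file to appear

client = plasma.connect("/tmp/plasma")
# ... put/get objects here ...
client.disconnect()
store.terminate()
```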
4 votes, 1 answer

apache arrow - adequacy for parallel processing

I have a huge dataset and am using Apache Spark for data processing. Using Apache Arrow, we can convert Spark-compatible data-frame to Pandas-compatible data-frame and run operations on it. By converting the data-frame, will it achieve the…
aysh • 493 • 1 • 11 • 23
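For context: Arrow here only accelerates the columnar transfer between the JVM and Python; the converted frame is a single pandas DataFrame on the driver, so the pandas operations themselves are not distributed. A sketch of enabling the Arrow path in Spark 3.x:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")  # Spark 3.x key
    .getOrCreate()
)

sdf = spark.range(1_000_000)
# Fast columnar transfer, but the result lives entirely on the driver.
pdf = sdf.toPandas()
```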
4 votes, 3 answers

unable to install.packages('arrow') to read parquet file (read_parquet). Any other way to read parquet file or use any different library?

I'm very new to R or even bash. I'm trying to read a Parquet file from my local machine using the read_parquet function, but it requires installing the arrow library: install.packages('arrow'), which is taking forever (read: stuck/hung at the installation step) on…
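If R's arrow build keeps hanging, one workaround is to read the same file from Python, where pyarrow installs as a prebuilt wheel. A sketch (the path is a placeholder):

```python
import pandas as pd

# pandas delegates Parquet reading to a pyarrow (or fastparquet) engine
df = pd.read_parquet("data.parquet", engine="pyarrow")
print(df.head())
```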
4 votes, 2 answers

How does Pyarrow read_csv handle different file encodings?

I have a .dat file that I had been reading with pd.read_csv and always needed to use encoding="latin" for it to read properly / without error. When I use pyarrow.csv.read_csv I don't see a parameter to select the encoding of the file but it still…
matthewmturner • 566 • 7 • 21
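pyarrow.csv.read_csv does accept an encoding, just not as a direct keyword: it lives on ReadOptions and defaults to UTF-8. A sketch with the file name as a placeholder:

```python
import pyarrow.csv as pv

# the encoding is configured on ReadOptions, not on read_csv itself
opts = pv.ReadOptions(encoding="latin1")
table = pv.read_csv("data.dat", read_options=opts)
```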
4 votes, 3 answers

Generating large DataFrame in a distributed way in pyspark efficiently (without pyspark.sql.Row)

The problem boils down to the following: I want to generate a DataFrame in pyspark using existing parallelized collection of inputs and a function which given one input can generate a relatively large batch of rows. In the example below I want to…
Alexander Pivovarov • 4,850 • 1 • 11 • 34
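One common shape for this problem: parallelize the inputs, flatMap the batch-generating function so it emits plain tuples, and hand the result plus a DDL schema string to createDataFrame, skipping pyspark.sql.Row entirely. A sketch with a made-up generator:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def expand(seed):
    # hypothetical generator: one input yields a batch of rows as tuples
    return [(seed, i, seed * i) for i in range(1000)]

rdd = spark.sparkContext.parallelize(range(100)).flatMap(expand)
df = spark.createDataFrame(rdd, schema="seed INT, i INT, value INT")
```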
4 votes, 1 answer

How to read the arrow parquet key value metadata?

When I save a parquet file in R and Python (using pyarrow) I get an Arrow schema string saved in the metadata. How do I read the metadata? Is it Flatbuffer encoded data? Where is the definition for the schema? It's not listed on the arrow…
xiaodai • 14,889 • 18 • 76 • 140
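For reference, the ARROW:schema value is a base64-encoded Arrow IPC schema message, which is Flatbuffer-encoded internally; pyarrow can surface both the raw bytes and the decoded schema. A sketch (the file name is a placeholder):

```python
import base64

import pyarrow.parquet as pq

meta = pq.ParquetFile("data.parquet").metadata
kv = meta.metadata  # key/value metadata as a bytes -> bytes dict

# the value is a base64-encoded Arrow IPC message (Flatbuffer inside)
raw = base64.b64decode(kv[b"ARROW:schema"])
print(len(raw), "bytes of serialized schema")

# pyarrow decodes it for you when reading the schema directly
print(pq.read_schema("data.parquet"))
```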
4 votes, 1 answer

Storing Parquet file partitioning columns in different files

I'd like to store a tabular dataset in parquet format, using different files for different column groups. Is it possible to partition the parquet file column-wise? If so, is it possible to do it using python (pyarrow)? I have a large dataset that…
user2304916 • 7,882 • 5 • 39 • 53
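Parquet's built-in (hive-style) partitioning splits by row values, not by columns, so column groups have to be written as separate files sharing a join key. A sketch of that workaround (names and data are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "a": [0.1, 0.2, 0.3], "b": ["x", "y", "z"]})

# one file per column group, each keeping the "id" key for later joins
pq.write_table(table.select(["id", "a"]), "cols_a.parquet")
pq.write_table(table.select(["id", "b"]), "cols_b.parquet")
```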
4 votes, 0 answers

Trying to understand how Apache Arrow's in-memory store works

Apache Arrow is an in-memory serialization format. Part of Arrow is Plasma, an in-memory object store designed to share data efficiently among processes on the same machine. I want to get a better understanding of what this is. My first-order…
SWV • 195 • 2 • 6
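A minimal illustration of what Plasma provides: objects are placed in shared memory and addressed by ID, so any local process connected to the same socket can read them without copying through a pipe. The sketch assumes a store is already running at the given (placeholder) socket path and a pre-12.0 pyarrow:

```python
import pyarrow.plasma as plasma  # Plasma was removed in pyarrow 12.0

client = plasma.connect("/tmp/plasma")  # socket path is a placeholder

object_id = client.put([1, 2, 3])  # serialize into shared memory
print(client.get(object_id))       # readable by any process on this store
```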
4 votes, 1 answer

Is there a way to set a minimum batch size for a pandas_udf in PySpark?

I am using a pandas_udf to apply a machine learning model on my spark cluster and am interested in predefining the minimum number of records sent via arrow to the UDF. I followed the databricks tutorial for the bulk of the UDF…
Jlanday • 112 • 5
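As far as I know Spark only exposes an upper bound on Arrow batch size (spark.sql.execution.arrow.maxRecordsPerBatch); a minimum can only be approached indirectly by giving each partition more rows. A sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# upper bound on records per Arrow batch; there is no minimum setting
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")

# fewer partitions -> more rows per partition -> larger batches on average
df = spark.range(1_000_000).coalesce(8)
```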
4 votes, 2 answers

Mysterious 'pyarrow.lib.ArrowInvalid: Floating point value truncated' ERROR when using toPandas() on a DataFrame in pyspark

I use toPandas() on a DataFrame which is not very large, but I get the following exception: 18/10/31 19:13:19 ERROR Executor: Exception in task 127.2 in stage 13.0 (TID 2264) org.apache.spark.api.python.PythonException: Traceback (most recent call…
Hao • 43 • 1 • 5
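This error usually means the Arrow conversion would have to silently truncate values (for example float to int when schemas disagree). Two common workarounds, sketched on a stand-in frame:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1.5,), (2.0,)], ["value"])  # stand-in frame

# make the lossy cast explicit so Arrow never has to truncate
pdf = sdf.withColumn("value", F.col("value").cast("double")).toPandas()

# or, the blunt fallback: disable the Arrow path entirely
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
```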
4 votes, 1 answer

What is the difference between Apache Spark and Apache Arrow?

What are the differences between Apache Arrow and Apache Spark? Will Apache Arrow replace Hadoop?
Wanderer • 447 • 3 • 11 • 20
3 votes, 0 answers

pyarrow memory consumption difference between Dataset.to_batches and ParquetFile.iter_batches

I am using pyarrow and am struggling to understand the big difference in memory usage between the Dataset.to_batches method compared to ParquetFile.iter_batches. Using pyarrow.dataset >>> import pyarrow as pa >>> import pyarrow.dataset as ds >>> >>>…
teejay • 103 • 8
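The two APIs differ in how eagerly they read: Dataset scanning can read ahead across row groups and fragments in background threads, so more batches may be resident at once, while ParquetFile streams batches one at a time. A side-by-side sketch (the file name is a placeholder):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# scanner-based: background readahead can hold several batches in memory
for batch in ds.dataset("data.parquet").to_batches(batch_size=65_536):
    pass

# file-based: row groups are streamed one batch at a time
for batch in pq.ParquetFile("data.parquet").iter_batches(batch_size=65_536):
    pass
```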
3 votes, 2 answers

How to migrate pandas code to pandas arrow?

Recently pandas 2.0 added support for arrow datatypes, which seem to have many advantages over the standard datatypes, both in speed and in NaN support. I need to migrate a large code base to pandas arrow and I was wondering which kind of problems I may…
Ziur Olpa • 1,839 • 1 • 12 • 27
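The two usual entry points for such a migration: ask readers for Arrow-backed dtypes up front, or convert existing frames in place. A sketch on pandas >= 2.0 (the file name is a placeholder):

```python
import pandas as pd

# new frames: request arrow-backed dtypes directly from the reader
df = pd.read_csv("data.csv", dtype_backend="pyarrow")

# existing frames: convert after the fact
df2 = df.convert_dtypes(dtype_backend="pyarrow")
print(df2.dtypes)  # e.g. int64[pyarrow], string[pyarrow]
```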
3 votes, 1 answer

Very slow aggregate on Pandas 2.0 dataframe with pyarrow as dtype_backend

Let's say I have the following dataframe: Code Price AA1 10 AA1 20 BB2 30 And I want to perform the following operation on it: df.groupby("code").aggregate({ "price": "sum" }) I have tried playing with the new pyarrow dtypes…
ADEL NAMANI • 171 • 1 • 2 • 12
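A stand-in reproduction of the setup, useful for timing the two backends against each other (column names follow the sample in the question; the slowdown reported here affected early pandas 2.0 releases):

```python
import pandas as pd

df = pd.DataFrame({"Code": ["AA1", "AA1", "BB2"], "Price": [10, 20, 30]})
df_arrow = df.convert_dtypes(dtype_backend="pyarrow")

# time this against plain df.groupby(...) to compare backends
out = df_arrow.groupby("Code").aggregate({"Price": "sum"})
print(out)
```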