Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1078 questions
4 votes, 2 answers

merge parquet files with different schemas using pandas and dask

I have a parquet directory with around 1000 files, and the schemas differ. I want to merge all those files into an optimal number of files with repartitioning. I am using pandas with pyarrow to read each partition file from the directory and…
Learnis • 526 • 5 • 25
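
A minimal sketch of the pandas/pyarrow side of this (`parquet_dir` and `merged.parquet` are placeholder paths; dask repartitioning would be layered on top):

```python
import glob
import pandas as pd

# Read every partition file individually; pandas aligns the differing
# schemas during concat by taking the union of all columns.
frames = [pd.read_parquet(path, engine="pyarrow")
          for path in glob.glob("parquet_dir/*.parquet")]
merged = pd.concat(frames, ignore_index=True, sort=False)

# Write back out as a single consolidated file.
merged.to_parquet("merged.parquet", engine="pyarrow")
```
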
4 votes, 1 answer

How to read the arrow parquet key-value metadata?

When I save a parquet file in R or Python (using pyarrow), I get an arrow schema string saved in the metadata. How do I read that metadata? Is it Flatbuffer-encoded data? Where is the definition of the schema? It's not listed on the arrow…
xiaodai • 14,889 • 18 • 76 • 140
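
A sketch of getting at the key-value metadata with pyarrow (`example.parquet` is a placeholder); the `ARROW:schema` value is a base64-encoded Arrow IPC schema message, which is Flatbuffer-serialized internally:

```python
import pyarrow.parquet as pq

meta = pq.read_metadata("example.parquet")
kv = meta.metadata                 # dict of bytes -> bytes key-value pairs
print(kv.get(b"ARROW:schema"))     # base64-encoded Arrow schema blob
```
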
4 votes, 1 answer

Pyspark: pyarrow.lib.ArrowInvalid: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)

I have a dataframe like the following: df.show(5, False) +------------------------------------+-------------------+--------+-------+--------+ |ID |timestamp |accuracy|lat |lon …
emax • 6,965 • 19 • 74 • 141
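
Errors of this class often trace back to a Spark/pyarrow version mismatch. One hedged workaround sketch, assuming an active SparkSession `spark` and a dataframe `df` (the config key shown is the Spark 2.x one):

```python
# Disabling Arrow falls back to the slower row-based conversion path,
# which avoids the codec error at the cost of performance.
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
pdf = df.toPandas()
```
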
4 votes, 1 answer

Reading a huge .csv file in Jupyter Notebook

I'm trying to read data from a .csv file in Jupyter Notebook (Python). The .csv file is 8.5G, with 70 million rows and 30 columns. When I try to read the .csv, I get errors. Below is my code: import pandas as pd log = pd.read_csv('log_20100424.csv', engine =…
jwowowo • 41 • 1 • 2
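
Two common ways around memory errors here are chunked reads in pandas, or pyarrow's multithreaded CSV reader. A sketch of the latter, reusing the filename from the question:

```python
from pyarrow import csv

# pyarrow reads the CSV in parallel and holds it as an Arrow table,
# which is considerably more memory-efficient than a pandas DataFrame.
table = csv.read_csv("log_20100424.csv")
df = table.to_pandas()
```
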
4 votes, 0 answers

Pyarrow table write with two depth struct schema raises "Nested column branch had multiple children"

I'm trying to write the following table with pyarrow to a parquet file: In [61]: values = [{"field_a": {"square": i**2, "cube": i**3}, "field_b": {"foo": "bar"}} for i in range(10)] In [62]: somedf = pd.DataFrame({"calculations": values}) In [63]:…
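
A runnable version of the setup in the question; on the pyarrow releases current at the time, the final write raised the "Nested column branch had multiple children" error because the Parquet writer could not yet handle multi-child nested structs:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

values = [{"field_a": {"square": i**2, "cube": i**3},
           "field_b": {"foo": "bar"}} for i in range(10)]
somedf = pd.DataFrame({"calculations": values})

table = pa.Table.from_pandas(somedf)
pq.write_table(table, "calculations.parquet")  # raised the error on old pyarrow
```
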
4 votes, 1 answer

Storing Parquet file partitioning columns in different files

I'd like to store a tabular dataset in parquet format, using different files for different column groups. Is it possible to partition the parquet file column-wise? If so, is it possible to do it using python (pyarrow)? I have a large dataset that…
user2304916 • 7,882 • 5 • 39 • 53
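
Parquet partitions row-wise rather than column-wise, so one hedged sketch on recent pyarrow is to split the table into column groups yourself and write each group to its own file (names below are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Map each output file to the columns it should carry.
column_groups = {"group_ab.parquet": ["a", "b"],
                 "group_c.parquet": ["c"]}
for path, cols in column_groups.items():
    pq.write_table(table.select(cols), path)
```
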
4 votes, 1 answer

Read/Write Parquet with Struct column type

I am trying to write a Dataframe like this to Parquet:

| foo | bar               |
|-----|-------------------|
| 1   | {"a": 1, "b": 10} |
| 2   | {"a": 2, "b": 20} |
| 3   | {"a": 3, "b": 30} |

I am doing it with Pandas and Fastparquet: df =…
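
With pyarrow (instead of fastparquet) the dict column maps cleanly to a Parquet struct; a minimal sketch with the data from the question:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"foo": [1, 2, 3],
                   "bar": [{"a": 1, "b": 10},
                           {"a": 2, "b": 20},
                           {"a": 3, "b": 30}]})

# The dicts become a struct<a: int64, b: int64> column in Arrow.
table = pa.Table.from_pandas(df)
pq.write_table(table, "structs.parquet")
print(pq.read_table("structs.parquet").column("bar").type)
```
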
4 votes, 1 answer

Write large pandas dataframe as parquet with pyarrow

I'm trying to write a large pandas dataframe (shape 4247x10). Nothing special, just using the following code: df_base = read_from_google_storage() df_base.to_parquet(courses.CORE_PATH, engine='pyarrow', …
sann05 • 414 • 7 • 18
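
For reference, the write itself reduces to a single call; a neutral sketch with placeholder names, since the original snippet is truncated:

```python
import pandas as pd

df = pd.DataFrame({"col": range(4247)})  # stand-in for df_base
df.to_parquet("output.parquet", engine="pyarrow",
              compression="snappy")       # assumed kwargs, not from the question
```
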
4 votes, 2 answers

pyarrow.lib.ArrowIOError: Invalid Parquet file size is 0 bytes

I'm trying to do something like this, reading a list of files from an S3 bucket into a pyarrow table. If I specify the filename I can do: from pyarrow.parquet import ParquetDataset import s3fs dataset = ParquetDataset( …
LondonRob • 73,083 • 37 • 144 • 201
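
A hedged sketch for the legacy ParquetDataset API: filter out the zero-byte objects (often S3 "directory marker" keys) before building the dataset. Bucket and prefix names are placeholders:

```python
import s3fs
from pyarrow.parquet import ParquetDataset

fs = s3fs.S3FileSystem()

# Skip the zero-byte objects that trigger the "file size is 0 bytes" error.
paths = [p for p in fs.ls("my-bucket/my-prefix")
         if fs.info(p)["size"] > 0]

dataset = ParquetDataset(paths, filesystem=fs)
table = dataset.read()
```
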
4 votes, 1 answer

Why do partitioned parquet files consume more disk space?

I am learning about parquet files using python and pyarrow. Parquet is great at compression and minimizing disk space. My dataset is a 190MB csv file which ends up as a single 3MB file when saved as a snappy-compressed parquet file. However, when I am…
addicted • 2,901 • 3 • 28 • 49
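
A sketch that reproduces the comparison (the partition column name is hypothetical). With a high-cardinality partition column, many small files each pay metadata and footer overhead and encode less efficiently, so the partitioned total comes out larger:

```python
import pandas as pd

df = pd.read_csv("data.csv")            # the ~190MB CSV from the question

df.to_parquet("single.parquet", compression="snappy")
df.to_parquet("partitioned/",           # one file per partition value
              partition_cols=["some_column"],
              compression="snappy")
```
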
4 votes, 1 answer

Pyarrow Dataset read specific columns and specific rows

Is there a way to use a pyarrow parquet dataset to read specific columns and, if possible, filter data instead of reading the whole file into a dataframe?
Punter Vicky • 15,954 • 56 • 188 • 315
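
Yes; the pyarrow.dataset API supports both column projection and row filtering with predicate pushdown. A sketch with placeholder path and column names:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("data_dir/", format="parquet")

# Only the requested columns are read, and the filter is pushed down
# to skip row groups whose statistics rule them out.
table = dataset.to_table(columns=["id", "value"],
                         filter=ds.field("value") > 100)
df = table.to_pandas()
```
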
4 votes, 1 answer

Parquet issue with inferring schema on int column containing Null

I am reading the s3 key and converting it into parquet using pandas. Before converting to parquet, I am type-casting it so that pyarrow can infer the schema correctly. The snippet looks something like this: df =…
Gagan • 1,775 • 5 • 31 • 59
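
A hedged sketch of the usual fix: pandas' nullable integer dtype keeps NULLs without upcasting the column to float, so pyarrow infers an integer Parquet type. Path and column name are illustrative:

```python
import pandas as pd

df = pd.read_csv("s3://bucket/key.csv")   # placeholder S3 path

# Plain int columns with NULLs silently become float64; the nullable
# "Int64" extension dtype preserves both the NULLs and the int type.
df["int_col"] = df["int_col"].astype("Int64")
df.to_parquet("out.parquet", engine="pyarrow")
```
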
4 votes, 1 answer

How to catch Python UDF exceptions when using PyArrow

When PyArrow is enabled, Pandas UDF exceptions raised by the Executor become impossible to catch: see the example below. Is this expected behavior? If so, what is the rationale? If not, how do I fix it? Behavior confirmed in PyArrow 0.11 and 0.14.1…
valend.in • 383 • 1 • 2 • 8
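
A sketch of the catching side, assuming a Spark session, a dataframe `df`, and a pandas UDF `my_pandas_udf` already defined. With Arrow enabled the Python exception surfaces wrapped in a JVM error, so one hedged workaround is to catch the Py4J wrapper:

```python
from py4j.protocol import Py4JJavaError

try:
    df.withColumn("out", my_pandas_udf("in")).collect()  # hypothetical UDF
except Py4JJavaError as err:
    # The original Python traceback is embedded in the Java exception text.
    print(str(err.java_exception))
```
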
4 votes, 0 answers

Is there a way to increase Binary Array capacity in ray/pyarrow?

Is there a way to increase the BinaryArray limit in pyarrow? I'm hitting this exception when using ray.get: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483655
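
The 2147483646-byte figure is the per-array limit of BinaryArray's 32-bit offsets. A sketch of two ways around it on the pyarrow side: chunking, or the 64-bit-offset large_binary type:

```python
import pyarrow as pa

# Option 1: split the values across several smaller arrays.
chunks = [pa.array([b"x" * 1024] * 1000, type=pa.binary())
          for _ in range(3)]
col = pa.chunked_array(chunks)

# Option 2: large_binary uses 64-bit offsets and lifts the 2 GiB cap.
big = pa.array([b"payload"], type=pa.large_binary())
```
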
4 votes, 2 answers

UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings

I am running spark 2.4.2 locally through pyspark for an ML project in NLP. Part of the pre-processing in the Pipeline involves the use of pandas_udf functions optimized through pyarrow. Each time I operate on the pre-processed spark dataframe…
Ferran • 840 • 9 • 18
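
The warning comes from Spark's bundled code still calling the old top-level `pyarrow.open_stream`; until Spark itself is updated, one sketch is to filter just that message:

```python
import warnings

# Silence only this specific deprecation message rather than all warnings.
warnings.filterwarnings(
    "ignore",
    message="pyarrow.open_stream is deprecated",
)
```
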