Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1078 questions
4 votes, 2 answers

merge parquet files with different schemas using pandas and dask

I have a parquet directory with around 1000 files, and the schemas differ. I want to merge all those files into an optimal number of files with repartitioning. I am using pandas with pyarrow to read each partition file from the directory and…
Learnis • 526 • 5 • 25
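
A minimal sketch of the pandas/pyarrow side of this (`parquet_dir` and `merged.parquet` are placeholder paths; dask repartitioning would be layered on top):

```python
import glob
import pandas as pd

# Read every partition file individually; pandas aligns the differing
# schemas during concat by taking the union of all columns.
frames = [pd.read_parquet(path, engine="pyarrow")
          for path in glob.glob("parquet_dir/*.parquet")]
merged = pd.concat(frames, ignore_index=True, sort=False)

# Write back out as a single consolidated file.
merged.to_parquet("merged.parquet", engine="pyarrow")
```
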
4 votes, 1 answer

How to read the arrow parquet key-value metadata?

When I save a parquet file in R or Python (using pyarrow), I get an arrow schema string saved in the metadata. How do I read that metadata? Is it Flatbuffer-encoded data? Where is the definition of the schema? It's not listed on the arrow…
xiaodai • 14,889 • 18 • 76 • 140
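
A sketch of getting at the key-value metadata with pyarrow (`example.parquet` is a placeholder); the `ARROW:schema` value is a base64-encoded Arrow IPC schema message, which is Flatbuffer-serialized internally:

```python
import pyarrow.parquet as pq

meta = pq.read_metadata("example.parquet")
kv = meta.metadata                 # dict of bytes -> bytes key-value pairs
print(kv.get(b"ARROW:schema"))     # base64-encoded Arrow schema blob
```
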
4 votes, 1 answer

Pyspark: pyarrow.lib.ArrowInvalid: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)

I have a dataframe like the following: df.show(5, False) +------------------------------------+-------------------+--------+-------+--------+ |ID |timestamp |accuracy|lat |lon …
emax • 6,965 • 19 • 74 • 141
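
Errors of this class often trace back to a Spark/pyarrow version mismatch. One hedged workaround sketch, assuming an active SparkSession `spark` and a dataframe `df` (the config key shown is the Spark 2.x one):

```python
# Disabling Arrow falls back to the slower row-based conversion path,
# which avoids the codec error at the cost of performance.
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
pdf = df.toPandas()
```
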
4 votes, 1 answer

Reading a huge .csv file in Jupyter Notebook

I'm trying to read data from a .csv file in Jupyter Notebook (Python). The .csv file is 8.5G, with 70 million rows and 30 columns. When I try to read the .csv, I get errors. Below is my code: import pandas as pd log = pd.read_csv('log_20100424.csv', engine =…
jwowowo • 41 • 1 • 2
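
Two common ways around memory errors here are chunked reads in pandas, or pyarrow's multithreaded CSV reader. A sketch of the latter, reusing the filename from the question:

```python
from pyarrow import csv

# pyarrow reads the CSV in parallel and holds it as an Arrow table,
# which is considerably more memory-efficient than a pandas DataFrame.
table = csv.read_csv("log_20100424.csv")
df = table.to_pandas()
```
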
4 votes, 0 answers

Pyarrow table write with two depth struct schema raises "Nested column branch had multiple children"

I'm trying to write the following table with pyarrow to a parquet file: In [61]: values = [{"field_a": {"square": i**2, "cube": i**3}, "field_b": {"foo": "bar"}} for i in range(10)] In [62]: somedf = pd.DataFrame({"calculations": values}) In [63]:…
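
A runnable version of the setup in the question; on the pyarrow releases current at the time, the final write raised the "Nested column branch had multiple children" error because the Parquet writer could not yet handle multi-child nested structs:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

values = [{"field_a": {"square": i**2, "cube": i**3},
           "field_b": {"foo": "bar"}} for i in range(10)]
somedf = pd.DataFrame({"calculations": values})

table = pa.Table.from_pandas(somedf)
pq.write_table(table, "calculations.parquet")  # raised the error on old pyarrow
```
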
4 votes, 1 answer

Storing Parquet file partitioning columns in different files

I'd like to store a tabular dataset in parquet format, using different files for different column groups. Is it possible to partition the parquet file column-wise? If so, is it possible to do it using python (pyarrow)? I have a large dataset that…
user2304916 • 7,882 • 5 • 39 • 53
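
Parquet partitions row-wise rather than column-wise, so one hedged sketch on recent pyarrow is to split the table into column groups yourself and write each group to its own file (names below are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Map each output file to the columns it should carry.
column_groups = {"group_ab.parquet": ["a", "b"],
                 "group_c.parquet": ["c"]}
for path, cols in column_groups.items():
    pq.write_table(table.select(cols), path)
```
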
4 votes, 1 answer

Read/Write Parquet with Struct column type

I am trying to write a Dataframe like this to Parquet:

| foo | bar               |
|-----|-------------------|
| 1   | {"a": 1, "b": 10} |
| 2   | {"a": 2, "b": 20} |
| 3   | {"a": 3, "b": 30} |

I am doing it with Pandas and Fastparquet: df =…
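
With pyarrow (instead of fastparquet) the dict column maps cleanly to a Parquet struct; a minimal sketch with the data from the question:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"foo": [1, 2, 3],
                   "bar": [{"a": 1, "b": 10},
                           {"a": 2, "b": 20},
                           {"a": 3, "b": 30}]})

# The dicts become a struct<a: int64, b: int64> column in Arrow.
table = pa.Table.from_pandas(df)
pq.write_table(table, "structs.parquet")
print(pq.read_table("structs.parquet").column("bar").type)
```
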
4 votes, 1 answer

Write large pandas dataframe as parquet with pyarrow

I'm trying to write a large pandas dataframe (shape 4247x10). Nothing special, just using the following code: df_base = read_from_google_storage() df_base.to_parquet(courses.CORE_PATH, engine='pyarrow', …
sann05 • 414 • 7 • 18
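
For reference, the write itself reduces to a single call; a neutral sketch with placeholder names, since the original snippet is truncated:

```python
import pandas as pd

df = pd.DataFrame({"col": range(4247)})  # stand-in for df_base
df.to_parquet("output.parquet", engine="pyarrow",
              compression="snappy")       # assumed kwargs, not from the question
```
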
4 votes, 2 answers

pyarrow.lib.ArrowIOError: Invalid Parquet file size is 0 bytes

I'm trying to do something like this, reading a list of files from an S3 bucket into a pyarrow table. If I specify the filename I can do: from pyarrow.parquet import ParquetDataset import s3fs dataset = ParquetDataset( …
LondonRob • 73,083 • 37 • 144 • 201
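
A hedged sketch for the legacy ParquetDataset API: filter out the zero-byte objects (often S3 "directory marker" keys) before building the dataset. Bucket and prefix names are placeholders:

```python
import s3fs
from pyarrow.parquet import ParquetDataset

fs = s3fs.S3FileSystem()

# Skip the zero-byte objects that trigger the "file size is 0 bytes" error.
paths = [p for p in fs.ls("my-bucket/my-prefix")
         if fs.info(p)["size"] > 0]

dataset = ParquetDataset(paths, filesystem=fs)
table = dataset.read()
```
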
4 votes, 1 answer

Why do partitioned parquet files consume more disk space?

I am learning about parquet files using python and pyarrow. Parquet is great at compression and minimizing disk space. My dataset is a 190MB csv file which ends up as a single 3MB file when saved as a snappy-compressed parquet file. However, when I am…
addicted • 2,901 • 3 • 28 • 49
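
A sketch that reproduces the comparison (the partition column name is hypothetical). With a high-cardinality partition column, many small files each pay metadata and footer overhead and encode less efficiently, so the partitioned total comes out larger:

```python
import pandas as pd

df = pd.read_csv("data.csv")            # the ~190MB CSV from the question

df.to_parquet("single.parquet", compression="snappy")
df.to_parquet("partitioned/",           # one file per partition value
              partition_cols=["some_column"],
              compression="snappy")
```
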
4 votes, 1 answer

Pyarrow Dataset read specific columns and specific rows

Is there a way to use a pyarrow parquet dataset to read specific columns and, if possible, filter data instead of reading the whole file into a dataframe?
Punter Vicky • 15,954 • 56 • 188 • 315
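
Yes; the pyarrow.dataset API supports both column projection and row filtering with predicate pushdown. A sketch with placeholder path and column names:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("data_dir/", format="parquet")

# Only the requested columns are read, and the filter is pushed down
# to skip row groups whose statistics rule them out.
table = dataset.to_table(columns=["id", "value"],
                         filter=ds.field("value") > 100)
df = table.to_pandas()
```
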
4 votes, 1 answer

Parquet issue with inferring schema on int column containing Null

I am reading the s3 key and converting it into parquet using pandas. Before converting to parquet, I am type-casting it so that pyarrow can infer the schema correctly. The snippet looks something like this: df =…
Gagan • 1,775 • 5 • 31 • 59
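
A hedged sketch of the usual fix: pandas' nullable integer dtype keeps NULLs without upcasting the column to float, so pyarrow infers an integer Parquet type. Path and column name are illustrative:

```python
import pandas as pd

df = pd.read_csv("s3://bucket/key.csv")   # placeholder S3 path

# Plain int columns with NULLs silently become float64; the nullable
# "Int64" extension dtype preserves both the NULLs and the int type.
df["int_col"] = df["int_col"].astype("Int64")
df.to_parquet("out.parquet", engine="pyarrow")
```
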
4 votes, 1 answer

How to catch Python UDF exceptions when using PyArrow

When PyArrow is enabled, Pandas UDF exceptions raised by the Executor become impossible to catch: see the example below. Is this expected behavior? If so, what is the rationale? If not, how do I fix it? Behavior confirmed in PyArrow 0.11 and 0.14.1…
valend.in • 383 • 1 • 2 • 8
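
A sketch of the catching side, assuming a Spark session, a dataframe `df`, and a pandas UDF `my_pandas_udf` already defined. With Arrow enabled the Python exception surfaces wrapped in a JVM error, so one hedged workaround is to catch the Py4J wrapper:

```python
from py4j.protocol import Py4JJavaError

try:
    df.withColumn("out", my_pandas_udf("in")).collect()  # hypothetical UDF
except Py4JJavaError as err:
    # The original Python traceback is embedded in the Java exception text.
    print(str(err.java_exception))
```
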
4 votes, 0 answers

Is there a way to increase Binary Array capacity in ray/pyarrow?

Is there a way to increase the BinaryArray limit in pyarrow? I'm hitting this exception when using ray.get: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483655
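
The 2147483646-byte figure is the per-array limit of BinaryArray's 32-bit offsets. A sketch of two ways around it on the pyarrow side: chunking, or the 64-bit-offset large_binary type:

```python
import pyarrow as pa

# Option 1: split the values across several smaller arrays.
chunks = [pa.array([b"x" * 1024] * 1000, type=pa.binary())
          for _ in range(3)]
col = pa.chunked_array(chunks)

# Option 2: large_binary uses 64-bit offsets and lifts the 2 GiB cap.
big = pa.array([b"payload"], type=pa.large_binary())
```
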
4 votes, 2 answers

UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings

I am running spark 2.4.2 locally through pyspark for an ML project in NLP. Part of the pre-processing in the Pipeline involves the use of pandas_udf functions optimized through pyarrow. Each time I operate on the pre-processed spark dataframe…
Ferran • 840 • 9 • 18
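
The warning comes from Spark's bundled code still calling the old top-level `pyarrow.open_stream`; until Spark itself is updated, one sketch is to filter just that message:

```python
import warnings

# Silence only this specific deprecation message rather than all warnings.
warnings.filterwarnings(
    "ignore",
    message="pyarrow.open_stream is deprecated",
)
```
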