Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1078 questions
9 votes • 3 answers

How to open huge parquet file using Pandas without enough RAM

I am trying to read a decently large Parquet file (~2 GB with about ~30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries which the read_parquet…
qxzsilver • 522
9 votes • 2 answers

Unable to read a parquet file

I am breaking my head over this right now. I am new to parquet files, and I am running into a LOT of issues with them. I am thrown an error that reads OSError: Passed non-file path: \datasets\proj\train\train.parquet each time I try to create a…
Anonymous Person • 1,437
9 votes • 2 answers

How to efficiently split a large dataframe into many parquet files?

Consider the following dataframe import pandas as pd import numpy as np import pyarrow.parquet as pq import pyarrow as pa idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000', freq = 'T') dataframe = pd.DataFrame({'numeric_col'…
ℕʘʘḆḽḘ • 18,566
8 votes • 2 answers

Can I store a Parquet file with a dictionary column having mixed types in their values?

I am trying to store a Python Pandas DataFrame as a Parquet file, but I am experiencing some issues. One of the columns of my Pandas DF contains dictionaries, as such: import pandas as pd df = pd.DataFrame({ "ColA": [1, 2, 3], "ColB":…
juangesino • 444
8 votes • 1 answer

AWS Athena: HIVE_BAD_DATA ERROR: Field type DOUBLE in parquet is incompatible with type defined in table schema

I use AWS Athena to query some data stored in S3, namely partitioned parquet files with pyarrow compression. I have three columns with string values, one column called "key" with int values and one column called "result" which have both double and…
Sarathy Velmurugan • 123
8 votes • 2 answers

PySpark pandas_udfs java.lang.IllegalArgumentException error

Does anyone have experience using pandas UDFs on a local pyspark session running on Windows? I've used them on linux with good results, but I've been unsuccessful on my Windows…
Matt • 83
8 votes • 3 answers

No module named 'pyarrow._orc'

I have a problem using pyarrow.orc module in Anaconda on Windows 10. import pyarrow.orc as orc throws an exception: Traceback (most recent call last): File "", line 1, in File…
rwiatr • 83
8 votes • 3 answers

How to assign arbitrary metadata to pyarrow.Table / Parquet columns

Use-case I am using Apache Parquet files as a fast IO format for large-ish spatial data that I am working on in Python with GeoPandas. I am storing feature geometries as WKB and would like to record the coordinate reference system (CRS) as metadata…
d.arcy • 83
8 votes • 2 answers

Pyarrow apply schema when using pandas to_parquet()

I have a very wide data frame (20,000 columns) that is mainly made up of float64 columns in Pandas. I want to cast these columns to float32 and write to Parquet format. I am doing this because the downstream users of these files are small containers…
warwickh • 189
8 votes • 1 answer

Pandas Dataframe Parquet Data Types?

I am trying to use Pandas and PyArrow to write data to Parquet. I have hundreds of parquet files that don't need to have the same schema, but if columns match across parquets they must have the same data type. I'm getting into situations where the resulting…
micah • 7,596
8 votes • 1 answer

RuntimeError: Unsupported type in conversion to Arrow: VectorUDT

I want to convert a big Spark data frame with more than 1,000,000 rows to Pandas. I tried to convert a Spark data frame to a Pandas data frame using the following code: spark.conf.set("spark.sql.execution.arrow.enabled", "true") result.toPandas() But,…
Saeid SOHEILY KHAH • 747
8 votes • 2 answers

pandas to_parquet fails on large datasets

I'm trying to save a very large dataset using pandas to_parquet, and it seems to fail when exceeding a certain limit, both with 'pyarrow' and 'fastparquet'. I reproduced the errors I am getting with the following code, and would be happy to hear…
kenissur • 171
8 votes • 1 answer

Assign schema to pa.Table.from_pandas()

I'm getting this error when transforming a pandas.DataFrame to Parquet using PyArrow: ArrowInvalid('Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer'). To find out which column is the…
Carlos P Ceballos • 384
7 votes • 1 answer

How to store custom Parquet Dataset metadata with pyarrow?

How do I store custom metadata to a ParquetDataset using pyarrow? For example, if I create a Parquet dataset using Dask import dask dask.datasets.timeseries().to_parquet('temp.parq') I can then read it using pyarrow import pyarrow.parquet as…
Dahn • 1,397
7 votes • 3 answers

"Could NOT find Arrow" error when using pip_pypy3 to install pyarrow

I am trying to use pypy3 to install pyarrow, but some errors occur. Basic information is below: macOS 10.15.7, Xcode 12.3, Python 3.7.9, pypy3 7.3.3, pyarrow 0.17.1; the command is 'pip_pypy3 install pyarrow==0.17.1'. Some key information…
Long.zhao • 1,085