Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1078 questions
9 votes • 3 answers

How to open huge parquet file using Pandas without enough RAM

I am trying to read a decently large Parquet file (~2 GB with about ~30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries which the read_parquet…
qxzsilver • 522
9 votes • 2 answers

Unable to read a parquet file

I am breaking my head over this right now. I am new to parquet files, and I am running into a LOT of issues with them. I am thrown an error that reads OSError: Passed non-file path: \datasets\proj\train\train.parquet each time I try to create a…
Anonymous Person • 1,437
9 votes • 2 answers

How to efficiently split a large dataframe into many parquet files?

Consider the following dataframe import pandas as pd import numpy as np import pyarrow.parquet as pq import pyarrow as pa idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000', freq = 'T') dataframe = pd.DataFrame({'numeric_col'…
ℕʘʘḆḽḘ • 18,566
8 votes • 2 answers

Can I store a Parquet file with a dictionary column having mixed types in their values?

I am trying to store a Python Pandas DataFrame as a Parquet file, but I am experiencing some issues. One of the columns of my Pandas DF contains dictionaries, as such: import pandas as pd df = pd.DataFrame({ "ColA": [1, 2, 3], "ColB":…
juangesino • 444
8 votes • 1 answer

AWS Athena: HIVE_BAD_DATA ERROR: Field type DOUBLE in parquet is incompatible with type defined in table schema

I use AWS Athena to query some data stored in S3, namely partitioned parquet files with pyarrow compression. I have three columns with string values, one column called "key" with int values and one column called "result" which have both double and…
Sarathy Velmurugan • 123
8 votes • 2 answers

PySpark pandas_udfs java.lang.IllegalArgumentException error

Does anyone have experience using pandas UDFs on a local pyspark session running on Windows? I've used them on linux with good results, but I've been unsuccessful on my Windows…
Matt • 83
8 votes • 3 answers

No module named 'pyarrow._orc'

I have a problem using pyarrow.orc module in Anaconda on Windows 10. import pyarrow.orc as orc throws an exception: Traceback (most recent call last): File "", line 1, in File…
rwiatr • 83
8 votes • 3 answers

How to assign arbitrary metadata to pyarrow.Table / Parquet columns

Use-case I am using Apache Parquet files as a fast IO format for large-ish spatial data that I am working on in Python with GeoPandas. I am storing feature geometries as WKB and would like to record the coordinate reference system (CRS) as metadata…
d.arcy • 83
8 votes • 2 answers

Pyarrow apply schema when using pandas to_parquet()

I have a very wide data frame (20,000 columns) that is mainly made up of float64 columns in Pandas. I want to cast these columns to float32 and write to Parquet format. I am doing this because the downstream users of these files are small containers…
warwickh • 189
8 votes • 1 answer

Pandas Dataframe Parquet Data Types?

I am trying to use Pandas and PyArrow to write data to Parquet. I have hundreds of parquet files that don't need to have the same schema, but if columns match across parquets they must have the same data type. I'm getting into situations where the resulting…
micah • 7,596
8 votes • 1 answer

RuntimeError: Unsupported type in conversion to Arrow: VectorUDT

I want to convert a big Spark data frame with more than 1,000,000 rows to Pandas. I tried to convert a Spark data frame to a Pandas data frame using the following code: spark.conf.set("spark.sql.execution.arrow.enabled", "true") result.toPandas() But,…
Saeid SOHEILY KHAH • 747
8 votes • 2 answers

pandas to_parquet fails on large datasets

I'm trying to save a very large dataset using pandas to_parquet, and it seems to fail when exceeding a certain limit, both with 'pyarrow' and 'fastparquet'. I reproduced the errors I am getting with the following code, and would be happy to hear…
kenissur • 171
8 votes • 1 answer

Assign schema to pa.Table.from_pandas()

I'm getting this error when transforming a pandas.DataFrame to Parquet using PyArrow: ArrowInvalid('Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer'). To find out which column is the…
Carlos P Ceballos • 384
7 votes • 1 answer

How to store custom Parquet Dataset metadata with pyarrow?

How do I store custom metadata to a ParquetDataset using pyarrow? For example, if I create a Parquet dataset using Dask import dask dask.datasets.timeseries().to_parquet('temp.parq') I can then read it using pyarrow import pyarrow.parquet as…
Dahn • 1,397
7 votes • 3 answers

"Could NOT find Arrow" error when using pip_pypy3 to install pyarrow

I am trying to use pypy3 to install pyarrow, but some errors occur. Basic information is below: macOS 10.15.7, Xcode 12.3, Python 3.7.9, pypy3 7.3.3, pyarrow 0.17.1; the command is 'pip_pypy3 install pyarrow==0.17.1'. Some key information…
Long.zhao • 1,085