Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1,078 questions
18 votes · 1 answer

How to force parquet dtypes when saving pd.DataFrame?

Is there a way to force a parquet file to encode a pd.DataFrame column as a given type, even though all values for the column are null? The fact that parquet automatically assigns "null" in its schema is preventing me from loading many files into a…
HugoMailhot (1,275)
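A minimal sketch of the usual approach: pass an explicit pyarrow schema to Table.from_pandas so the Parquet type is forced even when every value in a column is null. The column names below are hypothetical stand-ins.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# "maybe_int" is entirely null; without a schema pyarrow would infer "null"
df = pd.DataFrame({"id": [1, 2], "maybe_int": [None, None]})

schema = pa.schema([("id", pa.int64()), ("maybe_int", pa.int64())])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, "forced_types.parquet")
```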
15 votes · 5 answers

Python error using pyarrow - ArrowNotImplementedError: Support for codec 'snappy' not built

Using Python, Parquet, and Spark, I am running into ArrowNotImplementedError: Support for codec 'snappy' not built after upgrading to pyarrow=3.0.0. My previous version without this error was pyarrow=0.17. The error does not appear in pyarrow=1.0.1…
Russell Burdt (2,391)
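A hedged workaround sketch while the environment is being fixed: a pyarrow build without snappy can still write Parquet if the codec is switched off explicitly. The longer-term fix is typically reinstalling a pyarrow build that includes snappy support.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
# default compression is snappy; "NONE" sidesteps the missing codec
pq.write_table(table, "no_snappy.parquet", compression="NONE")
```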
15 votes · 6 answers

import pyarrow not working: "ValueError: The pyarrow library is not installed, please install pyarrow to use the to_arrow() function."

I have tried installing it in the terminal and in JupyterLab, and it says that it has been successfully installed, but when I run df = query_job.to_dataframe() I keep getting the error "ValueError: The pyarrow library is not installed, please…
Sarah Dodamead (171)
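A diagnostic sketch: this error usually means pip installed pyarrow into a different interpreter than the one the notebook kernel runs. Printing the kernel's interpreter and installing against exactly that path makes the mismatch visible; this is a common notebook idiom rather than anything specific to this question.

```python
import sys

# The interpreter this kernel actually runs; pip must target the same one.
print(sys.executable)

# In a notebook cell, install into exactly that interpreter:
#   !{sys.executable} -m pip install pyarrow
```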
15 votes · 2 answers

pandasUDF and pyarrow 0.15.0

I have recently started getting a bunch of errors on a number of pyspark jobs running on EMR clusters. The errors are java.lang.IllegalArgumentException at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) at…
ilijaluve (1,050)
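A sketch of the workaround documented in the Spark 2.x migration notes for exactly this symptom: pyarrow 0.15 changed the Arrow IPC stream format, and older Spark versions need the flag below set in the environment (e.g. in spark-env.sh, before executors start).

```python
import os

# Tell Arrow >= 0.15 to keep emitting the pre-0.15 IPC stream format
# that Spark 2.x executors expect.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```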
14 votes · 5 answers

pandas df.to_parquet write to multiple smaller files

Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size? I have a very large DataFrame (100M x 100), and am using df.to_parquet('data.snappy', engine='pyarrow',…
Austin (6,921)
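DataFrame.to_parquet itself writes a single file, so a minimal sketch is to chunk by row count, treating rows-per-file as a stand-in for the desired size; the function name below is hypothetical. pyarrow.parquet.write_to_dataset is an alternative when partitioning by column values is acceptable.

```python
import pandas as pd

def to_parquet_chunks(df: pd.DataFrame, rows_per_file: int, stem: str) -> None:
    """Write df as a series of smaller Parquet files of ~rows_per_file rows."""
    for i, start in enumerate(range(0, len(df), rows_per_file)):
        chunk = df.iloc[start:start + rows_per_file]
        chunk.to_parquet(f"{stem}_{i:04d}.snappy.parquet", engine="pyarrow")
```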
14 votes · 2 answers

How to save a pandas DataFrame with custom types using pyarrow and parquet

I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds). Throughout the examples we use: import pandas as pd import pyarrow as pa Here's a minimal example to show the situation: df =…
Silver Duck (581)
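Parquet has no native ObjectId type, so a hedged sketch of the common approach: serialize the unsupported column to strings (or bytes) before writing and re-wrap it on load. A plain object() stands in for bson.ObjectId here.

```python
import pandas as pd

# hypothetical stand-ins for bson.ObjectId values
df = pd.DataFrame({"oid": [object(), object()], "x": [1, 2]})

df["oid"] = df["oid"].astype(str)   # an ObjectId would become its hex string
df.to_parquet("with_custom_types.parquet", engine="pyarrow")
# on load, something like df["oid"].map(ObjectId) restores the original type
```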
14 votes · 2 answers

Memory leaks when using pandas_udf and Parquet serialization?

I am currently developing my first whole system using PySpark and I am running into some strange, memory-related issues. In one of the stages, I would like to follow a split-apply-combine strategy in order to modify a DataFrame. That is, I would…
Fernandez (259)
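For reference, a minimal sketch of the split-apply-combine pattern the question describes, using applyInPandas (Spark 3's successor to the GROUPED_MAP pandas_udf); the column names are hypothetical.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    pd.DataFrame({"key": [1, 1, 2], "val": [1.0, 2.0, 4.0]}))

def center(pdf: pd.DataFrame) -> pd.DataFrame:
    # "apply" step: runs once per key group as a plain pandas DataFrame
    pdf["val"] = pdf["val"] - pdf["val"].mean()
    return pdf

out = sdf.groupBy("key").applyInPandas(center, schema=sdf.schema)
out.show()
```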
13 votes · 1 answer

ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column X with type int32')

Problem I am trying to save a data frame as a parquet file on Databricks, getting the ArrowTypeError. Databricks Runtime Version: 7.6 ML (includes Apache Spark 3.0.1, Scala 2.12) Log Trace ArrowTypeError: ('Did not pass numpy.dtype object',…
Naga Budigam (689)
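A hedged sketch of one common trigger: pandas nullable extension dtypes (e.g. "Int32") that an older pyarrow cannot map to a numpy dtype. Casting to a plain numpy dtype before writing is a typical workaround; aligning the pandas and pyarrow versions on the cluster is the other usual fix.

```python
import pandas as pd

df = pd.DataFrame({"x": pd.array([1, 2, None], dtype="Int32")})

# plain numpy dtype: nulls become NaN, but the Arrow conversion succeeds
df["x"] = df["x"].astype("float64")
df.to_parquet("no_arrow_type_error.parquet", engine="pyarrow")
```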
13 votes · 2 answers

Pytest mocker patch - how to troubleshoot?

I am having what I believe to be a common issue in using mock patching, in that I cannot figure out the right thing to patch. I have two questions that I am hoping for help with: thoughts on how to fix the specific issue in the below example, and…
user9074332 (2,336)
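The usual troubleshooting rule is to patch the name where it is looked up, not where it is defined. A minimal self-contained sketch, with json.dumps standing in for whatever the code under test calls:

```python
import json
from unittest import mock  # pytest-mock's `mocker.patch` wraps this same API

def serialize(obj):
    return json.dumps(obj)

def test_serialize():
    # patch the attribute the function actually resolves at call time
    with mock.patch("json.dumps", return_value="patched") as fake:
        assert serialize({}) == "patched"
        fake.assert_called_once_with({})
```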
13 votes · 4 answers

How to read feather/arrow file natively?

I have a feather format file sales.feather that I am using for exchanging data between Python and R. In R I use the following command: df = arrow::read_feather("sales.feather", as_data_frame=TRUE) In Python I used this: df =…
jangorecki (16,384)
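On the Python side, a minimal sketch of reading the same file natively with pyarrow, either keeping the data in Arrow memory or converting it to pandas:

```python
import pyarrow.feather as feather

table = feather.read_table("sales.feather")   # stays an Arrow table
df = feather.read_feather("sales.feather")    # converts to a pandas DataFrame
```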
13 votes · 1 answer

writing pandas dataframe with timedeltas to parquet

I can't seem to write a pandas dataframe containing timedeltas to a parquet file through pyarrow. The pyarrow documentation specifies that it can handle numpy timedelta64 with ms precision. However, when I build a dataframe from numpy's…
Swier (4,047)
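A hedged workaround sketch for when a given pyarrow/Parquet combination rejects timedelta columns: store the underlying int64 nanosecond counts and rebuild the timedeltas on load, which round-trips losslessly.

```python
import pandas as pd

df = pd.DataFrame({"dt": pd.to_timedelta([1, 2, 3], unit="s")})

df["dt"] = df["dt"].astype("int64")   # nanosecond counts as plain integers
df.to_parquet("timedeltas.parquet", engine="pyarrow")

restored = pd.read_parquet("timedeltas.parquet")
restored["dt"] = pd.to_timedelta(restored["dt"], unit="ns")
```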
13 votes · 1 answer

Writing parquet files from Python without pandas

I need to transform data from JSON to parquet as a part of an ETL pipeline. I'm currently doing it with the from_pandas method of a pyarrow.Table. However, building a dataframe first feels like an unnecessary step, plus I'd like to avoid having…
Milan Cermak (7,476)
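A minimal sketch of skipping pandas entirely: pyarrow can build a table straight from a list of dicts via Table.from_pylist (pyarrow >= 7.0; older versions can assemble a column dict and call pa.table instead).

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

records = json.loads('[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]')

table = pa.Table.from_pylist(records)   # list of dicts -> Arrow table
pq.write_table(table, "from_json.parquet")
```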
13 votes · 4 answers

How to save a huge pandas dataframe to hdfs?

I'm working with pandas and with spark dataframes. The dataframes are always very big (> 20 GB) and the standard spark functions are not sufficient for those sizes. Currently I'm converting my pandas dataframe to a spark dataframe like this: dataframe…
Mulgard (9,877)
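A hedged sketch of an alternative route: write Parquet straight to HDFS through pyarrow's Hadoop filesystem, skipping the pandas-to-Spark conversion entirely. The host, port, and path below are hypothetical, and the Hadoop client libraries must be available on the machine.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

df = pd.DataFrame({"x": range(10)})   # stands in for the > 20 GB frame

hdfs = fs.HadoopFileSystem(host="namenode", port=8020)  # hypothetical cluster
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "/data/out.parquet", filesystem=hdfs)
```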
12 votes · 3 answers

Can't install pyarrow on OSX / Python 3.9: is this me or an incompatible package?

I'm trying to install pyarrow with pip3 on OSX 11.0.1, and getting error messages. I'm using Python 3.9 and not sure if that is the problem. Here is the error summary: ERROR: Command errored out with exit status 1: command:…
Richard (62,943)
12 votes · 2 answers

Fastest way to construct pyarrow table row by row

I have a large dictionary that I want to iterate through to build a pyarrow table. The values of the dictionary are tuples of varying types and need to be unpacked and stored in separate columns in the final pyarrow table. I do know the schema ahead…
Josh W. (1,123)
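A minimal sketch of the usual advice: rather than appending row by row, accumulate per-column Python lists while iterating the dict, then build the table in one shot. The dict layout and schema below are hypothetical stand-ins for the question's.

```python
import pyarrow as pa

data = {"k1": (1, "a"), "k2": (2, "b")}
schema = pa.schema([("key", pa.string()), ("num", pa.int64()), ("tag", pa.string())])

# unpack each tuple into per-column buffers, then convert once
cols = {name: [] for name in schema.names}
for key, (num, tag) in data.items():
    cols["key"].append(key)
    cols["num"].append(num)
    cols["tag"].append(tag)

table = pa.Table.from_pydict(cols, schema=schema)
```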