Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1,078 questions
18 votes · 1 answer

How to force parquet dtypes when saving pd.DataFrame?

Is there a way to force a parquet file to encode a pd.DataFrame column as a given type, even though all values for the column are null? The fact that parquet automatically assigns "null" in its schema is preventing me from loading many files into a…
HugoMailhot (1,275)
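A minimal sketch of the usual approach: pass an explicit pyarrow schema to Table.from_pandas so the Parquet type is forced even when every value in a column is null. The column names below are hypothetical stand-ins.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# "maybe_int" is entirely null; without a schema pyarrow would infer "null"
df = pd.DataFrame({"id": [1, 2], "maybe_int": [None, None]})

schema = pa.schema([("id", pa.int64()), ("maybe_int", pa.int64())])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, "forced_types.parquet")
```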
15 votes · 5 answers

Python error using pyarrow - ArrowNotImplementedError: Support for codec 'snappy' not built

Using Python, Parquet, and Spark, I am running into ArrowNotImplementedError: Support for codec 'snappy' not built after upgrading to pyarrow=3.0.0. My previous version without this error was pyarrow=0.17. The error does not appear in pyarrow=1.0.1…
Russell Burdt (2,391)
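A hedged workaround sketch while the environment is being fixed: a pyarrow build without snappy can still write Parquet if the codec is switched off explicitly. The longer-term fix is typically reinstalling a pyarrow build that includes snappy support.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
# default compression is snappy; "NONE" sidesteps the missing codec
pq.write_table(table, "no_snappy.parquet", compression="NONE")
```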
15 votes · 6 answers

import pyarrow not working: "ValueError: The pyarrow library is not installed, please install pyarrow to use the to_arrow() function."

I have tried installing it in the terminal and in JupyterLab, and it says that it has been successfully installed, but when I run df = query_job.to_dataframe() I keep getting the error "ValueError: The pyarrow library is not installed, please…
Sarah Dodamead (171)
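A diagnostic sketch: this error usually means pip installed pyarrow into a different interpreter than the one the notebook kernel runs. Printing the kernel's interpreter and installing against exactly that path makes the mismatch visible; this is a common notebook idiom rather than anything specific to this question.

```python
import sys

# The interpreter this kernel actually runs; pip must target the same one.
print(sys.executable)

# In a notebook cell, install into exactly that interpreter:
#   !{sys.executable} -m pip install pyarrow
```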
15 votes · 2 answers

pandasUDF and pyarrow 0.15.0

I have recently started getting a bunch of errors on a number of pyspark jobs running on EMR clusters. The errors are java.lang.IllegalArgumentException at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) at…
ilijaluve (1,050)
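A sketch of the workaround documented in the Spark 2.x migration notes for exactly this symptom: pyarrow 0.15 changed the Arrow IPC stream format, and older Spark versions need the flag below set in the environment (e.g. in spark-env.sh, before executors start).

```python
import os

# Tell Arrow >= 0.15 to keep emitting the pre-0.15 IPC stream format
# that Spark 2.x executors expect.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```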
14 votes · 5 answers

pandas df.to_parquet write to multiple smaller files

Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size? I have a very large DataFrame (100M x 100), and am using df.to_parquet('data.snappy', engine='pyarrow',…
Austin (6,921)
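DataFrame.to_parquet itself writes a single file, so a minimal sketch is to chunk by row count, treating rows-per-file as a stand-in for the desired size; the function name below is hypothetical. pyarrow.parquet.write_to_dataset is an alternative when partitioning by column values is acceptable.

```python
import pandas as pd

def to_parquet_chunks(df: pd.DataFrame, rows_per_file: int, stem: str) -> None:
    """Write df as a series of smaller Parquet files of ~rows_per_file rows."""
    for i, start in enumerate(range(0, len(df), rows_per_file)):
        chunk = df.iloc[start:start + rows_per_file]
        chunk.to_parquet(f"{stem}_{i:04d}.snappy.parquet", engine="pyarrow")
```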
14 votes · 2 answers

How to save a pandas DataFrame with custom types using pyarrow and parquet

I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds). Throughout the examples we use: import pandas as pd import pyarrow as pa Here's a minimal example to show the situation: df =…
Silver Duck (581)
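Parquet has no native ObjectId type, so a hedged sketch of the common approach: serialize the unsupported column to strings (or bytes) before writing and re-wrap it on load. A plain object() stands in for bson.ObjectId here.

```python
import pandas as pd

# hypothetical stand-ins for bson.ObjectId values
df = pd.DataFrame({"oid": [object(), object()], "x": [1, 2]})

df["oid"] = df["oid"].astype(str)   # an ObjectId would become its hex string
df.to_parquet("with_custom_types.parquet", engine="pyarrow")
# on load, something like df["oid"].map(ObjectId) restores the original type
```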
14 votes · 2 answers

Memory leaks when using pandas_udf and Parquet serialization?

I am currently developing my first whole system using PySpark and I am running into some strange, memory-related issues. In one of the stages, I would like to follow a split-apply-combine strategy in order to modify a DataFrame. That is, I would…
Fernandez (259)
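For reference, a minimal sketch of the split-apply-combine pattern the question describes, using applyInPandas (Spark 3's successor to the GROUPED_MAP pandas_udf); the column names are hypothetical.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    pd.DataFrame({"key": [1, 1, 2], "val": [1.0, 2.0, 4.0]}))

def center(pdf: pd.DataFrame) -> pd.DataFrame:
    # "apply" step: runs once per key group as a plain pandas DataFrame
    pdf["val"] = pdf["val"] - pdf["val"].mean()
    return pdf

out = sdf.groupBy("key").applyInPandas(center, schema=sdf.schema)
out.show()
```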
13 votes · 1 answer

ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column X with type int32')

Problem I am trying to save a data frame as a parquet file on Databricks, getting the ArrowTypeError. Databricks Runtime Version: 7.6 ML (includes Apache Spark 3.0.1, Scala 2.12) Log Trace ArrowTypeError: ('Did not pass numpy.dtype object',…
Naga Budigam (689)
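A hedged sketch of one common trigger: pandas nullable extension dtypes (e.g. "Int32") that an older pyarrow cannot map to a numpy dtype. Casting to a plain numpy dtype before writing is a typical workaround; aligning the pandas and pyarrow versions on the cluster is the other usual fix.

```python
import pandas as pd

df = pd.DataFrame({"x": pd.array([1, 2, None], dtype="Int32")})

# plain numpy dtype: nulls become NaN, but the Arrow conversion succeeds
df["x"] = df["x"].astype("float64")
df.to_parquet("no_arrow_type_error.parquet", engine="pyarrow")
```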
13 votes · 2 answers

Pytest mocker patch - how to troubleshoot?

I am having what I believe to be a common issue in using mock patching, in that I cannot figure out the right thing to patch. I have two questions that I am hoping for help with: thoughts on how to fix the specific issue in the below example, and…
user9074332 (2,336)
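The usual troubleshooting rule is to patch the name where it is looked up, not where it is defined. A minimal self-contained sketch, with json.dumps standing in for whatever the code under test calls:

```python
import json
from unittest import mock  # pytest-mock's `mocker.patch` wraps this same API

def serialize(obj):
    return json.dumps(obj)

def test_serialize():
    # patch the attribute the function actually resolves at call time
    with mock.patch("json.dumps", return_value="patched") as fake:
        assert serialize({}) == "patched"
        fake.assert_called_once_with({})
```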
13 votes · 4 answers

How to read feather/arrow file natively?

I have a feather format file sales.feather that I am using for exchanging data between Python and R. In R I use the following command: df = arrow::read_feather("sales.feather", as_data_frame=TRUE) In Python I used this: df =…
jangorecki (16,384)
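On the Python side, a minimal sketch of reading the same file natively with pyarrow, either keeping the data in Arrow memory or converting it to pandas:

```python
import pyarrow.feather as feather

table = feather.read_table("sales.feather")   # stays an Arrow table
df = feather.read_feather("sales.feather")    # converts to a pandas DataFrame
```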
13 votes · 1 answer

writing pandas dataframe with timedeltas to parquet

I can't seem to write a pandas dataframe containing timedeltas to a parquet file through pyarrow. The pyarrow documentation specifies that it can handle numpy timedelta64 with ms precision. However, when I build a dataframe from numpy's…
Swier (4,047)
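A hedged workaround sketch for when a given pyarrow/Parquet combination rejects timedelta columns: store the underlying int64 nanosecond counts and rebuild the timedeltas on load, which round-trips losslessly.

```python
import pandas as pd

df = pd.DataFrame({"dt": pd.to_timedelta([1, 2, 3], unit="s")})

df["dt"] = df["dt"].astype("int64")   # nanosecond counts as plain integers
df.to_parquet("timedeltas.parquet", engine="pyarrow")

restored = pd.read_parquet("timedeltas.parquet")
restored["dt"] = pd.to_timedelta(restored["dt"], unit="ns")
```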
13 votes · 1 answer

Writing parquet files from Python without pandas

I need to transform data from JSON to parquet as a part of an ETL pipeline. I'm currently doing it with the from_pandas method of a pyarrow.Table. However, building a dataframe first feels like an unnecessary step, plus I'd like to avoid having…
Milan Cermak (7,476)
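A minimal sketch of skipping pandas entirely: pyarrow can build a table straight from a list of dicts via Table.from_pylist (pyarrow >= 7.0; older versions can assemble a column dict and call pa.table instead).

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

records = json.loads('[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]')

table = pa.Table.from_pylist(records)   # list of dicts -> Arrow table
pq.write_table(table, "from_json.parquet")
```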
13 votes · 4 answers

How to save a huge pandas dataframe to hdfs?

I'm working with pandas and with spark dataframes. The dataframes are always very big (> 20 GB) and the standard spark functions are not sufficient for those sizes. Currently I'm converting my pandas dataframe to a spark dataframe like this: dataframe…
Mulgard (9,877)
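A hedged sketch of an alternative route: write Parquet straight to HDFS through pyarrow's Hadoop filesystem, skipping the pandas-to-Spark conversion entirely. The host, port, and path below are hypothetical, and the Hadoop client libraries must be available on the machine.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

df = pd.DataFrame({"x": range(10)})   # stands in for the > 20 GB frame

hdfs = fs.HadoopFileSystem(host="namenode", port=8020)  # hypothetical cluster
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "/data/out.parquet", filesystem=hdfs)
```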
12 votes · 3 answers

Can't install pyarrow on OSX / Python 3.9: is this me or an incompatible package?

I'm trying to install pyarrow with pip3 on OSX 11.0.1, and getting error messages. I'm using Python 3.9 and not sure if that is the problem. Here is the error summary: ERROR: Command errored out with exit status 1: command:…
Richard (62,943)
12 votes · 2 answers

Fastest way to construct pyarrow table row by row

I have a large dictionary that I want to iterate through to build a pyarrow table. The values of the dictionary are tuples of varying types and need to be unpacked and stored in separate columns in the final pyarrow table. I do know the schema ahead…
Josh W. (1,123)
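A minimal sketch of the usual advice: rather than appending row by row, accumulate per-column Python lists while iterating the dict, then build the table in one shot. The dict layout and schema below are hypothetical stand-ins for the question's.

```python
import pyarrow as pa

data = {"k1": (1, "a"), "k2": (2, "b")}
schema = pa.schema([("key", pa.string()), ("num", pa.int64()), ("tag", pa.string())])

# unpack each tuple into per-column buffers, then convert once
cols = {name: [] for name in schema.names}
for key, (num, tag) in data.items():
    cols["key"].append(key)
    cols["num"].append(num)
    cols["tag"].append(tag)

table = pa.Table.from_pydict(cols, schema=schema)
```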