Questions tagged [pyarrow]

pyarrow is the Python interface for Apache Arrow.

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.


1078 questions
12 votes · 2 answers

Pandas DataFrame with categorical columns from a Parquet file using read_parquet?

I am converting large CSV files into Parquet files for further analysis. I read in the CSV data into Pandas and specify the column dtypes as follows _dtype = {"column_1": "float64", "column_2": "category", "column_3": "int64", …
davidrpugh
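
A minimal sketch of how this typically resolves with recent pandas/pyarrow: pandas categoricals map to Arrow dictionary-encoded columns and round-trip automatically, because pandas stores its dtype metadata in the Parquet file. The file name and values here are illustrative:

```python
import pandas as pd

# Categorical dtypes survive a Parquet round trip with the pyarrow engine.
df = pd.DataFrame({
    "column_1": [1.0, 2.0],
    "column_2": pd.Series(["a", "b"], dtype="category"),
    "column_3": [1, 2],
})
df.to_parquet("example.parquet", engine="pyarrow")

restored = pd.read_parquet("example.parquet", engine="pyarrow")
print(restored.dtypes)  # column_2 comes back as category
```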
12 votes · 2 answers

pyarrow error: toPandas attempted Arrow optimization

I set pyarrow to true while using a Spark session, but when I run toPandas() it throws the error: "toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this". May I know…
user5768866
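
The error usually means the Arrow-accelerated conversion failed and silent fallback is disabled. A hedged sketch, assuming Spark 2.4-era configuration names (newer releases renamed the flag to spark.sql.execution.arrow.pyspark.enabled):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keep Arrow-accelerated toPandas() on, but allow falling back to the
# plain conversion path instead of raising when Arrow cannot be used.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "true")

pdf = spark.range(10).toPandas()
```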
11 votes · 2 answers

Fastest way to write numpy array in arrow format

I'm looking for fast ways to store and retrieve a numpy array using pyarrow. I'm pretty satisfied with retrieval: it takes less than 1 second to extract columns from my .arrow file that contains 1,000,000,000 integers of dtype = np.uint16. import…
mathfux
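
A sketch of one fast path, assuming a primitive dtype: wrapping the array with pa.array() is zero-copy, so writing reduces to streaming the buffer through the Arrow IPC file writer. Names are illustrative:

```python
import numpy as np
import pyarrow as pa

arr = np.arange(1_000_000, dtype=np.uint16)

# pa.array() over a contiguous primitive numpy array does not copy data.
table = pa.table({"col": pa.array(arr)})

with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)
```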
11 votes · 2 answers

Failed building wheel for pyarrow

I am trying to pip install Superset (pip install apache-superset) and getting the error below: Traceback (most recent call last): File "c:\users\saurav_nimesh\appdata\local\programs\python\python38\lib\runpy.py", line 193, in _run_module_as_main …
saurav nimesh
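
This failure is usually pip falling back to a source build because no prebuilt pyarrow wheel matches the interpreter. A hedged sketch of the common remedy, with the shell commands shown as comments:

```python
# Usually fixed at the shell rather than in Python: upgrade pip so it can
# select a prebuilt pyarrow wheel instead of compiling from source
# (assumes a Python version for which wheels are published):
#
#   python -m pip install --upgrade pip setuptools wheel
#   python -m pip install pyarrow
#
import pyarrow  # the import succeeds once a binary wheel is installed
print(pyarrow.__version__)
```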
11 votes · 3 answers

Fastest way to iterate Pyarrow Table

I am using the Pyarrow library for optimal storage of a Pandas DataFrame. I need to process a pyarrow Table row by row as fast as possible without converting it to a pandas DataFrame (it won't fit in memory). Pandas has iterrows()/itertuples() methods. Is…
Alexandr Proskurin
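
One common pattern, sketched under the assumption that per-row Python objects are acceptable: stream record batches and zip their columns into row tuples, never materializing a pandas DataFrame:

```python
import pyarrow as pa

table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Walk the table in bounded record batches so memory stays flat.
for batch in table.to_batches(max_chunksize=64_000):
    columns = [col.to_pylist() for col in batch.columns]
    for row in zip(*columns):
        pass  # process one (a, b) tuple at a time
```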
11 votes · 3 answers

How to write a partitioned Parquet file using Pandas

I'm trying to write a Pandas dataframe to a partitioned file: df.to_parquet('output.parquet', engine='pyarrow', partition_cols = ['partone', 'partwo']) TypeError: __cinit__() got an unexpected keyword argument 'partition_cols' From the…
Ivan
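
A sketch of the call once versions line up; partition_cols needs pandas >= 0.24 together with a matching pyarrow, and older combinations raise exactly this TypeError. Note the output path becomes a directory tree:

```python
import pandas as pd

df = pd.DataFrame({
    "partone": ["a", "a", "b"],
    "partwo": [1, 2, 1],
    "value": [0.1, 0.2, 0.3],
})

# Writes output_dir/partone=a/partwo=1/... in hive-style layout.
df.to_parquet("output_dir", engine="pyarrow",
              partition_cols=["partone", "partwo"])
```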
11 votes · 2 answers

How to read parquet file with a condition using pyarrow in Python

I have created a parquet file with three columns (id, author, title) from a database and want to read the parquet file with a condition (title = 'Learn Python'). Below is the Python code I am using for this POC. import pyarrow as…
SSingh
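
A hedged sketch using the filters argument, which on recent pyarrow versions pushes the predicate down to row groups (and partitions); the file and column names mirror the question:

```python
import pyarrow.parquet as pq

# Only row groups that can contain the matching title are read.
table = pq.read_table(
    "books.parquet",
    columns=["id", "author", "title"],
    filters=[("title", "=", "Learn Python")],
)
df = table.to_pandas()
```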
11 votes · 2 answers

Reading specific partitions from a partitioned parquet dataset with pyarrow

I have a somewhat large (~20 GB) partitioned dataset in parquet format. I would like to read specific partitions from the dataset using pyarrow. I thought I could accomplish this with pyarrow.parquet.ParquetDataset, but that doesn't seem to be the…
suvayu
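
A sketch with the newer pyarrow.dataset API, which handles this more directly than ParquetDataset; the partition key "date" is hypothetical:

```python
import pyarrow.dataset as ds

# Discover the hive-partitioned layout, then read only matching partitions.
dataset = ds.dataset("path/to/dataset", format="parquet", partitioning="hive")
table = dataset.to_table(filter=ds.field("date") == "2021-01-01")
```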
11 votes · 4 answers

Read Parquet file stored in S3 with AWS Lambda (Python 3)

I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is: https://github.com/lambci/docker-lambda as a container to mock the Amazon environment, because of the native libraries that need to be…
Ptah
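
A minimal sketch using s3fs; the bucket and key are placeholders and the Lambda role is assumed to have read access. In practice the harder part is fitting pyarrow and its native dependencies under Lambda's deployment size limits:

```python
import pyarrow.parquet as pq
import s3fs

# s3fs picks up credentials from the Lambda execution role.
fs = s3fs.S3FileSystem()
table = pq.read_table("my-bucket/path/to/file.parquet", filesystem=fs)
df = table.to_pandas()
```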
10 votes · 3 answers

Different behavior while reading a DataFrame from parquet using the CLI versus an executable in the same environment

Please consider the following program as a Minimal Reproducible Example (MRE): import pandas as pd import pyarrow from pyarrow import parquet def foo(): print(pyarrow.__file__) print('version:', pyarrow.cpp_version) …
ThePyGuy
10 votes · 1 answer

How to use the new Int64 pandas object when saving to a parquet file

I am converting data from CSV to Parquet using Python (Pandas) to later load it into Google BigQuery. I have some integer columns that contain missing values and since Pandas 0.24.0 I can store them as Int64 dtype. Is there a way to use Int64 dtype…
dhafnar
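
A sketch, assuming reasonably recent pandas/pyarrow: the nullable Int64 extension dtype maps to an Arrow int64 column with nulls, and the dtype is restored on read from the stored pandas metadata:

```python
import pandas as pd

# pd.array with dtype="Int64" holds missing values without float upcasting.
df = pd.DataFrame({"n": pd.array([1, None, 3], dtype="Int64")})
df.to_parquet("ints.parquet", engine="pyarrow")

print(pd.read_parquet("ints.parquet")["n"].dtype)  # Int64
```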
10 votes · 1 answer

Unable to load libhdfs when using pyarrow

I'm trying to connect to HDFS through Pyarrow, but it does not work because the libhdfs library cannot be loaded. libhdfs.so is in $HADOOP_HOME/lib/native as well as in $ARROW_LIBHDFS_DIR. print(os.environ['ARROW_LIBHDFS_DIR']) fs =…
Pablo Velasquez
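
A hedged sketch with the newer filesystem API (older pyarrow used pyarrow.hdfs.connect); the directory, host, and port are placeholders:

```python
import os
import pyarrow.fs

# Arrow resolves libhdfs.so via ARROW_LIBHDFS_DIR, so set it before
# connecting; the JVM CLASSPATH must also include the Hadoop jars
# (typically the output of `hadoop classpath --glob`).
os.environ["ARROW_LIBHDFS_DIR"] = "/opt/hadoop/lib/native"

hdfs = pyarrow.fs.HadoopFileSystem(host="namenode", port=8020)
```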
10 votes · 6 answers

ModuleNotFoundError: No module named 'pyarrow'

I am trying to run a simple pandas UDF example on my server. From here I have created a fresh environment just for the purpose of running this code. (PySparkEnv) $ conda list # packages in environment at /home/shekhar/.conda/envs/PySparkEnv: # #…
spartacus
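
The usual fix is installing pyarrow into the same environment the Spark workers use (e.g. conda install -c conda-forge pyarrow inside PySparkEnv) and pointing PySpark at that interpreter. A sketch; the path mirrors the question and is illustrative:

```python
import os

# Must be set before the SparkSession is created so workers inherit it.
os.environ["PYSPARK_PYTHON"] = (
    "/home/shekhar/.conda/envs/PySparkEnv/bin/python"
)
```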
9 votes · 1 answer

Handling UUID values in Arrow with Parquet files

I'm new to Python and Pandas - please be gentle! I'm using SqlAlchemy with pymssql to execute a SQL query against a SQL Server database and then convert the result set into a dataframe. I'm then attempting to write this dataframe as a Parquet file: …
Chris Wood
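
A sketch of the common workaround: Arrow has no native UUID type, so uuid.UUID objects fail to convert and are usually cast to strings (or bytes) before writing:

```python
import uuid
import pandas as pd

df = pd.DataFrame({"id": [uuid.uuid4() for _ in range(3)]})

# Cast UUID objects to str so pyarrow sees a plain string column.
df["id"] = df["id"].astype(str)
df.to_parquet("with_uuids.parquet", engine="pyarrow")
```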
9 votes · 4 answers

finding nested columns in pandas dataframe

I have a large dataset with many columns in (compressed) JSON format. I'm trying to convert it to parquet for subsequent processing. Some columns have a nested structure. For now I want to ignore this structure and just write those columns out as a…
Daniel Kats
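
One way to detect the nested columns before deciding how to serialize them, sketched with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({"flat": [1, 2], "nested": [{"a": 1}, {"a": 2}]})

# A column is "nested" if any of its values is a dict or list.
nested_cols = [
    col for col in df.columns
    if df[col].map(lambda v: isinstance(v, (dict, list))).any()
]
print(nested_cols)  # ['nested']
```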