Questions tagged [pyarrow]

pyarrow is the Python interface for Apache Arrow.

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.


1078 questions
12 votes · 2 answers

Pandas DataFrame with categorical columns from a Parquet file using read_parquet?

I am converting large CSV files into Parquet files for further analysis. I read in the CSV data into Pandas and specify the column dtypes as follows _dtype = {"column_1": "float64", "column_2": "category", "column_3": "int64", …
davidrpugh
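
A minimal sketch of how this typically resolves with recent pandas/pyarrow: pandas categoricals map to Arrow dictionary-encoded columns and round-trip automatically, because pandas stores its dtype metadata in the Parquet file. The file name and values here are illustrative:

```python
import pandas as pd

# Categorical dtypes survive a Parquet round trip with the pyarrow engine.
df = pd.DataFrame({
    "column_1": [1.0, 2.0],
    "column_2": pd.Series(["a", "b"], dtype="category"),
    "column_3": [1, 2],
})
df.to_parquet("example.parquet", engine="pyarrow")

restored = pd.read_parquet("example.parquet", engine="pyarrow")
print(restored.dtypes)  # column_2 comes back as category
```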
12 votes · 2 answers

pyarrow error: toPandas attempted Arrow optimization

I set pyarrow to true while using a Spark session, but when I run toPandas() it throws the error: "toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this". May I know…
user5768866
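
The error usually means the Arrow-accelerated conversion failed and silent fallback is disabled. A hedged sketch, assuming Spark 2.4-era configuration names (newer releases renamed the flag to spark.sql.execution.arrow.pyspark.enabled):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keep Arrow-accelerated toPandas() on, but allow falling back to the
# plain conversion path instead of raising when Arrow cannot be used.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "true")

pdf = spark.range(10).toPandas()
```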
11 votes · 2 answers

Fastest way to write numpy array in arrow format

I'm looking for fast ways to store and retrieve a numpy array using pyarrow. I'm pretty satisfied with retrieval: it takes less than 1 second to extract columns from my .arrow file that contains 1,000,000,000 integers of dtype = np.uint16. import…
mathfux
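
A sketch of one fast path, assuming a primitive dtype: wrapping the array with pa.array() is zero-copy, so writing reduces to streaming the buffer through the Arrow IPC file writer. Names are illustrative:

```python
import numpy as np
import pyarrow as pa

arr = np.arange(1_000_000, dtype=np.uint16)

# pa.array() over a contiguous primitive numpy array does not copy data.
table = pa.table({"col": pa.array(arr)})

with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)
```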
11 votes · 2 answers

Failed building wheel for pyarrow

I am trying to pip install Superset (pip install apache-superset) and getting the error below: Traceback (most recent call last): File "c:\users\saurav_nimesh\appdata\local\programs\python\python38\lib\runpy.py", line 193, in _run_module_as_main …
saurav nimesh
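
This failure is usually pip falling back to a source build because no prebuilt pyarrow wheel matches the interpreter. A hedged sketch of the common remedy, with the shell commands shown as comments:

```python
# Usually fixed at the shell rather than in Python: upgrade pip so it can
# select a prebuilt pyarrow wheel instead of compiling from source
# (assumes a Python version for which wheels are published):
#
#   python -m pip install --upgrade pip setuptools wheel
#   python -m pip install pyarrow
#
import pyarrow  # the import succeeds once a binary wheel is installed
print(pyarrow.__version__)
```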
11 votes · 3 answers

Fastest way to iterate Pyarrow Table

I am using the Pyarrow library for optimal storage of a Pandas DataFrame. I need to process a pyarrow Table row by row as fast as possible without converting it to a pandas DataFrame (it won't fit in memory). Pandas has iterrows()/itertuples() methods. Is…
Alexandr Proskurin
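
One common pattern, sketched under the assumption that per-row Python objects are acceptable: stream record batches and zip their columns into row tuples, never materializing a pandas DataFrame:

```python
import pyarrow as pa

table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Walk the table in bounded record batches so memory stays flat.
for batch in table.to_batches(max_chunksize=64_000):
    columns = [col.to_pylist() for col in batch.columns]
    for row in zip(*columns):
        pass  # process one (a, b) tuple at a time
```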
11 votes · 3 answers

How to write a partitioned Parquet file using Pandas

I'm trying to write a Pandas dataframe to a partitioned file: df.to_parquet('output.parquet', engine='pyarrow', partition_cols = ['partone', 'partwo']) TypeError: __cinit__() got an unexpected keyword argument 'partition_cols' From the…
Ivan
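
A sketch of the call once versions line up; partition_cols needs pandas >= 0.24 together with a matching pyarrow, and older combinations raise exactly this TypeError. Note the output path becomes a directory tree:

```python
import pandas as pd

df = pd.DataFrame({
    "partone": ["a", "a", "b"],
    "partwo": [1, 2, 1],
    "value": [0.1, 0.2, 0.3],
})

# Writes output_dir/partone=a/partwo=1/... in hive-style layout.
df.to_parquet("output_dir", engine="pyarrow",
              partition_cols=["partone", "partwo"])
```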
11 votes · 2 answers

How to read parquet file with a condition using pyarrow in Python

I have created a parquet file with three columns (id, author, title) from a database and want to read the parquet file with a condition (title = 'Learn Python'). Below is the Python code I am using for this POC. import pyarrow as…
SSingh
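
A hedged sketch using the filters argument, which on recent pyarrow versions pushes the predicate down to row groups (and partitions); the file and column names mirror the question:

```python
import pyarrow.parquet as pq

# Only row groups that can contain the matching title are read.
table = pq.read_table(
    "books.parquet",
    columns=["id", "author", "title"],
    filters=[("title", "=", "Learn Python")],
)
df = table.to_pandas()
```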
11 votes · 2 answers

Reading specific partitions from a partitioned parquet dataset with pyarrow

I have a somewhat large (~20 GB) partitioned dataset in parquet format. I would like to read specific partitions from the dataset using pyarrow. I thought I could accomplish this with pyarrow.parquet.ParquetDataset, but that doesn't seem to be the…
suvayu
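
A sketch with the newer pyarrow.dataset API, which handles this more directly than ParquetDataset; the partition key "date" is hypothetical:

```python
import pyarrow.dataset as ds

# Discover the hive-partitioned layout, then read only matching partitions.
dataset = ds.dataset("path/to/dataset", format="parquet", partitioning="hive")
table = dataset.to_table(filter=ds.field("date") == "2021-01-01")
```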
11 votes · 4 answers

Read Parquet file stored in S3 with AWS Lambda (Python 3)

I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is: https://github.com/lambci/docker-lambda as a container to mock the Amazon environment, because of the native libraries that need to be…
Ptah
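
A minimal sketch using s3fs; the bucket and key are placeholders and the Lambda role is assumed to have read access. In practice the harder part is fitting pyarrow and its native dependencies under Lambda's deployment size limits:

```python
import pyarrow.parquet as pq
import s3fs

# s3fs picks up credentials from the Lambda execution role.
fs = s3fs.S3FileSystem()
table = pq.read_table("my-bucket/path/to/file.parquet", filesystem=fs)
df = table.to_pandas()
```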
10 votes · 3 answers

Different behavior while reading a DataFrame from parquet using the CLI versus an executable in the same environment

Please consider the following program as a Minimal Reproducible Example (MRE): import pandas as pd import pyarrow from pyarrow import parquet def foo(): print(pyarrow.__file__) print('version:', pyarrow.cpp_version) …
ThePyGuy
10 votes · 1 answer

How to use the new Int64 pandas object when saving to a parquet file

I am converting data from CSV to Parquet using Python (Pandas) to later load it into Google BigQuery. I have some integer columns that contain missing values and since Pandas 0.24.0 I can store them as Int64 dtype. Is there a way to use Int64 dtype…
dhafnar
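
A sketch, assuming reasonably recent pandas/pyarrow: the nullable Int64 extension dtype maps to an Arrow int64 column with nulls, and the dtype is restored on read from the stored pandas metadata:

```python
import pandas as pd

# pd.array with dtype="Int64" holds missing values without float upcasting.
df = pd.DataFrame({"n": pd.array([1, None, 3], dtype="Int64")})
df.to_parquet("ints.parquet", engine="pyarrow")

print(pd.read_parquet("ints.parquet")["n"].dtype)  # Int64
```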
10 votes · 1 answer

Unable to load libhdfs when using pyarrow

I'm trying to connect to HDFS through Pyarrow, but it does not work because the libhdfs library cannot be loaded. libhdfs.so is in $HADOOP_HOME/lib/native as well as in $ARROW_LIBHDFS_DIR. print(os.environ['ARROW_LIBHDFS_DIR']) fs =…
Pablo Velasquez
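
A hedged sketch with the newer filesystem API (older pyarrow used pyarrow.hdfs.connect); the directory, host, and port are placeholders:

```python
import os
import pyarrow.fs

# Arrow resolves libhdfs.so via ARROW_LIBHDFS_DIR, so set it before
# connecting; the JVM CLASSPATH must also include the Hadoop jars
# (typically the output of `hadoop classpath --glob`).
os.environ["ARROW_LIBHDFS_DIR"] = "/opt/hadoop/lib/native"

hdfs = pyarrow.fs.HadoopFileSystem(host="namenode", port=8020)
```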
10 votes · 6 answers

ModuleNotFoundError: No module named 'pyarrow'

I am trying to run a simple pandas UDF example on my server. From here I have created a fresh environment just for the purpose of running this code. (PySparkEnv) $ conda list # packages in environment at /home/shekhar/.conda/envs/PySparkEnv: # #…
spartacus
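
The usual fix is installing pyarrow into the same environment the Spark workers use (e.g. conda install -c conda-forge pyarrow inside PySparkEnv) and pointing PySpark at that interpreter. A sketch; the path mirrors the question and is illustrative:

```python
import os

# Must be set before the SparkSession is created so workers inherit it.
os.environ["PYSPARK_PYTHON"] = (
    "/home/shekhar/.conda/envs/PySparkEnv/bin/python"
)
```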
9 votes · 1 answer

Handling UUID values in Arrow with Parquet files

I'm new to Python and Pandas - please be gentle! I'm using SqlAlchemy with pymssql to execute a SQL query against a SQL Server database and then convert the result set into a dataframe. I'm then attempting to write this dataframe as a Parquet file: …
Chris Wood
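
A sketch of the common workaround: Arrow has no native UUID type, so uuid.UUID objects fail to convert and are usually cast to strings (or bytes) before writing:

```python
import uuid
import pandas as pd

df = pd.DataFrame({"id": [uuid.uuid4() for _ in range(3)]})

# Cast UUID objects to str so pyarrow sees a plain string column.
df["id"] = df["id"].astype(str)
df.to_parquet("with_uuids.parquet", engine="pyarrow")
```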
9 votes · 4 answers

finding nested columns in pandas dataframe

I have a large dataset with many columns in (compressed) JSON format. I'm trying to convert it to parquet for subsequent processing. Some columns have a nested structure. For now I want to ignore this structure and just write those columns out as a…
Daniel Kats
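
One way to detect the nested columns before deciding how to serialize them, sketched with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({"flat": [1, 2], "nested": [{"a": 1}, {"a": 2}]})

# A column is "nested" if any of its values is a dict or list.
nested_cols = [
    col for col in df.columns
    if df[col].map(lambda v: isinstance(v, (dict, list))).any()
]
print(nested_cols)  # ['nested']
```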