Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow.

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.


1078 questions
7 votes · 1 answer

pip install pyarrow failing on Linux / inside a Docker container

I tried installing pyarrow and it fails with the error below. I also tried the option --no-binary :all: and hit the same problem. Any help resolving this would be appreciated. Python version: 3.7. Base image: python:3.7-alpine. Below is the…
7 votes · 2 answers

Create Parquet files from stream in python in memory-efficient manner

It appears the most common way in Python to create Parquet files is to first create a Pandas dataframe and then use pyarrow to write the table to parquet. I worry that this might be overly taxing on memory, as it requires at least one full…
asked by aaronsteers
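
A minimal sketch of the incremental pattern, assuming the stream can be materialized as small record batches with a schema known up front; pq.ParquetWriter appends a row group per batch, so only one batch is resident at a time (names and sizes below are placeholders):

    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

    def batches():
        # hypothetical stand-in for the incoming stream: yields small batches
        for start in range(0, 100_000, 10_000):
            ids = list(range(start, start + 10_000))
            yield pa.record_batch(
                [pa.array(ids, type=pa.int64()),
                 pa.array([float(i) for i in ids], type=pa.float64())],
                schema=schema)

    # the writer appends row groups as they arrive, so memory stays bounded
    with pq.ParquetWriter("out.parquet", schema) as writer:
        for batch in batches():
            writer.write_table(pa.Table.from_batches([batch]))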
7 votes · 2 answers

Problem running a Pandas UDF on a large dataset

I'm currently working on a project and I am having a hard time understanding how Pandas UDFs in PySpark work. I have a Spark cluster with one master node (8 cores, 64 GB) and two workers (16 cores, 112 GB each). My dataset…
asked by naifmeh
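
For reference, a minimal scalar Pandas UDF with the Arrow batch size capped, so each invocation receives a bounded pandas Series; the config keys are the Spark 2.4-era names, and the batch size is an assumption to tune:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = (SparkSession.builder
             .config("spark.sql.execution.arrow.enabled", "true")
             # cap rows per Arrow batch so each UDF call stays small
             .config("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")
             .getOrCreate())

    @pandas_udf("double", PandasUDFType.SCALAR)
    def times_two(v: pd.Series) -> pd.Series:
        # runs once per Arrow batch, not once per row
        return v * 2.0

    df = spark.range(1_000_000).withColumn("doubled", times_two("id"))
    df.show(3)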
7 votes · 2 answers

AWS EMR - ModuleNotFoundError: No module named 'pyarrow'

I am running into this problem with the Apache Arrow Spark integration, using AWS EMR with Spark 2.4.3. I tested this on both a local single-machine Spark instance and a Cloudera cluster, and everything works fine there. I set these in spark-env.sh: export…
asked by thePurplePython
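
Before editing spark-env.sh, one quick diagnostic is to ask the executors which interpreter they run and whether it can resolve pyarrow; this sketch assumes a live `spark` session:

    def probe(_):
        # runs on an executor: report its interpreter and whether pyarrow imports
        import sys
        import importlib.util
        return [(sys.executable, importlib.util.find_spec("pyarrow") is not None)]

    print(spark.sparkContext.parallelize([0], numSlices=1).flatMap(probe).collect())

If the reported interpreter is not the one pyarrow was installed into, PYSPARK_PYTHON on the workers is the usual culprit.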
7 votes · 1 answer

Using data to construct a Table, avoiding creating a dataframe

A pandas dataframe is heavyweight, so I want to avoid it, but I do want to construct a pyarrow Table in order to store the data in Parquet format. I searched and read the documentation and tried to use from_array(), but it is not working.…
asked by Zichu Lee
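
The factory methods for this are pa.Table.from_pydict / pa.Table.from_arrays (there is no Table.from_array); a minimal sketch with hypothetical column names:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # build the columns directly as Arrow arrays; pandas is never involved
    table = pa.Table.from_pydict({
        "id": pa.array([1, 2, 3], type=pa.int64()),
        "name": pa.array(["a", "b", "c"], type=pa.string()),
    })
    pq.write_table(table, "data.parquet")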
7 votes · 2 answers

Linux pyarrow undefined symbol

I am running Python 3.7.2 and using Miniconda3 to create a new environment named test-env. I have installed the pyarrow package from the default channel into this environment; however, when I try to import pyarrow, the Python interpreter gives me…
asked by Nester
7 votes · 2 answers

Datatype issues when converting parquet data to a pandas dataframe

I have a problem with data types when converting a parquet file to a dataframe. I do bucket = 's3://some_bucket/test/usages' import pyarrow.parquet as pq import s3fs s3 = s3fs.S3FileSystem() read_pq = pq.ParquetDataset(bucket,…
asked by clog14
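
One way to see where the types change is to compare the Arrow schema with the pandas dtypes after conversion; a common surprise is that integer columns containing nulls come back as float64. A sketch against the bucket path from the question:

    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()
    dataset = pq.ParquetDataset("s3://some_bucket/test/usages", filesystem=fs)
    table = dataset.read()
    print(table.schema)   # the types exactly as stored in the parquet files

    df = table.to_pandas()
    print(df.dtypes)      # e.g. nullable int columns become float64 here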
7 votes · 1 answer

Streaming a parquet file in Python and only down-sampling

I have data in parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample, and save to a dataframe? Ultimately, I would like to have…
asked by Sjoseph
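
A sketch that processes one row group at a time, so only a slice of the 6 GB file is ever decoded; the 1% sampling rate and file name are placeholders:

    import pandas as pd
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("big.parquet")
    sampled = []
    for i in range(pf.num_row_groups):
        # only this row group is decoded into memory
        chunk = pf.read_row_group(i).to_pandas()
        sampled.append(chunk.sample(frac=0.01))  # hypothetical 1% down-sample

    df = pd.concat(sampled, ignore_index=True)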
7 votes · 2 answers

Memory leak from pyarrow?

While parsing a larger file, I need to write to a large number of parquet files successively in a loop. However, it appears that the memory consumed by this task increases with each iteration, whereas I would expect it to remain constant (as…
asked by Abel Riboulot
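
The growth is often the jemalloc allocator retaining freed pages rather than a true leak; one hedged experiment (assuming the default jemalloc-backed build) is to shorten the decay and watch the pool statistics across iterations:

    import pyarrow as pa

    # ask jemalloc to return freed memory to the OS promptly
    pa.jemalloc_set_decay_ms(0)

    pool = pa.default_memory_pool()
    # ... perform one parquet-writing iteration here ...
    print(pool.bytes_allocated(), pool.max_memory())  # compare across iterations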
7 votes · 2 answers

Pyarrow s3fs partition by timestamp

Is it possible to use a timestamp field in the pyarrow table to partition the s3fs file system by "YYYY/MM/DD/HH" while writing a parquet file to S3?
asked by thotam
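
There is no built-in strftime-style partitioning, but the usual workaround is to derive year/month/day/hour columns and pass them as partition_cols; the bucket path below is hypothetical:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq
    import s3fs

    table = pa.table({"ts": pa.array(pd.date_range("2021-01-01", periods=4, freq="H")),
                      "v": [1, 2, 3, 4]})

    # derive the partition keys from the timestamp column
    for name, fn in [("year", pc.year), ("month", pc.month),
                     ("day", pc.day), ("hour", pc.hour)]:
        table = table.append_column(name, fn(table["ts"]))

    fs = s3fs.S3FileSystem()
    pq.write_to_dataset(table, "s3://some-bucket/events",
                        partition_cols=["year", "month", "day", "hour"],
                        filesystem=fs)

Note the output layout is hive-style (year=2021/month=1/…) rather than a bare YYYY/MM/DD/HH path.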
6 votes · 1 answer

How to convert a pandas dataframe to an Arrow dataset?

In the huggingface library, there is a particular dataset format called an Arrow dataset: https://arrow.apache.org/docs/python/dataset.html https://huggingface.co/datasets/wiki_lingua I have to convert a normal pandas dataframe to such a dataset, or read a…
asked by Zenith_Raven
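
The huggingface `datasets` library has a direct constructor for this; a minimal sketch:

    import pandas as pd
    from datasets import Dataset  # huggingface `datasets` library

    df = pd.DataFrame({"text": ["hello", "world"], "label": [0, 1]})
    ds = Dataset.from_pandas(df)  # backed by an Arrow table under the hood
    print(ds.features)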
6 votes · 1 answer

How can I chunk through a CSV using Arrow?

What I am trying to do: I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have plenty of columns and a high memory footprint (enough to crash the machine running the job). I am trying to chunk through the…
asked by alt-f4
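
pyarrow.csv.open_csv returns a streaming reader that decodes one block at a time, which can be piped straight into a ParquetWriter; the block size and file names are assumptions to tune:

    import pyarrow as pa
    import pyarrow.csv as pacsv
    import pyarrow.parquet as pq

    reader = pacsv.open_csv(
        "big.csv",
        read_options=pacsv.ReadOptions(block_size=64 << 20))  # ~64 MB blocks

    writer = None
    try:
        for batch in reader:  # one decoded RecordBatch at a time
            if writer is None:
                writer = pq.ParquetWriter("big.parquet", batch.schema)
            writer.write_table(pa.Table.from_batches([batch]))
    finally:
        if writer is not None:
            writer.close()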
6 votes · 0 answers

How to fix the timestamp interpretation of parquet files in Python pandas

I have some Spark (Scala) dataframes/tables with timestamps coming from our DWH, which sometimes use high-watermark values. I want to work with this data in Python with pandas, so I write them out as parquet files from Spark and read them…
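
If those watermarks are extreme dates such as 9999-12-31, they overflow pandas' nanosecond datetime64 during conversion; one hedged workaround is to keep such columns as Python datetime objects (the file name is a placeholder):

    import pyarrow.parquet as pq

    table = pq.read_table("from_spark.parquet")
    # out-of-range timestamps (e.g. 9999-12-31 watermarks) cannot be represented
    # as datetime64[ns]; this keeps them as Python datetime objects instead
    df = table.to_pandas(timestamp_as_object=True)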
6 votes · 3 answers

How can I stream more data than will fit in memory from a PostgreSQL query to a parquet file?

I have the code below, which queries a database of about 500k rows, and it gets a SIGKILL when it hits rows = cur.fetchall(). I've tried iterating through the cursor rather than loading everything into rows, but it still seems to cause OOM issues. How…
asked by Caleb Yoon
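
One memory-bounded pattern is a psycopg2 named (server-side) cursor fetched in pages, with each page appended through a single ParquetWriter; the DSN, query, and schema below are placeholders:

    import psycopg2
    import pyarrow as pa
    import pyarrow.parquet as pq

    conn = psycopg2.connect("dbname=mydb")          # hypothetical DSN
    cur = conn.cursor(name="stream")                # named => server-side cursor
    cur.execute("SELECT id, value FROM big_table")  # hypothetical query

    schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])
    with pq.ParquetWriter("out.parquet", schema) as writer:
        while True:
            rows = cur.fetchmany(10_000)  # only one page in memory at a time
            if not rows:
                break
            ids, values = zip(*rows)
            writer.write_table(pa.table({"id": list(ids), "value": list(values)},
                                        schema=schema))
    conn.close()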
6 votes · 1 answer

Write nested parquet format from Python

I have a flat parquet file where one varchar column stores JSON data as a string, and I want to transform this data into a nested structure, i.e. the JSON data becomes nested parquet. I know the schema of the JSON in advance, if that is of any…
asked by Stephan Claus
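
Since the JSON schema is known in advance, one sketch is to parse the strings into Python dicts and build a struct column with an explicit type (the column names and schema here are hypothetical):

    import json
    import pyarrow as pa
    import pyarrow.parquet as pq

    flat = pa.table({"id": [1, 2],
                     "payload": ['{"a": 1, "b": {"c": "x"}}',
                                 '{"a": 2, "b": {"c": "y"}}']})

    # declare the target struct type explicitly, matching the known JSON schema
    nested_type = pa.struct([("a", pa.int64()),
                             ("b", pa.struct([("c", pa.string())]))])

    # parse each JSON string; pa.array assembles the struct column
    parsed = pa.array([json.loads(s) for s in flat["payload"].to_pylist()],
                      type=nested_type)

    nested = flat.drop(["payload"]).append_column("payload", parsed)
    pq.write_table(nested, "nested.parquet")  # written as a nested group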