Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow.

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.


1078 questions
7 votes · 1 answer

pip install pyarrow failing on Linux / inside a Docker container

I tried installing pyarrow and it fails with the error below. I also tried the option --no-binary :all: and hit the same problem. Any help resolving this would be appreciated. Python version: 3.7. Base image: python:3.7-alpine. Below is the…
7 votes · 2 answers

Create Parquet files from stream in python in memory-efficient manner

It appears the most common way in Python to create Parquet files is to first create a Pandas dataframe and then use pyarrow to write the table to parquet. I worry that this might be overly taxing on memory, as it requires at least one full…
asked by aaronsteers
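
A minimal sketch of the incremental pattern, assuming the stream can be materialized as small record batches with a schema known up front; pq.ParquetWriter appends a row group per batch, so only one batch is resident at a time (names and sizes below are placeholders):

    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

    def batches():
        # hypothetical stand-in for the incoming stream: yields small batches
        for start in range(0, 100_000, 10_000):
            ids = list(range(start, start + 10_000))
            yield pa.record_batch(
                [pa.array(ids, type=pa.int64()),
                 pa.array([float(i) for i in ids], type=pa.float64())],
                schema=schema)

    # the writer appends row groups as they arrive, so memory stays bounded
    with pq.ParquetWriter("out.parquet", schema) as writer:
        for batch in batches():
            writer.write_table(pa.Table.from_batches([batch]))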
7 votes · 2 answers

Problem running a Pandas UDF on a large dataset

I'm currently working on a project and I am having a hard time understanding how Pandas UDFs in PySpark work. I have a Spark cluster with one master node (8 cores, 64 GB) and two workers (16 cores, 112 GB each). My dataset…
asked by naifmeh
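
For reference, a minimal scalar Pandas UDF with the Arrow batch size capped, so each invocation receives a bounded pandas Series; the config keys are the Spark 2.4-era names, and the batch size is an assumption to tune:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = (SparkSession.builder
             .config("spark.sql.execution.arrow.enabled", "true")
             # cap rows per Arrow batch so each UDF call stays small
             .config("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")
             .getOrCreate())

    @pandas_udf("double", PandasUDFType.SCALAR)
    def times_two(v: pd.Series) -> pd.Series:
        # runs once per Arrow batch, not once per row
        return v * 2.0

    df = spark.range(1_000_000).withColumn("doubled", times_two("id"))
    df.show(3)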
7 votes · 2 answers

AWS EMR - ModuleNotFoundError: No module named 'pyarrow'

I am running into this problem with the Apache Arrow Spark integration, using AWS EMR with Spark 2.4.3. I tested this on both a local single-machine Spark instance and a Cloudera cluster, and everything works fine there. I set these in spark-env.sh: export…
asked by thePurplePython
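
Before editing spark-env.sh, one quick diagnostic is to ask the executors which interpreter they run and whether it can resolve pyarrow; this sketch assumes a live `spark` session:

    def probe(_):
        # runs on an executor: report its interpreter and whether pyarrow imports
        import sys
        import importlib.util
        return [(sys.executable, importlib.util.find_spec("pyarrow") is not None)]

    print(spark.sparkContext.parallelize([0], numSlices=1).flatMap(probe).collect())

If the reported interpreter is not the one pyarrow was installed into, PYSPARK_PYTHON on the workers is the usual culprit.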
7 votes · 1 answer

Using data to construct a Table, avoiding creating a dataframe

A pandas dataframe is heavyweight, so I want to avoid it, but I do want to construct a pyarrow Table in order to store the data in Parquet format. I searched and read the documentation and tried to use from_array(), but it is not working.…
asked by Zichu Lee
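
The factory methods for this are pa.Table.from_pydict / pa.Table.from_arrays (there is no Table.from_array); a minimal sketch with hypothetical column names:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # build the columns directly as Arrow arrays; pandas is never involved
    table = pa.Table.from_pydict({
        "id": pa.array([1, 2, 3], type=pa.int64()),
        "name": pa.array(["a", "b", "c"], type=pa.string()),
    })
    pq.write_table(table, "data.parquet")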
7 votes · 2 answers

Linux pyarrow undefined symbol

I am running Python 3.7.2 and using Miniconda3 to create a new environment named test-env. I have installed the pyarrow package from the default channel into this environment; however, when I try to import pyarrow, the Python interpreter gives me…
asked by Nester
7 votes · 2 answers

Datatype issues when converting parquet data to a pandas dataframe

I have a problem with data types when converting a parquet file to a dataframe. I do bucket = 's3://some_bucket/test/usages' import pyarrow.parquet as pq import s3fs s3 = s3fs.S3FileSystem() read_pq = pq.ParquetDataset(bucket,…
asked by clog14
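
One way to see where the types change is to compare the Arrow schema with the pandas dtypes after conversion; a common surprise is that integer columns containing nulls come back as float64. A sketch against the bucket path from the question:

    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()
    dataset = pq.ParquetDataset("s3://some_bucket/test/usages", filesystem=fs)
    table = dataset.read()
    print(table.schema)   # the types exactly as stored in the parquet files

    df = table.to_pandas()
    print(df.dtypes)      # e.g. nullable int columns become float64 here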
7 votes · 1 answer

Streaming a parquet file in Python and only down-sampling

I have data in parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample, and save to a dataframe? Ultimately, I would like to have…
asked by Sjoseph
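
A sketch that processes one row group at a time, so only a slice of the 6 GB file is ever decoded; the 1% sampling rate and file name are placeholders:

    import pandas as pd
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("big.parquet")
    sampled = []
    for i in range(pf.num_row_groups):
        # only this row group is decoded into memory
        chunk = pf.read_row_group(i).to_pandas()
        sampled.append(chunk.sample(frac=0.01))  # hypothetical 1% down-sample

    df = pd.concat(sampled, ignore_index=True)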
7 votes · 2 answers

Memory leak from pyarrow?

While parsing a larger file, I need to write to a large number of parquet files successively in a loop. However, it appears that the memory consumed by this task increases with each iteration, whereas I would expect it to remain constant (as…
asked by Abel Riboulot
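
The growth is often the jemalloc allocator retaining freed pages rather than a true leak; one hedged experiment (assuming the default jemalloc-backed build) is to shorten the decay and watch the pool statistics across iterations:

    import pyarrow as pa

    # ask jemalloc to return freed memory to the OS promptly
    pa.jemalloc_set_decay_ms(0)

    pool = pa.default_memory_pool()
    # ... perform one parquet-writing iteration here ...
    print(pool.bytes_allocated(), pool.max_memory())  # compare across iterations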
7 votes · 2 answers

Pyarrow s3fs partition by timestamp

Is it possible to use a timestamp field in the pyarrow table to partition the s3fs file system by "YYYY/MM/DD/HH" while writing a parquet file to S3?
asked by thotam
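
There is no built-in strftime-style partitioning, but the usual workaround is to derive year/month/day/hour columns and pass them as partition_cols; the bucket path below is hypothetical:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq
    import s3fs

    table = pa.table({"ts": pa.array(pd.date_range("2021-01-01", periods=4, freq="H")),
                      "v": [1, 2, 3, 4]})

    # derive the partition keys from the timestamp column
    for name, fn in [("year", pc.year), ("month", pc.month),
                     ("day", pc.day), ("hour", pc.hour)]:
        table = table.append_column(name, fn(table["ts"]))

    fs = s3fs.S3FileSystem()
    pq.write_to_dataset(table, "s3://some-bucket/events",
                        partition_cols=["year", "month", "day", "hour"],
                        filesystem=fs)

Note the output layout is hive-style (year=2021/month=1/…) rather than a bare YYYY/MM/DD/HH path.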
6 votes · 1 answer

How to convert a pandas dataframe to an Arrow dataset?

In the huggingface library, there is a particular dataset format called an Arrow dataset: https://arrow.apache.org/docs/python/dataset.html https://huggingface.co/datasets/wiki_lingua I have to convert a normal pandas dataframe to such a dataset, or read a…
asked by Zenith_Raven
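
The huggingface `datasets` library has a direct constructor for this; a minimal sketch:

    import pandas as pd
    from datasets import Dataset  # huggingface `datasets` library

    df = pd.DataFrame({"text": ["hello", "world"], "label": [0, 1]})
    ds = Dataset.from_pandas(df)  # backed by an Arrow table under the hood
    print(ds.features)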
6 votes · 1 answer

How can I chunk through a CSV using Arrow?

What I am trying to do: I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have plenty of columns and a high memory footprint (enough to crash the machine running the job). I am trying to chunk through the…
asked by alt-f4
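
pyarrow.csv.open_csv returns a streaming reader that decodes one block at a time, which can be piped straight into a ParquetWriter; the block size and file names are assumptions to tune:

    import pyarrow as pa
    import pyarrow.csv as pacsv
    import pyarrow.parquet as pq

    reader = pacsv.open_csv(
        "big.csv",
        read_options=pacsv.ReadOptions(block_size=64 << 20))  # ~64 MB blocks

    writer = None
    try:
        for batch in reader:  # one decoded RecordBatch at a time
            if writer is None:
                writer = pq.ParquetWriter("big.parquet", batch.schema)
            writer.write_table(pa.Table.from_batches([batch]))
    finally:
        if writer is not None:
            writer.close()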
6 votes · 0 answers

How to fix the timestamp interpretation of parquet files in Python pandas

I have some Spark (Scala) dataframes/tables with timestamps coming from our DWH, which sometimes use high-watermark values. I want to work with this data in Python with pandas, so I write them out as parquet files from Spark and read them…
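
If those watermarks are extreme dates such as 9999-12-31, they overflow pandas' nanosecond datetime64 during conversion; one hedged workaround is to keep such columns as Python datetime objects (the file name is a placeholder):

    import pyarrow.parquet as pq

    table = pq.read_table("from_spark.parquet")
    # out-of-range timestamps (e.g. 9999-12-31 watermarks) cannot be represented
    # as datetime64[ns]; this keeps them as Python datetime objects instead
    df = table.to_pandas(timestamp_as_object=True)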
6 votes · 3 answers

How can I stream more data than will fit in memory from a PostgreSQL query to a parquet file?

I have the code below, which queries a database of about 500k rows, and it gets a SIGKILL when it hits rows = cur.fetchall(). I've tried iterating through the cursor rather than loading everything into rows, but it still seems to cause OOM issues. How…
asked by Caleb Yoon
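
One memory-bounded pattern is a psycopg2 named (server-side) cursor fetched in pages, with each page appended through a single ParquetWriter; the DSN, query, and schema below are placeholders:

    import psycopg2
    import pyarrow as pa
    import pyarrow.parquet as pq

    conn = psycopg2.connect("dbname=mydb")          # hypothetical DSN
    cur = conn.cursor(name="stream")                # named => server-side cursor
    cur.execute("SELECT id, value FROM big_table")  # hypothetical query

    schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])
    with pq.ParquetWriter("out.parquet", schema) as writer:
        while True:
            rows = cur.fetchmany(10_000)  # only one page in memory at a time
            if not rows:
                break
            ids, values = zip(*rows)
            writer.write_table(pa.table({"id": list(ids), "value": list(values)},
                                        schema=schema))
    conn.close()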
6 votes · 1 answer

Write nested parquet format from Python

I have a flat parquet file where one varchar column stores JSON data as a string, and I want to transform this data into a nested structure, i.e. the JSON data becomes nested parquet. I know the schema of the JSON in advance, if that is of any…
asked by Stephan Claus
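
Since the JSON schema is known in advance, one sketch is to parse the strings into Python dicts and build a struct column with an explicit type (the column names and schema here are hypothetical):

    import json
    import pyarrow as pa
    import pyarrow.parquet as pq

    flat = pa.table({"id": [1, 2],
                     "payload": ['{"a": 1, "b": {"c": "x"}}',
                                 '{"a": 2, "b": {"c": "y"}}']})

    # declare the target struct type explicitly, matching the known JSON schema
    nested_type = pa.struct([("a", pa.int64()),
                             ("b", pa.struct([("c", pa.string())]))])

    # parse each JSON string; pa.array assembles the struct column
    parsed = pa.array([json.loads(s) for s in flat["payload"].to_pylist()],
                      type=nested_type)

    nested = flat.drop(["payload"]).append_column("payload", parsed)
    pq.write_table(nested, "nested.parquet")  # written as a nested group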