Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1078 questions
5
votes
1 answer

Do memory mapped files in Docker containers in Kubernetes work the same as in regular processes in Linux?

I have process A and process B. Process A opens a file, calls mmap, and writes to it; process B does the same but reads the same mapped region after process A has finished writing. Using mmap, process B is supposed to read the file from memory instead…
rboc
  • 344
  • 3
  • 10
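
The question above hinges on whether two processes that mmap the same file see each other's writes through the shared page cache. A minimal sketch of that pattern using only the standard library, with a hypothetical path on a volume both containers mount (e.g. an emptyDir in the same pod):

```python
import mmap

PATH = "/shared/data.bin"  # hypothetical path on a volume both containers mount
SIZE = 4096

def writer():
    # Process A: create the file, map it, and write through the mapping.
    with open(PATH, "wb") as f:
        f.truncate(SIZE)  # the file must have its full size before mapping
    with open(PATH, "r+b") as f:
        with mmap.mmap(f.fileno(), SIZE) as m:  # MAP_SHARED by default
            m[:5] = b"hello"
            m.flush()  # push dirty pages back to the file

def reader():
    # Process B: map the same file read-only and read what A wrote.
    with open(PATH, "rb") as f:
        with mmap.mmap(f.fileno(), SIZE, prot=mmap.PROT_READ) as m:
            print(bytes(m[:5]))  # b"hello"
```

Containers in the same pod share the node's kernel, so as long as both map the same file from a shared volume, page-cache behaviour should match that of two ordinary Linux processes.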
5
votes
1 answer

PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger-than-memory parquet files)

I'm trying to write a large parquet file onto disk (larger than memory). I naively thought I could be clever and use ParquetWriter and write_table to incrementally write a file, like this (POC): import pyarrow as pa import pyarrow.parquet as pq import…
Niklas B
  • 1,839
  • 18
  • 36
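
The incremental pattern this question describes is the standard approach: pq.ParquetWriter keeps only the row group currently being written in memory, so the file on disk can grow far beyond RAM. A minimal sketch with an illustrative schema and chunk size:

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("x", pa.int64())])
chunk = 100_000  # illustrative chunk size

writer = pq.ParquetWriter("big.parquet", schema)
try:
    for start in range(0, 1_000_000, chunk):
        # Build one chunk at a time from whatever source feeds the data,
        # so only a single chunk is ever resident in memory.
        table = pa.table({"x": list(range(start, start + chunk))}, schema=schema)
        writer.write_table(table)
finally:
    writer.close()  # writes the parquet footer
```

Peak memory is then roughly one chunk plus pyarrow's write buffers, not the whole dataset.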
5
votes
2 answers

How to load a Modin dataframe from pyarrow or pandas

Since Modin does not support loading from multiple pyarrow files on s3, I am using pyarrow to load the data. import s3fs import modin.pandas as pd from pyarrow import parquet s3 = s3fs.S3FileSystem( key=aws_key, …
galinden
  • 610
  • 8
  • 13
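
One way around the limitation described above, sketched under the assumption that the data is a plain multi-file parquet dataset on S3 (the bucket path and credentials below are placeholders): read with pyarrow, then hand the result to Modin via pandas.

```python
import s3fs
import pyarrow.parquet as pq
import modin.pandas as mpd

s3 = s3fs.S3FileSystem(key="...", secret="...")  # placeholder credentials

# Read the multi-file parquet dataset with pyarrow...
dataset = pq.ParquetDataset("my-bucket/path/to/data", filesystem=s3)
pandas_df = dataset.read().to_pandas()

# ...then let Modin distribute it; the constructor accepts a pandas DataFrame.
modin_df = mpd.DataFrame(pandas_df)
```

Note this materializes the whole dataset in local memory first; it trades Modin's missing multi-file S3 reader for a pandas round trip.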
5
votes
1 answer

pyarrow data types for columns that have lists of dictionaries?

Is there a special pyarrow data type I should use for columns which have lists of dictionaries when I save to a parquet file? If I save lists or lists of dictionaries as a string, I normally have to .apply(eval) the field if I read it into memory…
trench
  • 5,075
  • 12
  • 50
  • 80
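
There is such a type: a list of dictionaries maps naturally to pa.list_(pa.struct([...])), which parquet stores natively, so no eval() round trip is needed on read. A sketch with made-up field names:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical column whose cells are lists of {"name": str, "score": float} dicts.
schema = pa.schema([
    ("records", pa.list_(pa.struct([
        ("name", pa.string()),
        ("score", pa.float64()),
    ]))),
])

table = pa.Table.from_pydict(
    {"records": [[{"name": "a", "score": 1.0}],
                 [{"name": "b", "score": 2.0}, {"name": "c", "score": 3.0}]]},
    schema=schema,
)
pq.write_table(table, "nested.parquet")

# Reading it back yields lists of dicts again, no eval() needed.
print(pq.read_table("nested.parquet").to_pydict())
```

This assumes every dictionary shares the same keys; for ragged keys, pa.map_(pa.string(), ...) is the closer fit.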
5
votes
1 answer

PySpark 2.4.5: IllegalArgumentException when using PandasUDF

I am trying Pandas UDFs and am facing an IllegalArgumentException. I also tried replicating examples from the PySpark GroupedData documentation to check, but I still get the error. The environment configuration is as follows: python3.7 Installed…
jaykay
  • 371
  • 1
  • 2
  • 10
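
For reproducing, here is a grouped-map example adapted from the PySpark 2.4 GroupedData docs the question mentions. A frequently reported cause of this exception on Spark 2.4.x is a pyarrow >= 0.15 install, whose changed IPC format Spark 2.4 predates; pinning pyarrow below 0.15 is the usual suggestion (see also the toPandas entry further down).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

# Grouped-map pandas UDF, as in the PySpark 2.4 documentation.
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
```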
5
votes
1 answer

PyArrow 0.16.0 fs.HadoopFileSystem throws HDFS connection failed

I am currently migrating from the old Arrow filesystem interface (http://arrow.apache.org/docs/python/filesystems_deprecated.html) to the new filesystem interface (http://arrow.apache.org/docs/python/filesystems.html). I am trying to connect to HDFS using…
Sephixx
  • 117
  • 1
  • 8
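
For reference, a minimal connection sketch against the new interface, with placeholder host, port, and user. In practice "HDFS connection failed" usually means the Hadoop client environment is incomplete: libhdfs must be findable and CLASSPATH populated (e.g. from `hadoop classpath --glob`) before pyarrow is imported.

```python
from pyarrow import fs

# Placeholder host/port/user; requires HADOOP_HOME, libhdfs, and a
# populated CLASSPATH in the environment before this runs.
hdfs = fs.HadoopFileSystem(host="namenode", port=8020, user="hadoop")

# List a directory to verify the connection.
for info in hdfs.get_file_info(fs.FileSelector("/data", recursive=False)):
    print(info.path, info.size)
```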
5
votes
2 answers

Cannot import pyarrow in pyspark

I am trying to use pyarrow with pyspark. However, when I try to execute import pyarrow I receive the following error: In [1]: import pyarrow --------------------------------------------------------------------------- ImportError …
Galuoises
  • 2,630
  • 24
  • 30
5
votes
0 answers

PySpark 2.4.4 toPandas fails with ValueError: not enough values to unpack (expected 3, got 2)

On a Spark dataframe, when I do a toPandas I end up with this error: pandas_df = spark_df.toPandas() File "/opt/mapr/spark/spark-2.4.4/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 2122, in toPandas File…
Manoj Srivatsav
  • 280
  • 2
  • 11
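
No answer was posted, but the usual suspect for Spark 2.4.x toPandas failures is the Arrow IPC format change in pyarrow 0.15.0, which Spark 2.4 predates. Assuming that is the cause here, the two workarounds documented by Spark are pinning pyarrow below 0.15 or re-enabling the legacy format:

```python
import os

# Workaround from the Spark 2.4.x docs for pyarrow >= 0.15.0: re-enable the
# pre-0.15 Arrow IPC format that Spark 2.4 expects. It must be set on the
# driver *and* the executors (e.g. via spark-env.sh or spark.executorEnv.*)
# before the workers spawn their Python processes.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```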
5
votes
2 answers

error: command 'cmake' failed: No such file or directory

Getting an error while installing vaex in PyCharm with Python 3.8. I installed the below before running this on my Win-10 64-bit machine: cmake v3.15.3, pep517 v0.8.1, pip v19.3.1. Error logs: running build_ext creating build\temp.win-amd64-3.8 creating…
user1222006
  • 159
  • 1
  • 3
  • 11
5
votes
1 answer

Converted Apache Arrow file from dataframe gives null while reading with arrow.js

I converted a sample dataframe to an .arrow file using pyarrow: import numpy as np import pandas as pd import pyarrow as pa df = pd.DataFrame({"a": [10, 2, 3]}) df['a'] = pd.to_numeric(df['a'],errors='coerce') table = pa.Table.from_pandas(df) writer…
Sarath
  • 9,030
  • 11
  • 51
  • 84
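
A detail worth checking in this situation: pyarrow writes two IPC layouts, the file format (RecordBatchFileWriter) and the stream format (RecordBatchStreamWriter), and an arrow.js reader fed the wrong one can come up empty. A sketch of writing the stream format for the same dataframe, assuming that mismatch is the problem:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [10, 2, 3]})
df["a"] = pd.to_numeric(df["a"], errors="coerce")
table = pa.Table.from_pandas(df, preserve_index=False)

# Write the Arrow IPC *stream* format rather than the *file* format.
sink = pa.OSFile("df.arrow", "wb")
writer = pa.RecordBatchStreamWriter(sink, table.schema)
writer.write_table(table)
writer.close()
sink.close()
```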
5
votes
1 answer

Excessively high memory usage when reading parquet in Python

I have a parquet file of around 10+GB whose columns are mainly strings. When loading it into memory, usage peaks at about 110GB, then drops back to around 40GB once loading finishes. I'm working on a high-performance…
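
The gap between the 110GB peak and the 40GB steady state typically comes from materializing the whole file at once (plus the pandas conversion of string columns). One way to bound the peak, sketched with a hypothetical file, is to stream it one row group at a time:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("big.parquet", memory_map=True)  # hypothetical file

# Process one row group at a time; peak memory is then roughly the size
# of a single decompressed row group rather than the whole file.
for i in range(pf.num_row_groups):
    table = pf.read_row_group(i)
    print(i, table.num_rows)  # placeholder for the real per-chunk work
```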
5
votes
2 answers

Reading Parquet File with Array<Map<String, String>> Column

I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. array<map<string,string>>). An example of the df would be: import pandas as pd df = pd.DataFrame.from_records([ (1,…
Jon.H
  • 794
  • 2
  • 9
  • 23
5
votes
2 answers

parquet file size, firehose vs. spark

I'm generating Parquet files via two methods: a Kinesis Firehose and a Spark job. They are both written into the same partition structure on S3. Both sets of data can be queried using the same Athena table definition. Both use gzip…
jph
  • 2,181
  • 3
  • 30
  • 55
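
When two writers produce different file sizes from the same data, the parquet metadata usually explains it: row-group sizing, encodings, and dictionary use all differ between Firehose and Spark defaults. A quick comparison sketch with placeholder file names:

```python
import pyarrow.parquet as pq

for path in ("firehose.parquet", "spark.parquet"):  # placeholder names
    meta = pq.ParquetFile(path).metadata
    print(path, "-", meta.num_rows, "rows in", meta.num_row_groups, "row groups")
    rg = meta.row_group(0)
    print("  first row group:", rg.num_rows, "rows,",
          rg.total_byte_size, "bytes uncompressed")
```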
5
votes
1 answer

Is there a Python module to read avro files with pyarrow?

I know there is pyarrow.parquet for reading parquet files as an Arrow table, but I'm looking for the equivalent for Avro.
djohon
  • 705
  • 2
  • 10
  • 25
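
pyarrow itself ships no Avro reader, so the common workaround is to decode with a dedicated Avro library and build the Arrow table from the decoded records. A sketch using fastavro (Table.from_pylist needs pyarrow >= 7.0; the file name is a placeholder):

```python
import fastavro
import pyarrow as pa

# Decode the Avro container file into plain dicts with fastavro...
with open("data.avro", "rb") as f:  # placeholder file name
    records = list(fastavro.reader(f))

# ...then assemble an Arrow table from the records.
table = pa.Table.from_pylist(records)
print(table.schema)
```

This decodes everything into memory first; for larger files, records can be batched and appended with pq.ParquetWriter as in the ParquetWriter entry above.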
5
votes
0 answers

Writing stream of big data to Parquet with Python

I want to write a stream of big data to a parquet file with Python. My data is huge and I cannot keep it in memory to write it in one go. I found two Python libraries (pyarrow, fastparquet) that can read and write Parquet files. This…
Mohsen Laali
  • 463
  • 1
  • 3
  • 17