Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1078 questions
5
votes
1 answer

Do memory mapped files in Docker containers in Kubernetes work the same as in regular processes in Linux?

I have process A and process B. Process A opens a file, calls mmap, and writes to it; process B does the same but reads the same mapped region after process A has finished writing. Using mmap, process B is supposed to read the file from memory instead…
rboc
  • 344
  • 3
  • 10
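
The question above hinges on whether two processes that mmap the same file see each other's writes through the shared page cache. A minimal sketch of that pattern using only the standard library, with a hypothetical path on a volume both containers mount (e.g. an emptyDir in the same pod):

```python
import mmap

PATH = "/shared/data.bin"  # hypothetical path on a volume both containers mount
SIZE = 4096

def writer():
    # Process A: create the file, map it, and write through the mapping.
    with open(PATH, "wb") as f:
        f.truncate(SIZE)  # the file must have its full size before mapping
    with open(PATH, "r+b") as f:
        with mmap.mmap(f.fileno(), SIZE) as m:  # MAP_SHARED by default
            m[:5] = b"hello"
            m.flush()  # push dirty pages back to the file

def reader():
    # Process B: map the same file read-only and read what A wrote.
    with open(PATH, "rb") as f:
        with mmap.mmap(f.fileno(), SIZE, prot=mmap.PROT_READ) as m:
            print(bytes(m[:5]))  # b"hello"
```

Containers in the same pod share the node's kernel, so as long as both map the same file from a shared volume, page-cache behaviour should match that of two ordinary Linux processes.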
5
votes
1 answer

PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger-than-memory parquet files)

I'm trying to write a large parquet file onto disk (larger than memory). I naively thought I could be clever and use ParquetWriter and write_table to incrementally write a file, like this (POC): import pyarrow as pa import pyarrow.parquet as pq import…
Niklas B
  • 1,839
  • 18
  • 36
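
The incremental pattern this question describes is the standard approach: pq.ParquetWriter keeps only the row group currently being written in memory, so the file on disk can grow far beyond RAM. A minimal sketch with an illustrative schema and chunk size:

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("x", pa.int64())])
chunk = 100_000  # illustrative chunk size

writer = pq.ParquetWriter("big.parquet", schema)
try:
    for start in range(0, 1_000_000, chunk):
        # Build one chunk at a time from whatever source feeds the data,
        # so only a single chunk is ever resident in memory.
        table = pa.table({"x": list(range(start, start + chunk))}, schema=schema)
        writer.write_table(table)
finally:
    writer.close()  # writes the parquet footer
```

Peak memory is then roughly one chunk plus pyarrow's write buffers, not the whole dataset.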
5
votes
2 answers

How to load a Modin dataframe from pyarrow or pandas

Since Modin does not support loading from multiple pyarrow files on s3, I am using pyarrow to load the data. import s3fs import modin.pandas as pd from pyarrow import parquet s3 = s3fs.S3FileSystem( key=aws_key, …
galinden
  • 610
  • 8
  • 13
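
One way around the limitation described above, sketched under the assumption that the data is a plain multi-file parquet dataset on S3 (the bucket path and credentials below are placeholders): read with pyarrow, then hand the result to Modin via pandas.

```python
import s3fs
import pyarrow.parquet as pq
import modin.pandas as mpd

s3 = s3fs.S3FileSystem(key="...", secret="...")  # placeholder credentials

# Read the multi-file parquet dataset with pyarrow...
dataset = pq.ParquetDataset("my-bucket/path/to/data", filesystem=s3)
pandas_df = dataset.read().to_pandas()

# ...then let Modin distribute it; the constructor accepts a pandas DataFrame.
modin_df = mpd.DataFrame(pandas_df)
```

Note this materializes the whole dataset in local memory first; it trades Modin's missing multi-file S3 reader for a pandas round trip.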
5
votes
1 answer

pyarrow data types for columns that have lists of dictionaries?

Is there a special pyarrow data type I should use for columns which have lists of dictionaries when I save to a parquet file? If I save lists or lists of dictionaries as a string, I normally have to .apply(eval) the field if I read it into memory…
trench
  • 5,075
  • 12
  • 50
  • 80
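
There is such a type: a list of dictionaries maps naturally to pa.list_(pa.struct([...])), which parquet stores natively, so no eval() round trip is needed on read. A sketch with made-up field names:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical column whose cells are lists of {"name": str, "score": float} dicts.
schema = pa.schema([
    ("records", pa.list_(pa.struct([
        ("name", pa.string()),
        ("score", pa.float64()),
    ]))),
])

table = pa.Table.from_pydict(
    {"records": [[{"name": "a", "score": 1.0}],
                 [{"name": "b", "score": 2.0}, {"name": "c", "score": 3.0}]]},
    schema=schema,
)
pq.write_table(table, "nested.parquet")

# Reading it back yields lists of dicts again, no eval() needed.
print(pq.read_table("nested.parquet").to_pydict())
```

This assumes every dictionary shares the same keys; for ragged keys, pa.map_(pa.string(), ...) is the closer fit.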
5
votes
1 answer

PySpark 2.4.5: IllegalArgumentException when using PandasUDF

I am trying Pandas UDFs and am facing an IllegalArgumentException. I also tried replicating examples from the PySpark GroupedData documentation to check, but I still get the error. The environment configuration is as follows: python3.7 Installed…
jaykay
  • 371
  • 1
  • 2
  • 10
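
For reproducing, here is a grouped-map example adapted from the PySpark 2.4 GroupedData docs the question mentions. A frequently reported cause of this exception on Spark 2.4.x is a pyarrow >= 0.15 install, whose changed IPC format Spark 2.4 predates; pinning pyarrow below 0.15 is the usual suggestion (see also the toPandas entry further down).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

# Grouped-map pandas UDF, as in the PySpark 2.4 documentation.
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
```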
5
votes
1 answer

PyArrow 0.16.0 fs.HadoopFileSystem throws HDFS connection failed

I am currently migrating from the old Arrow filesystem interface (http://arrow.apache.org/docs/python/filesystems_deprecated.html) to the new filesystem interface (http://arrow.apache.org/docs/python/filesystems.html). I am trying to connect to HDFS using…
Sephixx
  • 117
  • 1
  • 8
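
For reference, a minimal connection sketch against the new interface, with placeholder host, port, and user. In practice "HDFS connection failed" usually means the Hadoop client environment is incomplete: libhdfs must be findable and CLASSPATH populated (e.g. from `hadoop classpath --glob`) before pyarrow is imported.

```python
from pyarrow import fs

# Placeholder host/port/user; requires HADOOP_HOME, libhdfs, and a
# populated CLASSPATH in the environment before this runs.
hdfs = fs.HadoopFileSystem(host="namenode", port=8020, user="hadoop")

# List a directory to verify the connection.
for info in hdfs.get_file_info(fs.FileSelector("/data", recursive=False)):
    print(info.path, info.size)
```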
5
votes
2 answers

Cannot import pyarrow in pyspark

I am trying to use pyarrow with pyspark. However, when I try to execute import pyarrow I receive the following error: In [1]: import pyarrow --------------------------------------------------------------------------- ImportError …
Galuoises
  • 2,630
  • 24
  • 30
5
votes
0 answers

PySpark 2.4.4 toPandas fails with ValueError: not enough values to unpack (expected 3, got 2)

On a Spark dataframe, when I do a toPandas I end up with this error: pandas_df = spark_df.toPandas() File "/opt/mapr/spark/spark-2.4.4/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 2122, in toPandas File…
Manoj Srivatsav
  • 280
  • 2
  • 11
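
No answer was posted, but the usual suspect for Spark 2.4.x toPandas failures is the Arrow IPC format change in pyarrow 0.15.0, which Spark 2.4 predates. Assuming that is the cause here, the two workarounds documented by Spark are pinning pyarrow below 0.15 or re-enabling the legacy format:

```python
import os

# Workaround from the Spark 2.4.x docs for pyarrow >= 0.15.0: re-enable the
# pre-0.15 Arrow IPC format that Spark 2.4 expects. It must be set on the
# driver *and* the executors (e.g. via spark-env.sh or spark.executorEnv.*)
# before the workers spawn their Python processes.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```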
5
votes
2 answers

error: command 'cmake' failed: No such file or directory

Getting an error while installing vaex in PyCharm with Python 3.8. I installed the below before running this on my Win-10 64-bit machine: cmake v3.15.3, pep517 v0.8.1, pip v19.3.1. Error logs: running build_ext creating build\temp.win-amd64-3.8 creating…
user1222006
  • 159
  • 1
  • 3
  • 11
5
votes
1 answer

Converted Apache Arrow file from dataframe gives null while reading with arrow.js

I converted a sample dataframe to an .arrow file using pyarrow: import numpy as np import pandas as pd import pyarrow as pa df = pd.DataFrame({"a": [10, 2, 3]}) df['a'] = pd.to_numeric(df['a'],errors='coerce') table = pa.Table.from_pandas(df) writer…
Sarath
  • 9,030
  • 11
  • 51
  • 84
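
A detail worth checking in this situation: pyarrow writes two IPC layouts, the file format (RecordBatchFileWriter) and the stream format (RecordBatchStreamWriter), and an arrow.js reader fed the wrong one can come up empty. A sketch of writing the stream format for the same dataframe, assuming that mismatch is the problem:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [10, 2, 3]})
df["a"] = pd.to_numeric(df["a"], errors="coerce")
table = pa.Table.from_pandas(df, preserve_index=False)

# Write the Arrow IPC *stream* format rather than the *file* format.
sink = pa.OSFile("df.arrow", "wb")
writer = pa.RecordBatchStreamWriter(sink, table.schema)
writer.write_table(table)
writer.close()
sink.close()
```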
5
votes
1 answer

Excessively high memory usage when reading parquet in Python

I have a parquet file of around 10+GB whose columns are mainly strings. When loading it into memory, usage peaks at about 110GB, then drops back to around 40GB once loading finishes. I'm working on a high-performance…
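
The gap between the 110GB peak and the 40GB steady state typically comes from materializing the whole file at once (plus the pandas conversion of string columns). One way to bound the peak, sketched with a hypothetical file, is to stream it one row group at a time:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("big.parquet", memory_map=True)  # hypothetical file

# Process one row group at a time; peak memory is then roughly the size
# of a single decompressed row group rather than the whole file.
for i in range(pf.num_row_groups):
    table = pf.read_row_group(i)
    print(i, table.num_rows)  # placeholder for the real per-chunk work
```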
5
votes
2 answers

Reading Parquet File with Array<Map<String, String>> Column

I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. array<map<string,string>>). An example of the df would be: import pandas as pd df = pd.DataFrame.from_records([ (1,…
Jon.H
  • 794
  • 2
  • 9
  • 23
5
votes
2 answers

parquet file size, firehose vs. spark

I'm generating Parquet files via two methods: a Kinesis Firehose and a Spark job. They are both written into the same partition structure on S3. Both sets of data can be queried using the same Athena table definition. Both use gzip…
jph
  • 2,181
  • 3
  • 30
  • 55
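
When two writers produce different file sizes from the same data, the parquet metadata usually explains it: row-group sizing, encodings, and dictionary use all differ between Firehose and Spark defaults. A quick comparison sketch with placeholder file names:

```python
import pyarrow.parquet as pq

for path in ("firehose.parquet", "spark.parquet"):  # placeholder names
    meta = pq.ParquetFile(path).metadata
    print(path, "-", meta.num_rows, "rows in", meta.num_row_groups, "row groups")
    rg = meta.row_group(0)
    print("  first row group:", rg.num_rows, "rows,",
          rg.total_byte_size, "bytes uncompressed")
```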
5
votes
1 answer

Is there a Python module to read avro files with pyarrow?

I know there is pyarrow.parquet for reading parquet files as an Arrow table, but I'm looking for the equivalent for Avro.
djohon
  • 705
  • 2
  • 10
  • 25
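
pyarrow itself ships no Avro reader, so the common workaround is to decode with a dedicated Avro library and build the Arrow table from the decoded records. A sketch using fastavro (Table.from_pylist needs pyarrow >= 7.0; the file name is a placeholder):

```python
import fastavro
import pyarrow as pa

# Decode the Avro container file into plain dicts with fastavro...
with open("data.avro", "rb") as f:  # placeholder file name
    records = list(fastavro.reader(f))

# ...then assemble an Arrow table from the records.
table = pa.Table.from_pylist(records)
print(table.schema)
```

This decodes everything into memory first; for larger files, records can be batched and appended with pq.ParquetWriter as in the ParquetWriter entry above.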
5
votes
0 answers

Writing stream of big data to Parquet with Python

I want to write a stream of big data to a parquet file with Python. My data is huge and I cannot keep it in memory to write it in one go. I found two Python libraries (pyarrow, fastparquet) that can read and write Parquet files. This…
Mohsen Laali
  • 463
  • 1
  • 3
  • 17