Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1078 questions
0 votes, 2 answers

Generate parquet from CSV on the fly in Python

I have multiple very large datasets stored in CSV format in an S3 bucket. I need to convert these CSVs to Apache Parquet files. I don't have (nor want) any Spark cluster, so correct me if I'm wrong, but it seems to me that pyspark cannot be of any…
T. C. • 1 • 1 • 2
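
A minimal sketch of the pyarrow-only route (no Spark involved): read the CSV with pyarrow.csv and write it back out as Parquet. File names are hypothetical; for objects in S3 you would download the file first or hand read_csv an s3fs file object instead of a local path.

    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    # read the CSV into an Arrow table, then persist it as Parquet
    table = pv.read_csv("dataset.csv")
    pq.write_table(table, "dataset.parquet")
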
0 votes, 1 answer

Extremely high memory usage with pyarrow reading gzipped parquet files

I have a (set of) gzipped parquet files with about 210 columns, of which I am loading about 100 columns into a pandas dataframe. It works fine and very fast when the file size is about 1 MB (with about 50 rows); the python3 process consumes < 500 MB…
ITnotIT • 316 • 3 • 10
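
A minimal sketch of limiting the read to the needed columns, assuming the gzip compression here is Parquet-internal (gzip-compressed column chunks) rather than a whole-file .gz wrapper; file and column names are hypothetical.

    import pyarrow.parquet as pq

    # only the ~100 columns actually used get materialised
    needed = ["col_a", "col_b", "col_c"]
    table = pq.read_table("data.parquet", columns=needed)
    df = table.to_pandas()

Note that to_pandas() briefly holds both the Arrow and the pandas copy of the data, so peak memory can be roughly twice the table size.
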
0 votes, 2 answers

Allow duplicate columns pandas / refresh column dtypes upon updating column header

I'm creating a dataframe from string data whose header row has duplicate columns. Because pandas by default auto-renames duplicate columns, it appends a '.1', '.2', and so on suffix to each duplicate. formatted_data =…
Krunal Patel • 85 • 1 • 8
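
One hedged workaround, sketched below: read the body without a header and assign the original (duplicated) names afterwards, since pandas only mangles names during parsing, not on direct assignment. The sample data is hypothetical.

    import io
    import pandas as pd

    formatted_data = "a,b,a\n1,2,3\n4,5,6\n"

    buf = io.StringIO(formatted_data)
    names = buf.readline().strip().split(",")  # ['a', 'b', 'a']
    df = pd.read_csv(buf, header=None)         # body only, numeric columns
    df.columns = names                         # duplicates survive intact
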
0 votes, 0 answers

How to persist kdb tables to compressed parquet?

I'm trying to store/persist kdb tables in the compressed Apache Parquet format. My initial plan is basically to use embedPy to make either fastparquet or pyarrow.parquet usable from within q. I'll then use the kdb+ tick architecture to process…
Natalie Williams • 355 • 1 • 3 • 9
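
A minimal sketch of the Python half of that plan: build an Arrow table (here from a plain dict standing in for data handed over from q via embedPy) and write it with compression. Column names are hypothetical.

    import pyarrow as pa
    import pyarrow.parquet as pq

    data = {"sym": ["AAPL", "MSFT"], "price": [157.0, 138.5]}
    table = pa.Table.from_pydict(data)

    # snappy is the default codec; gzip trades write speed for file size
    pq.write_table(table, "ticks.parquet", compression="gzip")
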
0 votes, 1 answer

Redshift Spectrum incorrectly parsing pyarrow datetime64[ns]

I have an external table in Redshift Spectrum whose DDL has a datetime column declared roughly as below: collector_tstamp TIMESTAMP WITHOUT TIME ZONE. Objective: I am trying to write a certain set of data to Parquet and then add the partition to Spectrum to see if…
Gagan • 1,775 • 5 • 31 • 59
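
A minimal sketch, assuming the usual culprit: pyarrow stores pandas datetime64[ns] values as nanosecond timestamps, which Spectrum misreads, so coerce them to a coarser unit when writing. The data here is hypothetical.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame(
        {"collector_tstamp": pd.to_datetime(["2019-08-01 12:00:00"])}
    )
    table = pa.Table.from_pandas(df)
    pq.write_table(
        table,
        "partition.parquet",
        coerce_timestamps="ms",           # store milliseconds, not nanoseconds
        allow_truncated_timestamps=True,  # don't raise on sub-ms precision loss
    )
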
0 votes, 1 answer

Since dask 2.2.0, the read_parquet filters parameter no longer seems to work with the pyarrow engine

When I upgraded dask from 2.1.0 to 2.2.0 (or 2.3.0), the following code changed its behaviour and stopped filtering parquet files as it did before. This only happens with the pyarrow engine (the fastparquet engine still filters correctly). I tried…
denren • 1
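
For reference, a minimal sketch of the call in question; with the pyarrow engine, filters is a list of (column, operator, value) tuples and only prunes whole row groups or partitions whose statistics rule the predicate out. The path and column name are hypothetical.

    import dask.dataframe as dd

    ddf = dd.read_parquet(
        "data/*.parquet",
        engine="pyarrow",
        filters=[("year", "==", 2019)],
    )
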
0 votes, 2 answers

Is date conversion not implemented when reading CSV using pyarrow?

I want to use pyarrow 0.14.1 in Python 3.6 to read a CSV file which has a column called Date where the date values are in YYYY-MM-DD format (e.g. 2018-11-17). I want to convert the date values to date32() format using ConvertOptions.column_types as…
Tooleojim • 1 • 1
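
A minimal sketch of a possible workaround, assuming direct string-to-date32 conversion is the unimplemented part: parse the column as a timestamp (which the CSV reader does support) and cast it to date32 afterwards. This relies on ChunkedArray.cast as available in recent pyarrow versions; the file name is hypothetical.

    import pyarrow as pa
    import pyarrow.csv as pv

    opts = pv.ConvertOptions(column_types={"Date": pa.timestamp("s")})
    table = pv.read_csv("data.csv", convert_options=opts)

    # cast the midnight timestamps down to plain dates
    idx = table.schema.get_field_index("Date")
    dates = table.column(idx).cast(pa.date32())
    table = table.set_column(idx, pa.field("Date", pa.date32()), dates)
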
0 votes, 2 answers

Apply a function over a column in a group in PySpark dataframe

I have a PySpark dataframe like this,

+----+----+----+
| id_|   p|   a|
+----+----+----+
|   1|   4|  12|
|   1|   3|  14|
|   1|  -7|  16|
|   1|   5 …
Sreeram TP • 11,346 • 7 • 54 • 108
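
A minimal sketch of a grouped-map pandas_udf, the Arrow-backed way (Spark 2.3/2.4 style) to apply a pandas function per group; the cumulative sum is only a placeholder for whatever the actual per-group function is.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 4, 12), (1, 3, 14), (1, -7, 16), (1, 5, 18)],
        ["id_", "p", "a"],
    )

    @pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
    def per_group(pdf):
        # pdf is a pandas DataFrame holding one id_ group
        pdf["p"] = pdf["p"].cumsum()
        return pdf

    result = df.groupby("id_").apply(per_group)
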
0 votes, 1 answer

Why does feather need pyarrow? (or: How to load feather data without downgrading to pandas 0.24?)

When I run a simple command to load feather data, i.e. pd.read_feather("data.feather"), I get this error message: Missing optional dependency 'pyarrow'. Use pip or conda to install pyarrow. Surely I can install pyarrow from conda-forge, but that…
Martien Lubberink • 2,614 • 1 • 19 • 31
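
For reference, a minimal sketch: with pyarrow installed (pip install pyarrow, or conda install -c conda-forge pyarrow), pd.read_feather works again, or the file can be opened through pyarrow directly. The file name is hypothetical.

    import pyarrow.feather as feather

    # reads the Feather file into a pandas DataFrame
    df = feather.read_feather("data.feather")
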
0 votes, 0 answers

pyarrow: .parquet file that used to work perfectly is now unreadable

The file was created with pandas a few days ago. When trying to read the file with pd.read_parquet(filename) I get: ArrowIOError: Corrupted file, smaller than file footer. What can…
Dror Hilman • 6,837 • 9 • 39 • 56
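
A minimal diagnostic sketch, based on the Parquet file layout: a valid file ends with a 4-byte little-endian footer length followed by the 4-byte magic PAR1, and this error typically means that tail is missing (an interrupted write or a truncated copy). The file name is hypothetical.

    import struct

    with open("data.parquet", "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        print("file size:", size)
        if size >= 8:
            f.seek(-8, 2)
            footer_len = struct.unpack("<i", f.read(4))[0]
            magic = f.read(4)
            print("footer length:", footer_len, "magic:", magic)  # expect b'PAR1'
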
0 votes, 1 answer

What is the root cause of PyArrow HDFS IO error?

I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code shown in the traceback below) using PyArrow's HDFS IO library. However, the job intermittently runs into the error shown below; it doesn't happen on every run, only sometimes. I'm unable to…
aaron02 • 320 • 2 • 14
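
For context, a minimal sketch of the kind of write path involved, using the legacy pyarrow.hdfs API; the path and payload are hypothetical.

    import json
    import pyarrow as pa

    fs = pa.hdfs.connect()  # libhdfs-based HDFS client
    payload = json.dumps({"result": 42}).encode()
    with fs.open("/user/someuser/results.json", "wb") as f:
        f.write(payload)
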
0 votes, 1 answer

pyarrow read_table has no 'parquet version' parameter

Using pyarrow I can write parquet files of version 2.0; the pyarrow.parquet.write_table method has a 'version' parameter. But there is no 'version' parameter for the pyarrow.parquet.read_table method, and it seems it can only read parquet files of version…
gs_vlad • 1,409 • 4 • 15 • 29
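
A minimal sketch of the asymmetry in question ("2.0" being a valid version value in the pyarrow releases of that era): 'version' selects the format when writing, while the reader needs no such knob because it detects the format from the file's own footer metadata. The file name is hypothetical.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pydict({"x": [1, 2, 3]})
    pq.write_table(table, "v2.parquet", version="2.0")

    # no 'version' argument; the footer tells the reader what it is reading
    round_tripped = pq.read_table("v2.parquet")
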
0 votes, 1 answer

Converting NaN floats to other types in Parquet format

I am currently processing a bunch of CSV files and transforming them into Parquet. I use these with Hive and query the files directly. I would like to switch over to Dask for my data processing. The data I am reading has optional columns, some of…
Eumcoz • 2,388 • 1 • 21 • 44
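
One hedged approach, sketched below: pass an explicit schema to Table.from_pandas so a float column holding NaN (pandas' stand-in for missing integers) is written as an int64 column with nulls instead of staying float. The column name is hypothetical.

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"maybe_int": [1.0, np.nan, 3.0]})  # pandas forces float

    schema = pa.schema([pa.field("maybe_int", pa.int64())])
    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, "out.parquet")  # int64 column, NaN becomes null
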
0 votes, 1 answer

Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow's hdfs API

Here's what I'm trying:

    import pyarrow as pa
    conf = {"hadoop.security.authentication": "kerberos"}
    fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_44444", extra_conf=conf)

However, when I submit this job to the cluster using Dask-YARN, I get the…
Saurabh • 163 • 9
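
A minimal sketch of one common fix, assuming the ticket cache simply does not exist on the worker nodes: obtain a ticket from a keytab on each worker before connecting. The keytab path and principal are hypothetical, and kinit must be on the worker's PATH.

    import subprocess
    import pyarrow as pa

    # acquire a Kerberos ticket non-interactively from a keytab
    subprocess.run(
        ["kinit", "-kt", "/etc/security/keytabs/user.keytab", "user@EXAMPLE.COM"],
        check=True,
    )

    conf = {"hadoop.security.authentication": "kerberos"}
    fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_44444", extra_conf=conf)
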
0 votes, 1 answer

Couldn't build egg file for pyarrow

Issue: couldn't build an egg file for pyarrow; tried with pyarrow versions 0.12.1 and 0.13. Could you please help me understand if I'm missing anything?

$ python setup.py bdist_egg

Log trace:
running bdist_egg
running egg_info
writing entry points…
Naga Budigam • 689 • 1 • 10 • 26