Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1078 questions
0 votes, 2 answers

Generate parquet from CSV on the fly in Python

I have multiple very large datasets stored in CSV format in an S3 bucket. I need to convert these CSVs to Apache Parquet files. I don't have (nor want) any Spark cluster, so correct me if I'm wrong, but it seems to me that pyspark cannot be of any…
T. C. • 1 • 1 • 2
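
A minimal sketch of the pyarrow-only route (no Spark involved): read the CSV with pyarrow.csv and write it back out as Parquet. File names are hypothetical; for objects in S3 you would download the file first or hand read_csv an s3fs file object instead of a local path.

    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    # read the CSV into an Arrow table, then persist it as Parquet
    table = pv.read_csv("dataset.csv")
    pq.write_table(table, "dataset.parquet")
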
0 votes, 1 answer

Extremely high memory usage with pyarrow reading gzipped parquet files

I have a (set of) gzipped parquet files with about 210 columns, of which I am loading about 100 columns into a pandas dataframe. It works fine and very fast when the file size is about 1 MB (with about 50 rows); the python3 process consumes < 500 MB…
ITnotIT • 316 • 3 • 10
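
A minimal sketch of limiting the read to the needed columns, assuming the gzip compression here is Parquet-internal (gzip-compressed column chunks) rather than a whole-file .gz wrapper; file and column names are hypothetical.

    import pyarrow.parquet as pq

    # only the ~100 columns actually used get materialised
    needed = ["col_a", "col_b", "col_c"]
    table = pq.read_table("data.parquet", columns=needed)
    df = table.to_pandas()

Note that to_pandas() briefly holds both the Arrow and the pandas copy of the data, so peak memory can be roughly twice the table size.
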
0 votes, 2 answers

Allow duplicate columns pandas / refresh column dtypes upon updating column header

I'm creating a dataframe from string data whose header row has duplicate columns. Because pandas by default auto-renames duplicate columns, it appends a '.1', '.2', and so on suffix to each duplicate. formatted_data =…
Krunal Patel • 85 • 1 • 8
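
One hedged workaround, sketched below: read the body without a header and assign the original (duplicated) names afterwards, since pandas only mangles names during parsing, not on direct assignment. The sample data is hypothetical.

    import io
    import pandas as pd

    formatted_data = "a,b,a\n1,2,3\n4,5,6\n"

    buf = io.StringIO(formatted_data)
    names = buf.readline().strip().split(",")  # ['a', 'b', 'a']
    df = pd.read_csv(buf, header=None)         # body only, numeric columns
    df.columns = names                         # duplicates survive intact
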
0 votes, 0 answers

How to persist kdb tables to compressed parquet?

I'm trying to store/persist kdb tables in the compressed Apache Parquet format. My initial plan is basically to use embedPy to make either fastparquet or pyarrow.parquet usable from within q. I'll then use the kdb+ tick architecture to process…
Natalie Williams • 355 • 1 • 3 • 9
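
A minimal sketch of the Python half of that plan: build an Arrow table (here from a plain dict standing in for data handed over from q via embedPy) and write it with compression. Column names are hypothetical.

    import pyarrow as pa
    import pyarrow.parquet as pq

    data = {"sym": ["AAPL", "MSFT"], "price": [157.0, 138.5]}
    table = pa.Table.from_pydict(data)

    # snappy is the default codec; gzip trades write speed for file size
    pq.write_table(table, "ticks.parquet", compression="gzip")
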
0 votes, 1 answer

Redshift Spectrum incorrectly parsing pyarrow datetime64[ns]

I have an external table in Redshift Spectrum whose DDL has a datetime column declared roughly as below: collector_tstamp TIMESTAMP WITHOUT TIME ZONE. Objective: I am trying to write a certain set of data to Parquet and then add the partition to Spectrum to see if…
Gagan • 1,775 • 5 • 31 • 59
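
A minimal sketch, assuming the usual culprit: pyarrow stores pandas datetime64[ns] values as nanosecond timestamps, which Spectrum misreads, so coerce them to a coarser unit when writing. The data here is hypothetical.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame(
        {"collector_tstamp": pd.to_datetime(["2019-08-01 12:00:00"])}
    )
    table = pa.Table.from_pandas(df)
    pq.write_table(
        table,
        "partition.parquet",
        coerce_timestamps="ms",           # store milliseconds, not nanoseconds
        allow_truncated_timestamps=True,  # don't raise on sub-ms precision loss
    )
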
0 votes, 1 answer

Since dask 2.2.0, the read_parquet filters parameter no longer seems to work with the pyarrow engine

When I upgraded dask from 2.1.0 to 2.2.0 (or 2.3.0), the following code changed its behaviour and stopped filtering parquet files as it did before. This only happens with the pyarrow engine (the fastparquet engine still filters correctly). I tried…
denren • 1
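
For reference, a minimal sketch of the call in question; with the pyarrow engine, filters is a list of (column, operator, value) tuples and only prunes whole row groups or partitions whose statistics rule the predicate out. The path and column name are hypothetical.

    import dask.dataframe as dd

    ddf = dd.read_parquet(
        "data/*.parquet",
        engine="pyarrow",
        filters=[("year", "==", 2019)],
    )
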
0 votes, 2 answers

Is date conversion not implemented when reading CSV using pyarrow?

I want to use pyarrow 0.14.1 in Python 3.6 to read a CSV file which has a column called Date where the date values are in YYYY-MM-DD format (e.g. 2018-11-17). I want to convert the date values to date32() format using ConvertOptions.column_types as…
Tooleojim • 1 • 1
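
A minimal sketch of a possible workaround, assuming direct string-to-date32 conversion is the unimplemented part: parse the column as a timestamp (which the CSV reader does support) and cast it to date32 afterwards. This relies on ChunkedArray.cast as available in recent pyarrow versions; the file name is hypothetical.

    import pyarrow as pa
    import pyarrow.csv as pv

    opts = pv.ConvertOptions(column_types={"Date": pa.timestamp("s")})
    table = pv.read_csv("data.csv", convert_options=opts)

    # cast the midnight timestamps down to plain dates
    idx = table.schema.get_field_index("Date")
    dates = table.column(idx).cast(pa.date32())
    table = table.set_column(idx, pa.field("Date", pa.date32()), dates)
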
0 votes, 2 answers

Apply a function over a column in a group in PySpark dataframe

I have a PySpark dataframe like this,

+----+----+----+
| id_|   p|   a|
+----+----+----+
|   1|   4|  12|
|   1|   3|  14|
|   1|  -7|  16|
|   1|   5 …
Sreeram TP • 11,346 • 7 • 54 • 108
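
A minimal sketch of a grouped-map pandas_udf, the Arrow-backed way (Spark 2.3/2.4 style) to apply a pandas function per group; the cumulative sum is only a placeholder for whatever the actual per-group function is.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 4, 12), (1, 3, 14), (1, -7, 16), (1, 5, 18)],
        ["id_", "p", "a"],
    )

    @pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
    def per_group(pdf):
        # pdf is a pandas DataFrame holding one id_ group
        pdf["p"] = pdf["p"].cumsum()
        return pdf

    result = df.groupby("id_").apply(per_group)
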
0 votes, 1 answer

Why does feather need pyarrow? (or: How to load feather data without downgrading to pandas 0.24?)

When I run a simple command to load feather data, i.e. pd.read_feather("data.feather"), I get this error message: Missing optional dependency 'pyarrow'. Use pip or conda to install pyarrow. Surely I can install pyarrow from conda-forge, but that…
Martien Lubberink • 2,614 • 1 • 19 • 31
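
For reference, a minimal sketch: with pyarrow installed (pip install pyarrow, or conda install -c conda-forge pyarrow), pd.read_feather works again, or the file can be opened through pyarrow directly. The file name is hypothetical.

    import pyarrow.feather as feather

    # reads the Feather file into a pandas DataFrame
    df = feather.read_feather("data.feather")
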
0 votes, 0 answers

pyarrow: .parquet file that used to work perfectly is now unreadable

The file was created with pandas a few days ago. When trying to read the file with pd.read_parquet(filename) I get: ArrowIOError: Corrupted file, smaller than file footer. What can…
Dror Hilman • 6,837 • 9 • 39 • 56
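
A minimal diagnostic sketch, based on the Parquet file layout: a valid file ends with a 4-byte little-endian footer length followed by the 4-byte magic PAR1, and this error typically means that tail is missing (an interrupted write or a truncated copy). The file name is hypothetical.

    import struct

    with open("data.parquet", "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        print("file size:", size)
        if size >= 8:
            f.seek(-8, 2)
            footer_len = struct.unpack("<i", f.read(4))[0]
            magic = f.read(4)
            print("footer length:", footer_len, "magic:", magic)  # expect b'PAR1'
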
0 votes, 1 answer

What is the root cause of PyArrow HDFS IO error?

I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code shown in the traceback below) using PyArrow's HDFS IO library. However, the job intermittently runs into the error shown below; it doesn't happen on every run, only sometimes. I'm unable to…
aaron02 • 320 • 2 • 14
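
For context, a minimal sketch of the kind of write path involved, using the legacy pyarrow.hdfs API; the path and payload are hypothetical.

    import json
    import pyarrow as pa

    fs = pa.hdfs.connect()  # libhdfs-based HDFS client
    payload = json.dumps({"result": 42}).encode()
    with fs.open("/user/someuser/results.json", "wb") as f:
        f.write(payload)
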
0 votes, 1 answer

pyarrow read_table has no 'parquet version' parameter

Using pyarrow I can write parquet files of version 2.0; the pyarrow.parquet.write_table method has a 'version' parameter. But there is no 'version' parameter for the pyarrow.parquet.read_table method, and it seems it can only read parquet files of version…
gs_vlad • 1,409 • 4 • 15 • 29
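
A minimal sketch of the asymmetry in question ("2.0" being a valid version value in the pyarrow releases of that era): 'version' selects the format when writing, while the reader needs no such knob because it detects the format from the file's own footer metadata. The file name is hypothetical.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pydict({"x": [1, 2, 3]})
    pq.write_table(table, "v2.parquet", version="2.0")

    # no 'version' argument; the footer tells the reader what it is reading
    round_tripped = pq.read_table("v2.parquet")
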
0 votes, 1 answer

Converting NaN floats to other types in Parquet format

I am currently processing a bunch of CSV files and transforming them into Parquet. I use these with Hive and query the files directly. I would like to switch over to Dask for my data processing. The data I am reading has optional columns, some of…
Eumcoz • 2,388 • 1 • 21 • 44
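
One hedged approach, sketched below: pass an explicit schema to Table.from_pandas so a float column holding NaN (pandas' stand-in for missing integers) is written as an int64 column with nulls instead of staying float. The column name is hypothetical.

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"maybe_int": [1.0, np.nan, 3.0]})  # pandas forces float

    schema = pa.schema([pa.field("maybe_int", pa.int64())])
    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, "out.parquet")  # int64 column, NaN becomes null
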
0 votes, 1 answer

Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow's hdfs API

Here's what I'm trying:

    import pyarrow as pa
    conf = {"hadoop.security.authentication": "kerberos"}
    fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_44444", extra_conf=conf)

However, when I submit this job to the cluster using Dask-YARN, I get the…
Saurabh • 163 • 9
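
A minimal sketch of one common fix, assuming the ticket cache simply does not exist on the worker nodes: obtain a ticket from a keytab on each worker before connecting. The keytab path and principal are hypothetical, and kinit must be on the worker's PATH.

    import subprocess
    import pyarrow as pa

    # acquire a Kerberos ticket non-interactively from a keytab
    subprocess.run(
        ["kinit", "-kt", "/etc/security/keytabs/user.keytab", "user@EXAMPLE.COM"],
        check=True,
    )

    conf = {"hadoop.security.authentication": "kerberos"}
    fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_44444", extra_conf=conf)
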
0 votes, 1 answer

Couldn't build egg file for pyarrow

Issue: couldn't build an egg file for pyarrow; tried with pyarrow versions 0.12.1 and 0.13. Could you please help me understand if I'm missing anything?

$ python setup.py bdist_egg

Log trace:
running bdist_egg
running egg_info
writing entry points…
Naga Budigam • 689 • 1 • 10 • 26