Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.


1078 questions
0 votes, 1 answer

How to make and efficiently run a "void" PySpark user-defined function (UDF) that returns nothing?

Given the available methods for specifying user-defined functions in PySpark (row-at-a-time native PySpark UDFs, and Pandas UDFs that make use of Apache Arrow), how could one create and run on a dataframe a user-defined function that does not return…
Jake Spracher
0 votes, 2 answers

Error converting a big CSV to Parquet using Python

I have a CSV file with roughly 200+ columns and 1M+ rows. When converting it from CSV to Parquet in Python, I got an error: csv_file = 'bigcut.csv' chunksize = 100_000 parquet_file ='output.parquet' …
Yesaya
0 votes, 1 answer

Streaming files from a tar file in hdfs

I have TIFF images stored in tar files in HDFS. I can download the tar file and stream from it this way: tar = tarfile.open("filename.tar", 'r|') for tiff in tar: if tiff.isfile(): a = tar.extractfile(tiff).read() na =…
Ehsan Fathi
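A self-contained sketch of the streaming pattern in this question, with an in-memory tar standing in for the HDFS file object (in older pyarrow versions the stream could come from a file opened on an HDFS filesystem instead). The `'r|'` mode is the key detail: it reads the archive strictly sequentially, without seeking, which is exactly what a remote input stream requires:

```python
import io
import tarfile

# Build a small in-memory tar to stand in for the tar stored in HDFS.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"fake tiff bytes"
    info = tarfile.TarInfo(name="image1.tiff")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# Stream members one at a time; 'r|' never seeks backwards, so it works
# on non-seekable sources such as an HDFS input stream.
contents = []
with tarfile.open(fileobj=buf, mode="r|") as tar:
    for member in tar:
        if member.isfile():
            contents.append(tar.extractfile(member).read())

print(contents)  # [b'fake tiff bytes']
```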
0 votes, 1 answer

Parsing schema of pyarrow.parquet.ParquetDataset object

I'm using pyarrow to read parquet data from s3 and I'd like to be able to parse the schema and convert it to a format suitable for running an mLeap serialized model outside of Spark. This requires parsing the schema. If I had a Pyspark dataframe, I…
femibyte
0 votes, 2 answers

Preserve index when loading pyarrow parquet from pandas DataFrame

I need to convert a dict with dict values to parquet. I have data that looks like this: {"KEY":{"2018-12-06":250.0,"2018-12-07":234.0}} I'm converting to a pandas DataFrame and then writing to a pyarrow table: import pandas as pd import pyarrow as…
unixeo
0 votes, 1 answer

Set pants interpreter for Pyarrow

I am using Pants to create a .pex file for my project. My BUILD file declares a dependency on pyarrow using the 3rdparty logic: '3rdparty/python:pyarrow'. Pants builds pyarrow using both the C++ and Python libraries; I have pyarrow installed in Anaconda, not in the standard…
Alexandr Proskurin
0 votes, 0 answers

Merging Parquet Files - Pandas Meta in Schema Mismatch

I am trying to merge multiple parquet files into one. Their schemas are identical field-wise, but my ParquetWriter complains that they are not. After some investigation I found that the pandas meta in the schemas differs, causing this…
micah
0 votes, 1 answer

Apache-Drill query parquet file: Error in parquet record reader

I've created a parquet file using Pyarrow and it can be queried using Pyspark. However, it cannot be queried using Apache Drill (1.14), which was installed recently and works with other data formats including csv, json and RDBs. Can someone help me…
Ray
0 votes, 1 answer

pandas CSV to Parquet data type is not set correctly when column has no values

I'm using the pandas read_csv function, and from time to time columns have no values. In this case the data type passed via the dtype parameter is ignored. import pandas as pd df = pd.read_csv("example.csv", dtype={"col1": "str", "col2":…
Ori N
0 votes, 1 answer

Reading csv file from hdfs using dask and pyarrow

We are trying out dask_yarn version 0.3.0 (with dask 0.18.2) because of conflicts with boost-cpp; I'm running pyarrow version 0.10.0. We are trying to read a csv file from hdfs, however we get an error when running…
skibee
0 votes, 1 answer

Unable to read parquet file, giving Gzip code failed error

I am trying to convert a parquet file to csv with pyarrow. df = pd.read_parquet('test.parquet') The above code works fine with the sample parquet files downloaded from GitHub, but when I try with the actual large parquet file, it gives the…
Pri31
0 votes, 2 answers

Feather.compat import ModuleNotFoundError: No module named 'feather.compat'

Same problem as described in this post, but it has no accepted answer. Wondering if there is something going on between pyarrow and feather. I tried environments where: I installed with conda install feather-format -c conda-forge Installed…
WRosko
0 votes, 0 answers

Parquet file not accessible to write after first read using PyArrow

I am trying to read a parquet file into a pandas dataframe, do some manipulation, and write it back to the same file; however, the file seems not to be writable after the first read in the same function. It only works if I don't perform STEP 1…
SSingh
0 votes, 0 answers

Python: Can't connect to HDFS files

I have already tried many different ways, but none of them works. For example, the following way failed with the error "The system cannot find the file specified." Example: import pyarrow as pa fs = pa.hdfs.connect('192.168.100.45', 20500,…
AntnR
0 votes, 0 answers

preserving dask dataframe divisions in parquet

When I save a dask dataframe with valid divisions, the divisions are not present when reading it back: df.divisions # ['a', 'b', 'c', ...] df.to_parquet('frame.pq', engine=engine, write_index=True, compute=True) df2 =…
Daniel Mahler