Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.


1078 questions
0 votes, 1 answer

How to make and efficiently run a "void" PySpark user-defined function (UDF) that returns nothing?

Given the available methods for specifying user-defined functions in PySpark (row-at-a-time native PySpark UDFs, and Pandas UDFs that make use of Apache Arrow), how could one create and run on a dataframe a user-defined function that does not return…
Jake Spracher
0 votes, 2 answers

Error converting a big CSV to Parquet using Python

I have a CSV file with roughly 200+ columns and 1M+ rows. When converting it from CSV to Parquet in Python, I got an error: csv_file = 'bigcut.csv' chunksize = 100_000 parquet_file ='output.parquet' …
Yesaya
0 votes, 1 answer

Streaming files from a tar file in hdfs

I have TIFF images stored in tar files in HDFS. I can download the tar file and stream from it this way: tar = tarfile.open("filename.tar", 'r|') for tiff in tar: if tiff.isfile(): a = tar.extractfile(tiff).read() na =…
Ehsan Fathi
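A self-contained sketch of the streaming pattern in this question, with an in-memory tar standing in for the HDFS file object (in older pyarrow versions the stream could come from a file opened on an HDFS filesystem instead). The `'r|'` mode is the key detail: it reads the archive strictly sequentially, without seeking, which is exactly what a remote input stream requires:

```python
import io
import tarfile

# Build a small in-memory tar to stand in for the tar stored in HDFS.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"fake tiff bytes"
    info = tarfile.TarInfo(name="image1.tiff")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# Stream members one at a time; 'r|' never seeks backwards, so it works
# on non-seekable sources such as an HDFS input stream.
contents = []
with tarfile.open(fileobj=buf, mode="r|") as tar:
    for member in tar:
        if member.isfile():
            contents.append(tar.extractfile(member).read())

print(contents)  # [b'fake tiff bytes']
```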
0 votes, 1 answer

Parsing schema of pyarrow.parquet.ParquetDataset object

I'm using pyarrow to read parquet data from s3 and I'd like to be able to parse the schema and convert it to a format suitable for running an mLeap serialized model outside of Spark. This requires parsing the schema. If I had a Pyspark dataframe, I…
femibyte
0 votes, 2 answers

Preserve index when loading pyarrow parquet from pandas DataFrame

I need to convert a dict with dict values to parquet. I have data that looks like this: {"KEY":{"2018-12-06":250.0,"2018-12-07":234.0}} I'm converting to a pandas DataFrame and then writing to a pyarrow table: import pandas as pd import pyarrow as…
unixeo
0 votes, 1 answer

Set pants interpreter for Pyarrow

I am using Pants to create a .pex file for my project. My BUILD file declares a dependency on pyarrow using the 3rdparty logic: '3rdparty/python:pyarrow'. Pants builds pyarrow using both the C++ and Python libraries; I have pyarrow installed in Anaconda, not in the standard…
Alexandr Proskurin
0 votes, 0 answers

Merging Parquet Files - Pandas Meta in Schema Mismatch

I am trying to merge multiple parquet files into one. Their schemas are identical field-wise, but my ParquetWriter complains that they are not. After some investigation I found that the pandas meta in the schemas differs, causing this…
micah
0 votes, 1 answer

Apache-Drill query parquet file: Error in parquet record reader

I've created a parquet file using Pyarrow and it can be queried using Pyspark. However, it cannot be queried using Apache Drill (1.14), which was installed recently and works with other data formats including csv, json and RDBs. Can someone help me…
Ray
0 votes, 1 answer

pandas CSV to Parquet data type is not set correctly when column has no values

I'm using the pandas read_csv function, and from time to time columns have no values. In this case the data type passed via the dtype parameter is ignored. import pandas as pd df = pd.read_csv("example.csv", dtype={"col1": "str", "col2":…
Ori N
0 votes, 1 answer

Reading csv file from hdfs using dask and pyarrow

We are trying out dask_yarn version 0.3.0 (with dask 0.18.2) because of conflicts with boost-cpp; I'm running pyarrow version 0.10.0. We are trying to read a csv file from hdfs, however we get an error when running…
skibee
0 votes, 1 answer

Unable to read parquet file, giving Gzip code failed error

I am trying to convert a parquet file to csv with pyarrow. df = pd.read_parquet('test.parquet') The above code works fine with the sample parquet files downloaded from GitHub, but when I try with the actual large parquet file, it gives the…
Pri31
0 votes, 2 answers

Feather.compat import ModuleNotFoundError: No module named 'feather.compat'

Same problem as described in this post, but it has no accepted answer. Wondering if there is something going on between pyarrow and feather. I tried environments where: I installed with conda install feather-format -c conda-forge Installed…
WRosko
0 votes, 0 answers

Parquet file not accessible to write after first read using PyArrow

I am trying to read a parquet file into a pandas dataframe, do some manipulation, and write it back to the same file; however, the file seems not to be writable after the first read in the same function. It only works if I don't perform STEP 1…
SSingh
0 votes, 0 answers

Python: Can't connect to HDFS files

I have already tried many different ways, but none of them works. For example, the following way failed with the error "The system cannot find the file specified." Example: import pyarrow as pa fs = pa.hdfs.connect('192.168.100.45', 20500,…
AntnR
0 votes, 0 answers

preserving dask dataframe divisions in parquet

When I save a dask dataframe with valid divisions, the divisions are not present when reading it back: df.divisions # ['a', 'b', 'c', ...] df.to_parquet('frame.pq', engine=engine, write_index=True, compute=True) df2 =…
Daniel Mahler