
I am trying to convert parquet to csv file with pyarrow.

import pandas as pd

df = pd.read_parquet('test.parquet')

The above code works fine with the sample parquet files downloaded from github.

But when I try with the actual large parquet file, it is giving the following error.

File "_parquet.pyx", line 734, in pyarrow._parquet.ParquetReader.read_all
  File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: IOError: GZipCodec failed: incorrect header check

I have tried to read the parquet file using fastparquet and pyspark as well. But I am getting similar GZip errors.

I suspect the file uses a different compression scheme than the sample one I downloaded.
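
One way to check this is to inspect the file's metadata, which pyarrow can read from the footer without decompressing the data pages. A minimal sketch, where 'test.parquet' is a placeholder for the actual file:

import pyarrow.parquet as pq

# 'test.parquet' is a placeholder path
pf = pq.ParquetFile('test.parquet')
print(pf.metadata)  # overall file metadata

# compression codec of each column chunk in the first row group
rg = pf.metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.compression)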

Any code suggestions, or any other tool that can convert such parquet files to CSV, would be of great help. Thanks.

Edit: It seems these parquet files store binary values rather than the usual string values. Is there any way to read such binary parquet files?

Pri31

1 Answer


This sounds very much like your Parquet file is broken. PySpark, Arrow, and fastparquet are independent implementations of the Parquet format, so this is most likely not a bug in the reader but a corrupt file.

Without more information (e.g. how this file was written), the only answer is that you will not be able to read it.

Otherwise, pd.read_parquet(..).to_csv(..) is enough to convert a Parquet file to CSV.
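
For a file that reads correctly, a minimal sketch of that conversion (the file names are placeholders):

import pandas as pd

# placeholder file names
df = pd.read_parquet('input.parquet')  # uses pyarrow or fastparquet under the hood
df.to_csv('output.csv', index=False)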

Uwe L. Korn
  • Thanks for your reply. I am not sure how the file was written; it was given to me to validate the source data in the files against the data in the destination Hive tables. If the files were broken, the developers would not have been able to load them into the destination, but they were loaded successfully into Hive and I can see the tables with data. I guess a parquet file with a different compression type is causing the issue. Is there any way to read compressed parquet files, if that is the case? – Pri31 Aug 14 '18 at 05:56
  • This file seems to have a valid Parquet header; only a segment has corrupted GZip data, so the approaches you used are the right ones (a row-group check like the sketch after these comments can help isolate the corrupt segment). Spark uses the same library to read Parquet files as Hive does, which makes this issue more complicated. – Uwe L. Korn Aug 14 '18 at 09:19
  • I came to know that these are parquet files in binary format. Is there any way to read these kind of parquet files? – Pri31 Aug 14 '18 at 09:24
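
A sketch that may help narrow down where the corruption sits, by reading one row group at a time with pyarrow (the path is a placeholder):

import pyarrow.parquet as pq

pf = pq.ParquetFile('test.parquet')  # placeholder path
for i in range(pf.num_row_groups):
    try:
        pf.read_row_group(i)
        print('row group', i, ': OK')
    except Exception as exc:
        print('row group', i, ': FAILED -', exc)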