I am trying to convert a parquet file to a CSV file with pyarrow.
import pandas as pd

df = pd.read_parquet('test.parquet')
The above code works fine with the sample parquet files downloaded from GitHub.
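For reference, the full conversion I am attempting is just that read followed by pandas' to_csv (the output name test.csv is only a placeholder):

import pandas as pd

# Read the parquet file into a DataFrame, then write it back out as CSV
df = pd.read_parquet('test.parquet')
df.to_csv('test.csv', index=False)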
But when I try it with the actual, large parquet file, it raises the following error:
File "_parquet.pyx", line 734, in pyarrow._parquet.ParquetReader.read_all
File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: IOError: GZipCodec failed: incorrect header check
I have also tried reading the parquet file with fastparquet and pyspark, but I get similar GZip errors.
My understanding is that this file's compression differs from that of the sample files I downloaded.
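In case it helps with diagnosis, I believe the codec can be checked from the file's metadata with something like this (a minimal sketch using pyarrow.parquet; compression is recorded per column chunk, and I assume this works even when the full read fails, since it only parses the footer):

import pyarrow.parquet as pq

pf = pq.ParquetFile('test.parquet')
print(pf.metadata)  # high-level file metadata: row groups, columns, format version
# The codec is stored per column chunk; inspect the first chunk as a sample
print(pf.metadata.row_group(0).column(0).compression)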
Any suggestions for code, or pointers to any other tool that can convert such parquet files to CSV, would be of great help. Thanks.
Edit: It seems these parquet files store binary values rather than the usual string values. Is there any way to read parquet files with binary columns?
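If the codec error can be worked around, I assume decoding binary columns before writing the CSV would look roughly like this (a sketch that assumes the bytes are UTF-8 text; checking the first row for bytes is just my guess at how to detect such columns):

import pandas as pd

df = pd.read_parquet('test.parquet')

# Decode bytes columns to strings before writing CSV.
# Assumes the binary values are UTF-8 text -- an assumption on my part.
for col in df.columns:
    if not df.empty and df[col].dtype == object and isinstance(df[col].iloc[0], bytes):
        df[col] = df[col].str.decode('utf-8')

df.to_csv('test.csv', index=False)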