This is the same question as here, but the accepted answer does not work for me.
Attempt: I try to save a Dask dataframe in Parquet format and read it with Spark.
Issue: the timestamp column cannot be interpreted by PySpark.
What I have done:
I try to save a Dask dataframe to HDFS as Parquet using:
import dask.dataframe as dd
dd.to_parquet(ddf_param_logs, 'hdfs:///user/<myuser>/<filename>', engine='pyarrow', flavor='spark')
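(For reference, a minimal stand-in for ddf_param_logs with the same kind of timestamp column can be built like this, assuming a datetime64[ns] column named utc_timestamp; the real data comes from elsewhere:)
import pandas as pd
import dask.dataframe as dd

# minimal stand-in frame with a datetime64[ns] timestamp column
pdf = pd.DataFrame({
    'utc_timestamp': pd.to_datetime(['2020-01-10 08:24:50.403', '2020-01-10 08:25:00']),
    'value': [1, 2],
})
ddf_param_logs = dd.from_pandas(pdf, npartitions=1)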
Then I read the file with PySpark:
sdf = spark.read.parquet('hdfs:///user/<myuser>/<filename>')
sdf.show()
>>> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://nameservice1/user/<user>/<filename>/part.0.parquet. Column: [utc_timestamp], Expected: bigint, Found: INT96
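One way to check which physical type was actually written is to read the schema of a single part file with pyarrow (the path below assumes a local copy of part.0.parquet):
import pyarrow.parquet as pq

# print the Parquet schema of one part file to see how utc_timestamp
# was physically stored (INT96 vs int64 / TIMESTAMP_MICROS)
print(pq.read_schema('part.0.parquet'))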
But if I save the dataframe with
dd.to_parquet(ddf_param_logs, 'hdfs:///user/<myuser>/<filename>', engine='pyarrow', use_deprecated_int96_timestamps=True)
the utc_timestamp column contains the timestamp information in Unix format (1578642290403000).
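If that integer is microseconds since the Unix epoch (which 1578642290403000 looks like), it can be decoded manually on the Spark side, but I would prefer the column to come back as a proper timestamp directly. Decoding sketch, assuming microsecond resolution:
from pyspark.sql import functions as F

# workaround sketch only: interpret the raw integer as microseconds since
# the epoch and cast it back to a timestamp
sdf = spark.read.parquet('hdfs:///user/<myuser>/<filename>')
sdf = sdf.withColumn('utc_timestamp',
                     (F.col('utc_timestamp') / 1000000).cast('timestamp'))
sdf.show(truncate=False)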
This is my environment:
dask==2.9.0
dask-core==2.9.0
pandas==0.23.4
pyarrow==0.15.1
pyspark==2.4.3