This is the same question as here, but the accepted answer does not work for me.
Attempt: I try to save a Dask dataframe in Parquet format and read it with Spark.
Issue: the timestamp column cannot be interpreted by PySpark.
What I have done:
I try to save a Dask dataframe to HDFS as Parquet using:
import dask.dataframe as dd
dd.to_parquet(ddf_param_logs, 'hdfs:///user/<myuser>/<filename>', engine='pyarrow', flavor='spark')
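(For reference, a minimal stand-in for ddf_param_logs with the same kind of timestamp column can be built like this, assuming a datetime64[ns] column named utc_timestamp; the real data comes from elsewhere:)
import pandas as pd
import dask.dataframe as dd

# minimal stand-in frame with a datetime64[ns] timestamp column
pdf = pd.DataFrame({
    'utc_timestamp': pd.to_datetime(['2020-01-10 08:24:50.403', '2020-01-10 08:25:00']),
    'value': [1, 2],
})
ddf_param_logs = dd.from_pandas(pdf, npartitions=1)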
Then I read the file with PySpark:
sdf = spark.read.parquet('hdfs:///user/<myuser>/<filename>')
sdf.show()
>>> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://nameservice1/user/<user>/<filename>/part.0.parquet. Column: [utc_timestamp], Expected: bigint, Found: INT96
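One way to check which physical type was actually written is to read the schema of a single part file with pyarrow (the path below assumes a local copy of part.0.parquet):
import pyarrow.parquet as pq

# print the Parquet schema of one part file to see how utc_timestamp
# was physically stored (INT96 vs int64 / TIMESTAMP_MICROS)
print(pq.read_schema('part.0.parquet'))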
But if I save the dataframe with
dd.to_parquet(ddf_param_logs, 'hdfs:///user/<myuser>/<filename>', engine='pyarrow', use_deprecated_int96_timestamps=True)
the utc_timestamp column contains the timestamp information in Unix format (1578642290403000).
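If that integer is microseconds since the Unix epoch (which 1578642290403000 looks like), it can be decoded manually on the Spark side, but I would prefer the column to come back as a proper timestamp directly. Decoding sketch, assuming microsecond resolution:
from pyspark.sql import functions as F

# workaround sketch only: interpret the raw integer as microseconds since
# the epoch and cast it back to a timestamp
sdf = spark.read.parquet('hdfs:///user/<myuser>/<filename>')
sdf = sdf.withColumn('utc_timestamp',
                     (F.col('utc_timestamp') / 1000000).cast('timestamp'))
sdf.show(truncate=False)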
This is my environment:
dask==2.9.0
dask-core==2.9.0
pandas==0.23.4
pyarrow==0.15.1
pyspark==2.4.3