I am trying to write a PySpark DataFrame to S3 in Hudi parquet format. Everything is working fine; however, the timestamps are being written in a binary format. I would like to write them as the Hive timestamp type so that I can query the data in Athena.
My PySpark config is as follows:
from pyspark import SparkConf

LOCAL_SPARK_CONF = (
    SparkConf()
    .set(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.2.2,org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.1,org.apache.spark:spark-avro_2.12:3.0.2",
    )
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.sql.hive.convertMetastoreParquet", "false")
)
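For context, this conf is applied when building the session, roughly like this (the app name is just a placeholder, not part of my real setup):

from pyspark.sql import SparkSession

# Build the session from the conf above; "hudi-timestamp-test"
# is a placeholder app name.
spark = (
    SparkSession.builder
    .appName("hudi-timestamp-test")
    .config(conf=LOCAL_SPARK_CONF)
    .getOrCreate()
)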
My Hudi options are as follows:
hudi_options = {
    "hoodie.table.name": hudi_table,
    "hoodie.datasource.write.recordkey.field": "hash",
    "hoodie.datasource.write.partitionpath.field": "version, date",
    "hoodie.datasource.write.table.name": hudi_table,
    "hoodie.datasource.hive_sync.support_timestamp": "true",
    "hoodie.parquet.outputtimestamptype": "TIMESTAMP_MILLIS",
    # Required to ensure we upsert a record even if the partition changes
    "hoodie.index.type": "GLOBAL_BLOOM",
    "hoodie.bloom.index.update.partition.path": "true",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "data_timestamp",
    "hoodie.upsert.shuffle.parallelism": 2,
    "hoodie.insert.shuffle.parallelism": 2,
}
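In case it is relevant: Spark itself also has a session-level parquet timestamp setting, which I understand defaults to INT96. I am not certain how it interacts with the Hudi option above, but setting it would look like this:

# Spark-level parquet timestamp type (default INT96, which shows
# up as a binary/int96 physical type); assumption: this may need
# to agree with hoodie.parquet.outputtimestamptype.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")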
From reading the documentation, "hoodie.datasource.hive_sync.support_timestamp": "true" should preserve Hive timestamps, and "hoodie.parquet.outputtimestamptype": "TIMESTAMP_MILLIS" should control the parquet output type. However, when I subsequently check the data, the column is still a binary timestamp. How can I avoid this?
I write the data as follows:
df.write.format("org.apache.hudi").options(**hudi_options).mode("overwrite").save(basePath)
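For reference, this is a minimal way to reproduce the check, reading the table back through the same Hudi datasource (basePath as in the write above):

# Read the table back and inspect the schema; data_timestamp
# comes back as binary rather than timestamp.
readback = spark.read.format("org.apache.hudi").load(basePath)
readback.printSchema()
readback.select("data_timestamp").show(5, truncate=False)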
I have also tried converting the column to a string and writing it as a string type, but it still ends up as binary:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

df = df.withColumn("data_timestamp", col("data_timestamp").cast(StringType()))
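For completeness, a sketch of the opposite direction, forcing the column to a genuine TimestampType before the write (the format string is just an example; my real source layout may differ):

from pyspark.sql.functions import col, to_timestamp

# Cast to a true TimestampType before writing; the format string
# "yyyy-MM-dd HH:mm:ss" is an example, not my actual source format.
df = df.withColumn("data_timestamp", to_timestamp(col("data_timestamp"), "yyyy-MM-dd HH:mm:ss"))
df.select("data_timestamp").printSchema()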