I am trying to write a PySpark DataFrame to S3 in Hudi parquet format. Everything is working fine; however, the timestamps are being written in a binary format. I would like to write them as the Hive timestamp type so that I can query the data in Athena.
My PySpark config is as follows:
from pyspark import SparkConf

LOCAL_SPARK_CONF = (
    SparkConf()
    .set(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.2.2,org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.1,org.apache.spark:spark-avro_2.12:3.0.2",
    )
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.sql.hive.convertMetastoreParquet", "false")
)
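For context, this conf is applied when building the session, roughly like this (the app name is just a placeholder, not part of my real setup):

from pyspark.sql import SparkSession

# Build the session from the conf above; "hudi-timestamp-test"
# is a placeholder app name.
spark = (
    SparkSession.builder
    .appName("hudi-timestamp-test")
    .config(conf=LOCAL_SPARK_CONF)
    .getOrCreate()
)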
My Hudi options are as follows:
hudi_options = {
    "hoodie.table.name": hudi_table,
    "hoodie.datasource.write.recordkey.field": "hash",
    "hoodie.datasource.write.partitionpath.field": "version, date",
    "hoodie.datasource.write.table.name": hudi_table,
    "hoodie.datasource.hive_sync.support_timestamp": "true",
    "hoodie.parquet.outputtimestamptype": "TIMESTAMP_MILLIS",
    # Required to ensure we upsert a record even if the partition changes
    "hoodie.index.type": "GLOBAL_BLOOM",
    "hoodie.bloom.index.update.partition.path": "true",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "data_timestamp",
    "hoodie.upsert.shuffle.parallelism": 2,
    "hoodie.insert.shuffle.parallelism": 2,
}
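In case it is relevant: Spark itself also has a session-level parquet timestamp setting, which I understand defaults to INT96. I am not certain how it interacts with the Hudi option above, but setting it would look like this:

# Spark-level parquet timestamp type (default INT96, which shows
# up as a binary/int96 physical type); assumption: this may need
# to agree with hoodie.parquet.outputtimestamptype.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")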
From reading the documentation, "hoodie.datasource.hive_sync.support_timestamp": "true" should preserve Hive timestamps, and "hoodie.parquet.outputtimestamptype": "TIMESTAMP_MILLIS" should control the parquet output type. However, when I subsequently check the data, the column is still a binary timestamp. How can I avoid this?
I write the data as follows:
df.write.format("org.apache.hudi").options(**hudi_options).mode("overwrite").save(basePath)
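For reference, this is a minimal way to reproduce the check, reading the table back through the same Hudi datasource (basePath as in the write above):

# Read the table back and inspect the schema; data_timestamp
# comes back as binary rather than timestamp.
readback = spark.read.format("org.apache.hudi").load(basePath)
readback.printSchema()
readback.select("data_timestamp").show(5, truncate=False)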
I have also tried converting the column to a string and writing it as a string type, but it still ends up as binary:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

df = df.withColumn("data_timestamp", col("data_timestamp").cast(StringType()))
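For completeness, a sketch of the opposite direction, forcing the column to a genuine TimestampType before the write (the format string is just an example; my real source layout may differ):

from pyspark.sql.functions import col, to_timestamp

# Cast to a true TimestampType before writing; the format string
# "yyyy-MM-dd HH:mm:ss" is an example, not my actual source format.
df = df.withColumn("data_timestamp", to_timestamp(col("data_timestamp"), "yyyy-MM-dd HH:mm:ss"))
df.select("data_timestamp").printSchema()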