
I am using Apache Hudi version 0.12.0 in AWS Glue version 4.0. I am trying to get my table partitioned by year and month (yyyy-MM), and I cannot get this to work.

Here is the code in my Glue Job:

import time

import boto3
from awsglue.context import GlueContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

base_s3_path = "s3://my_bucket/Path/To/Files"
database_name1 = "my_database"
source_table = "my_source_table"
table_name = 'my_target_table'

final_base_path = "{base_s3_path}/{table_name}".format(
    base_s3_path=base_s3_path, table_name=table_name
)

spark = SparkSession.builder.config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer').getOrCreate()
sc = spark.sparkContext
glueContext = GlueContext(sc)

hudi_options = {
    "hoodie.insert.shuffle.parallelism": "2",
    "hoodie.upsert.shuffle.parallelism": "2",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.recordkey.field": "myrecordkeycolumn",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.TimestampBasedKeyGenerator",
    "hoodie.deltastreamer.keygen.timebased.timestamp.type": "UNIX_TIMESTAMP",
    "hoodie.deltastreamer.keygen.timebased.timezone": "GMT+8:00",
    "hoodie.deltastreamer.keygen.timebased.output.dateformat": "yyyy-MM",
    "hoodie.datasource.write.partitionpath.field": "mydatefield",
    "hoodie.table.name": table_name
}

client = boto3.client('glue')

source_table_path = get_table_s3_path(database_name=database_name1, table_name=source_table)
df = glueContext.create_dynamic_frame.from_options(connection_type='s3',
                                                   connection_options={
                                                       "paths": ['s3://my_bucket/Path/To/Soucre/Files']
                                                   },
                                                   format="parquet").toDF()

df = df.withColumn("ts", current_timestamp())
startTime = time.time()
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(final_base_path)

mydatefield is a timestamp column with values like 2019-11-13 00:00:00.000. I am trying to partition based on this field.
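
For reference, this is roughly how I'm looking at the column on the source side (same df as above):

# Check how Spark reads the source column before the Hudi write.
df.select("mydatefield").printSchema()   # shows: mydatefield: timestamp (nullable = true)
df.select("mydatefield").show(3, truncate=False)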

I do not get any errors when I run the code, but the partitioning does not look right in S3. The paths look like this:

s3://my_bucket/Path/To/Files/my_target_table/35433224-07-30/
s3://my_bucket/Path/To/Files/my_target_table/36862412-01-14/

I want them to look like this:

s3://my_bucket/Path/To/Files/my_target_table/2019-11
s3://my_bucket/Path/To/Files/my_target_table/2019-12

What seems to be happening is that Hudi is turning my "mydatefield" column into a bigint. I double checked that the column is read into the Spark dataframe as a timestamp, but on the write it is converted to a bigint.
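
This is roughly how I'm checking the written table (reading the Hudi table back with the same Spark session):

# Read the Hudi table back and inspect what mydatefield became after the write.
written = spark.read.format("hudi").load(final_base_path)
written.select("mydatefield", "_hoodie_partition_path").printSchema()
written.select("mydatefield", "_hoodie_partition_path").show(5, truncate=False)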

I tried setting "mydatefield" to timestamp on "my_target_table" in the Glue Catalog, but this did not change the partitioning behavior.
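
For completeness, the catalog change I tried is equivalent to something like this boto3 call (same database and table names as above), in case the exact call matters:

# Pull the current table definition from the Glue Catalog.
glue = boto3.client("glue")
resp = glue.get_table(DatabaseName=database_name1, Name=table_name)
table = resp["Table"]

# Set mydatefield to timestamp in the storage descriptor columns.
for column in table["StorageDescriptor"]["Columns"]:
    if column["Name"] == "mydatefield":
        column["Type"] = "timestamp"

# update_table only accepts TableInput fields, so drop the read-only keys
# (CreateTime, UpdateTime, DatabaseName, etc.) before sending it back.
allowed = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}
glue.update_table(
    DatabaseName=database_name1,
    TableInput={k: v for k, v in table.items() if k in allowed},
)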

I see there is a hoodie.table.create.schema config in the Apache Hudi documentation (https://hudi.apache.org/docs/0.11.1/configurations/#hoodietablecreateschema), but I cannot find any additional information on how to use it. I'd like to try defining "mydatefield" as a timestamp with it.
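
From the name I'm guessing it takes an Avro schema string, so this is what I was planning to try, but I have no idea if this is the right shape or even the right place to set it:

# Untested guess: an Avro schema string where mydatefield is a timestamp-micros
# logical type. Field list trimmed to the columns mentioned in this question.
create_schema = """
{
  "type": "record",
  "name": "my_target_table_record",
  "fields": [
    {"name": "myrecordkeycolumn", "type": ["null", "string"], "default": null},
    {"name": "mydatefield",
     "type": ["null", {"type": "long", "logicalType": "timestamp-micros"}],
     "default": null},
    {"name": "ts",
     "type": ["null", {"type": "long", "logicalType": "timestamp-micros"}],
     "default": null}
  ]
}
"""
hudi_options["hoodie.table.create.schema"] = create_schema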

Has anyone else ever run into this? Any ideas?

  • Why not add a column to your DF with the formatting you want and specify it as the Hudi partition column? See https://hudi.apache.org/docs/0.11.1/configurations/#hoodietablepartitionfields – parisni Jun 18 '23 at 21:39
  • @parisni Thank you for the response. Yes, I will likely end up needing to do this, but it seems like this functionality should just work. I am following this article almost exactly: https://medium.com/@simpsons/primary-key-and-partition-generators-with-apache-hudi-f0e4d71d9d26 – cjf280830 Jun 19 '23 at 12:33
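
For reference, this is the kind of derived-column workaround parisni is suggesting, as far as I understand it (untested; partition_month is just a made-up column name):

from pyspark.sql.functions import date_format

# Derive a yyyy-MM string column and partition on that instead of relying on
# TimestampBasedKeyGenerator to format the raw timestamp.
df = df.withColumn("partition_month", date_format("mydatefield", "yyyy-MM"))

workaround_options = dict(hudi_options)
workaround_options["hoodie.datasource.write.partitionpath.field"] = "partition_month"
workaround_options["hoodie.datasource.write.keygenerator.class"] = \
    "org.apache.hudi.keygen.SimpleKeyGenerator"
# The timebased settings are no longer needed with a pre-formatted string column.
for key in [k for k in workaround_options if "timebased" in k]:
    del workaround_options[key]

df.write.format("hudi").options(**workaround_options).mode("overwrite").save(final_base_path)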

0 Answers