I am using Apache Hudi 0.12.0 on AWS Glue 4.0. I am trying to get my table partitioned by year and month, and I cannot get this to work.
Here is the code in my Glue Job:
import time

import boto3
from awsglue.context import GlueContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

base_s3_path = "s3://my_bucket/Path/To/Files"
database_name1 = "my_database"
source_table = "my_source_table"
table_name = "my_target_table"
final_base_path = "{base_s3_path}/{table_name}".format(
    base_s3_path=base_s3_path, table_name=table_name
)

spark = SparkSession.builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").getOrCreate()
sc = spark.sparkContext
glueContext = GlueContext(sc)

hudi_options = {
    "hoodie.insert.shuffle.parallelism": "2",
    "hoodie.upsert.shuffle.parallelism": "2",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.recordkey.field": "myrecordkeycolumn",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.TimestampBasedKeyGenerator",
    "hoodie.deltastreamer.keygen.timebased.timestamp.type": "UNIX_TIMESTAMP",
    "hoodie.deltastreamer.keygen.timebased.timezone": "GMT+8:00",
    "hoodie.deltastreamer.keygen.timebased.output.dateformat": "yyyy-MM",
    "hoodie.datasource.write.partitionpath.field": "mydatefield",
    "hoodie.table.name": table_name
}

client = boto3.client("glue")
# get_table_s3_path is a small helper of mine that looks up a table's S3 location in the Glue Catalog
source_table_path = get_table_s3_path(database_name=database_name1, table_name=source_table)

# Read the source parquet files and add the precombine column
df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my_bucket/Path/To/Source/Files"]},
    format="parquet"
).toDF()
df = df.withColumn("ts", current_timestamp())

startTime = time.time()
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(final_base_path)
mydatefield is a timestamp column with values like 2019-11-13 00:00:00.000, and it is the field I am trying to partition on.
The job runs without errors, but the partitioning in S3 does not look right. The paths look like this:
s3://my_bucket/Path/To/Files/my_target_table/35433224-07-30/
s3://my_bucket/Path/To/Files/my_target_table/36862412-01-14/
I want them to look like this:
s3://my_bucket/Path/To/Files/my_target_table/2019-11
s3://my_bucket/Path/To/Files/my_target_table/2019-12
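Just to illustrate the target format (not a fix, only what I expect each partition value to be), the yyyy-MM string I am after can be derived from the column directly in Spark, using the df from the job above:

from pyspark.sql.functions import date_format

# Illustration only: the partition values I expect Hudi to produce from mydatefield
df.select(date_format("mydatefield", "yyyy-MM").alias("expected_partition")).distinct().show(5, truncate=False)
# I expect values like 2019-11 and 2019-12 here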
What seems to be happening is that Hudi is turning my "mydatefield" column into a bigint. I double-checked the Spark DataFrame after reading the source files and the column comes in as a timestamp, but on the write Hudi converts it to a bigint.
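This is roughly how I checked the types (same df, spark, and final_base_path as in the job above):

# Input frame: mydatefield is reported as a timestamp here
df.select("mydatefield").printSchema()

# Reading the written Hudi table back: the same column comes back as bigint/long,
# and _hoodie_partition_path holds the strange far-future dates
written = spark.read.format("hudi").load(final_base_path)
written.select("mydatefield", "_hoodie_partition_path").printSchema()
written.select("mydatefield", "_hoodie_partition_path").show(5, truncate=False)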
I tried setting "mydatefield" to a timestamp on "my_target_table" in the Glue Catalog, but that did not change the partitioning behavior.
I see in the Apache Hudi documentation there is a hoodie.table.create.schema config (https://hudi.apache.org/docs/0.11.1/configurations/#hoodietablecreateschema), but I cannot find any further information on how to use it. I'd like to try defining "mydatefield" as a timestamp with it.
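What I had in mind is something like the following, though I am not sure hoodie.table.create.schema can be passed as a write option like this, or exactly what schema string it expects; the field list and the Avro logical types here are my guess (my real table has more columns):

# Untested guess: pass an explicit Avro schema so mydatefield stays a timestamp
create_schema = """
{
  "type": "record",
  "name": "my_target_table_record",
  "fields": [
    {"name": "myrecordkeycolumn", "type": ["null", "string"], "default": null},
    {"name": "mydatefield", "type": ["null", {"type": "long", "logicalType": "timestamp-micros"}], "default": null},
    {"name": "ts", "type": ["null", {"type": "long", "logicalType": "timestamp-micros"}], "default": null}
  ]
}
"""
hudi_options["hoodie.table.create.schema"] = create_schema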
Has anyone else ever run into this? Any ideas?