I am consuming Kafka data that has an "eventtime" (datetime) field in each packet. In streaming, I want to create HDFS directories in a "year/month/day" structure based on the date part of the eventtime field.
I am using delta-core_2.11:0.6.1 and Spark 2.4.
Example:
/temp/deltalake/data/project_1/2022/12/1
/temp/deltalake/data/project_1/2022/12/2
...and so on.
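For context, this is roughly how the stream is consumed and eventtime parsed; the broker address, topic name, and JSON schema below are placeholders for the real ones. (The data.show() sample further down comes from an equivalent batch read, since show() is not supported on a streaming DataFrame.)

import org.apache.spark.sql.functions.{col, from_json, to_date}
import org.apache.spark.sql.types._

// Placeholder schema for the packet; the real one may have more fields
val packetSchema = new StructType()
  .add("S_No", IntegerType)
  .add("section", StringType)
  .add("Name", StringType)
  .add("City", StringType)
  .add("Age", IntegerType)
  .add("eventtime", TimestampType)

val data = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
  .option("subscribe", "events_topic")              // placeholder topic
  .load()
  .select(from_json(col("value").cast("string"), packetSchema).as("packet"))
  .select("packet.*")
  .withColumn("date", to_date(col("eventtime")))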
The closest thing I found to my requirement is partitionBy(Keys) in the Delta Lake documentation. That writes the data in Hive-style partition directories, i.e. /temp/deltalake/data/project_1/year=2022/month=12/day=1, not the plain year/month/day layout I need. (Keys here stands for year/month/day columns derived from eventtime; see the sketch after the sample below.)
data.show():
+----+-------+-----+-------+---+-------------------+----------+
|S_No|section| Name| City|Age| eventtime| date|
+----+-------+-----+-------+---+-------------------+----------+
| 1| a|Name1| Indore| 25|2022-02-10 23:30:14|2022-02-10|
| 2| b|Name2| Delhi| 25|2021-08-12 10:50:10|2021-08-12|
| 3| c|Name3| Ranchi| 30|2022-12-10 15:00:00|2022-12-10|
| 4| d|Name4|Kolkata| 30|2022-05-10 00:30:00|2022-05-10|
| 5| e|Name5| Mumbai| 30|2022-07-01 10:32:12|2022-07-01|
+----+-------+-----+-------+---+-------------------+----------+
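Keys in the snippet below is a Seq of partition column names derived from eventtime. A minimal sketch of that derivation (the names year/month/day are my choice, mirroring the directory levels I want):

import org.apache.spark.sql.functions.{col, year, month, dayofmonth}

// Derive the partition columns from the eventtime field
val Keys = Seq("year", "month", "day")
val partitioned = data
  .withColumn("year", year(col("eventtime")))
  .withColumn("month", month(col("eventtime")))
  .withColumn("day", dayofmonth(col("eventtime")))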
partitioned
  .write
  .format("delta")
  .mode("overwrite")
  .option("mergeSchema", "true")
  .partitionBy(Keys: _*) // expand the Seq to varargs
  .save("/temp/deltalake/data/project_1/")
// Result: year=2022/month=12/day=1 directories, not 2022/12/1
But this didn't give me the directory structure I want either. I also went through this Medium article: https://medium.com/@aravinthR/partitioned-delta-lake-part-3-5cc52b64ebda
It would be great if anyone could help me figure out a possible solution.