I am consuming Kafka data that has an "eventtime" (datetime) field in each packet. I want to create HDFS directories in a "year/month/day" structure, in streaming, based on the date part of the eventtime field.

I am using delta-core_2.11:0.6.1 with Spark 2.4.

Example:

    /temp/deltalake/data/project_1/2022/12/1
    /temp/deltalake/data/project_1/2022/12/2
    .
    .
    and so on.
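In other words, the directory name is just the date part of eventtime rendered without zero padding. A minimal illustration of that mapping (`df` here is any DataFrame with the eventtime column, and "dir" is just a hypothetical helper column for demonstration):

    import org.apache.spark.sql.functions.{col, date_format}

    // eventtime 2022-12-01 10:00:00 -> dir "2022/12/1"
    val withDir = df.withColumn("dir", date_format(col("eventtime"), "yyyy/M/d"))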

The closest thing I found to my requirement was partitionBy(Keys) in the Delta Lake documentation.

That creates the data in this format: /temp/deltalake/data/project_1/year=2022/month=12/day=1, i.e. Hive-style key=value directories rather than the plain year/month/day layout I need.

data.show():
+----+-------+-----+-------+---+-------------------+----------+
|S_No|section| Name|   City|Age|          eventtime|      date|
+----+-------+-----+-------+---+-------------------+----------+
|   1|      a|Name1| Indore| 25|2022-02-10 23:30:14|2022-02-10|
|   2|      b|Name2|  Delhi| 25|2021-08-12 10:50:10|2021-08-12|
|   3|      c|Name3| Ranchi| 30|2022-12-10 15:00:00|2022-12-10|
|   4|      d|Name4|Kolkata| 30|2022-05-10 00:30:00|2022-05-10|
|   5|      e|Name5| Mumbai| 30|2022-07-01 10:32:12|2022-07-01|
+----+-------+-----+-------+---+-------------------+----------+
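For context, the stream is read and parsed roughly like this (the broker address, topic name, and schema below are placeholders; the data shown above is a static sample with the same schema that I used while testing the write):

    import org.apache.spark.sql.functions.{col, from_json, to_date}
    import org.apache.spark.sql.types._

    // Placeholder schema matching the sample rows above
    val schema = StructType(Seq(
      StructField("S_No", IntegerType),
      StructField("section", StringType),
      StructField("Name", StringType),
      StructField("City", StringType),
      StructField("Age", IntegerType),
      StructField("eventtime", TimestampType)
    ))

    val streamingData = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")  // placeholder
      .option("subscribe", "project_1_topic")          // placeholder
      .load()
      .select(from_json(col("value").cast("string"), schema).as("payload"))
      .select("payload.*")
      .withColumn("date", to_date(col("eventtime")))

And this is the write I tried on the static sample: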



    import org.apache.spark.sql.functions.{col, dayofmonth, month, year}

    val Keys = Seq("year", "month", "day")  // partition columns derived from eventtime

    data
      .withColumn("year", year(col("eventtime")))
      .withColumn("month", month(col("eventtime")))
      .withColumn("day", dayofmonth(col("eventtime")))
      .write
      .format("delta")
      .mode("overwrite")
      .option("mergeSchema", "true")
      .partitionBy(Keys: _*)
      .save("/temp/deltalake/data/project_1/")

But this didn't work for my case either: as far as I can tell, partitionBy always writes Hive-style key=value directory names, and I could not find an option to produce plain values. I also referred to this Medium article: https://medium.com/@aravinthR/partitioned-delta-lake-part-3-5cc52b64ebda

It would be great if anyone could help me figure out a possible solution.
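One workaround I have been considering (untested, so treat it as a sketch) is to skip partitionBy and use foreachBatch instead, computing the target directory from eventtime and writing each date's slice of the micro-batch separately. Here streamingData is the streaming DataFrame from the read above, and the checkpoint path is a placeholder:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, date_format}

    val query = streamingData.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Derive the plain "yyyy/M/d" directory name from eventtime
        val withDir = batch.withColumn("dir", date_format(col("eventtime"), "yyyy/M/d"))
        // One write per distinct event date present in this micro-batch
        val dirs = withDir.select("dir").distinct().collect().map(_.getString(0))
        dirs.foreach { dir =>
          withDir.filter(col("dir") === dir)
            .drop("dir")
            .write
            .format("delta")
            .mode("append")
            .save(s"/temp/deltalake/data/project_1/$dir")
        }
      }
      .option("checkpointLocation", "/temp/deltalake/checkpoints/project_1")  // placeholder
      .start()

The obvious drawback is that every date directory then becomes its own independent Delta table instead of a partition of one table, so I am hoping there is a cleaner approach.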
