I have a piece of Scala code that counts signals at three different stages, grouped by an id_no and an identifier. The output of the code is shown below.
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|id_no|identifier|signal01_total|signal01_without_NaN|signal01_total_valid|signal02_total|signal02_without_NaN|signal02_total_valid|signal03_total|signal03_without_NaN|signal03_total_valid|load_timestamp |
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|050 |ident01 |25 |23 |20 |45 |43 |40 |66 |60 |55 |2021-08-10T16:58:30.054+0000|
|051 |ident01 |78 |70 |68 |15 |14 |14 |10 |10 |9 |2021-08-10T16:58:30.054+0000|
|052 |ident01 |88 |88 |86 |75 |73 |70 |16 |13 |13 |2021-08-10T16:58:30.054+0000|
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
There will be more than 100 signals, so the number of columns will exceed 300.
This DataFrame is written to the Delta table location as shown below.
statisticsDf.write
  .format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .partitionBy("id_no")
  .save(statsDestFolderPath)
For the next week's data I executed the code again and got the result shown below.
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|id_no|identifier|signal01_total|signal01_without_NaN|signal01_total_valid|signal02_total|signal02_without_NaN|signal02_total_valid|signal03_total|signal03_without_NaN|signal03_total_valid|load_timestamp |
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|050 |ident01 |10 |8 |7 |15 |15 |14 |38 |38 |37 |2021-08-10T16:58:30.054+0000|
|051 |ident01 |10 |10 |9 |16 |15 |15 |30 |30 |30 |2021-08-10T16:58:30.054+0000|
|052 |ident01 |26 |24 |24 |24 |23 |23 |40 |38 |36 |2021-08-10T16:58:30.054+0000|
|053 |ident01 |25 |24 |23 |20 |19 |19 |25 |25 |24 |2021-08-10T16:58:30.054+0000|
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
But what I expect is: if the combination of id_no, identifier, and signal name is already present in the table, the new counts should be added to the existing counts; if the combination is new, the row should simply be added to the final table.
The output I actually get is shown below: the data is simply appended on each run.
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|id_no|identifier|signal01_total|signal01_without_NaN|signal01_total_valid|signal02_total|signal02_without_NaN|signal02_total_valid|signal03_total|signal03_without_NaN|signal03_total_valid|load_timestamp |
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|050 |ident01 |25 |23 |20 |45 |43 |40 |66 |60 |55 |2021-08-10T16:58:30.054+0000|
|051 |ident01 |78 |70 |68 |15 |14 |14 |10 |10 |9 |2021-08-10T16:58:30.054+0000|
|052 |ident01 |88 |88 |86 |75 |73 |70 |16 |13 |13 |2021-08-10T16:58:30.054+0000|
|050 |ident01 |10 |8 |7 |15 |15 |14 |38 |38 |37 |2021-08-10T16:58:30.054+0000|
|051 |ident01 |10 |10 |9 |16 |15 |15 |30 |30 |30 |2021-08-10T16:58:30.054+0000|
|052 |ident01 |26 |24 |24 |24 |23 |23 |40 |38 |36 |2021-08-10T16:58:30.054+0000|
|053 |ident01 |25 |24 |23 |20 |19 |19 |25 |25 |24 |2021-08-10T16:58:30.054+0000|
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
But the output I expect is shown below.
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|id_no|identifier|signal01_total|signal01_without_NaN|signal01_total_valid|signal02_total|signal02_without_NaN|signal02_total_valid|signal03_total|signal03_without_NaN|signal03_total_valid|load_timestamp |
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|050 |ident01 |35 |31 |27 |60 |58 |54 |38 |38 |37 |2021-08-10T16:58:30.054+0000|
|051 |ident01 |88 |80 |77 |31 |29 |19 |30 |30 |30 |2021-08-10T16:58:30.054+0000|
|052 |ident01 |114 |102 |110 |99 |96 |93 |40 |38 |36 |2021-08-10T16:58:30.054+0000|
|053 |ident01 |25 |24 |23 |20 |19 |19 |25 |25 |24 |2021-08-10T16:58:30.054+0000|
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
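(To make the arithmetic explicit: for id_no 050 the merged signal01_total is 25 + 10 = 35 and the merged signal03_total is 66 + 38 = 104, while the row for the new id_no 053 is inserted unchanged.)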
I got a hint to use the Delta merge (upsert) command, as below.
import io.delta.tables.DeltaTable

val updatesDF = ... // define the updates DataFrame[id_no, identifier, sig01_total, sig01_NaN, sig01_final, sig02_total, .......]

DeltaTable.forPath(spark, "/data/events/")
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.id_no = updates.id_no AND events.identifier = updates.identifier")
  .whenMatched
  .updateExpr(
    Map(
      "sig01_total" -> "updates.sig01_total"
      // ... one entry per signal column ...
    ))
  .whenNotMatched
  .insertExpr(
    Map(
      "id_no" -> "updates.id_no",
      "identifier" -> "updates.identifier",
      "sig01_total" -> "updates.sig01_total"
      // ... one entry per signal column ...
    ))
  .execute()
But in my case the number of columns may vary from run to run: if a new signal appears for an id, it has to be added to the table as well. And if one of the signals for an existing id is not available in the current week's data, that signal's values alone should stay the same while the rest are updated.
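One idea I had, sketched below, is to build the updateExpr/insertExpr maps dynamically from whatever columns the current week's DataFrame actually contains. This is only a sketch and not tested end to end: buildMergeMaps is a hypothetical helper name of my own, and I am assuming that coalescing a missing side to 0 gives the additive behaviour I described (statisticsDf and statsDestFolderPath are the same as in the write above).

import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// Hypothetical helper: derive the merge expression maps from whatever
// columns are present in this week's statistics DataFrame.
def buildMergeMaps(updatesDF: DataFrame): (Map[String, String], Map[String, String]) = {
  val keyCols = Seq("id_no", "identifier")
  val countCols = updatesDF.columns
    .filterNot(c => keyCols.contains(c) || c == "load_timestamp")

  // When matched: add this week's count to the stored count; coalesce
  // treats a value missing on either side as 0.
  val updateMap = countCols.map { c =>
    c -> s"coalesce(events.`$c`, 0) + coalesce(updates.`$c`, 0)"
  }.toMap + ("load_timestamp" -> "updates.load_timestamp")

  // When not matched: insert the incoming row as-is, keys included.
  val insertMap = updatesDF.columns.map(c => c -> s"updates.`$c`").toMap
  (updateMap, insertMap)
}

val (updateMap, insertMap) = buildMergeMaps(statisticsDf)

DeltaTable.forPath(spark, statsDestFolderPath)
  .as("events")
  .merge(
    statisticsDf.as("updates"),
    "events.id_no = updates.id_no AND events.identifier = updates.identifier")
  .whenMatched.updateExpr(updateMap)
  .whenNotMatched.insertExpr(insertMap)
  .execute()

A signal that is missing from the current week is simply not listed in updateMap, so its stored values should stay untouched, which is what I want. What I am not sure about is the opposite direction: for signal columns that are new to the target table I believe schema evolution has to be enabled (spark.databricks.delta.schema.autoMerge.enabled), but I don't know whether that works with explicit updateExpr/insertExpr maps or only with updateAll/insertAll.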
Is there any way to achieve this requirement with a Delta table merge, by fixing up the code or the sketch above, or by some other approach? Any leads appreciated!