I have a piece of Scala code that counts signals at three different stages with respect to an id_no and an identifier. The output of the code is shown below.

+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|id_no|identifier|signal01_total|signal01_without_NaN|signal01_total_valid|signal02_total|signal02_without_NaN|signal02_total_valid|signal03_total|signal03_without_NaN|signal03_total_valid|load_timestamp              |
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|050  |ident01   |25            |23                  |20                  |45            |43                  |40                  |66            |60                  |55                  |2021-08-10T16:58:30.054+0000|
|051  |ident01   |78            |70                  |68                  |15            |14                  |14                  |10            |10                  |9                   |2021-08-10T16:58:30.054+0000|
|052  |ident01   |88            |88                  |86                  |75            |73                  |70                  |16            |13                  |13                  |2021-08-10T16:58:30.054+0000|
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+

There will be more than 100 signals, so the number of columns will exceed 300.

This DataFrame is written to the Delta table location as shown below.

statisticsDf.write.format("delta").option("mergeSchema", "true").mode("append").partitionBy("id_no").save(statsDestFolderPath)

For the next week's data, I executed this code again and got the data shown below.

+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|id_no|identifier|signal01_total|signal01_without_NaN|signal01_total_valid|signal02_total|signal02_without_NaN|signal02_total_valid|signal03_total|signal03_without_NaN|signal03_total_valid|load_timestamp              |
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|050  |ident01   |10            |8                   |7                   |15            |15                  |14                  |38            |38                  |37                  |2021-08-10T16:58:30.054+0000|
|051  |ident01   |10            |10                  |9                   |16            |15                  |15                  |30            |30                  |30                  |2021-08-10T16:58:30.054+0000|
|052  |ident01   |26            |24                  |24                  |24            |23                  |23                  |40            |38                  |36                  |2021-08-10T16:58:30.054+0000|
|053  |ident01   |25            |24                  |23                  |20            |19                  |19                  |25            |25                  |24                  |2021-08-10T16:58:30.054+0000|
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+

But the output I expect is: if the id_no, identifier, and signal name are already present in the table, the new counts should be added to the existing counts; if the id_no, identifier, and signal name are new, a new row should be added to the final table.

The output I currently receive is shown below, where data gets appended on each run.

+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|id_no|identifier|signal01_total|signal01_without_NaN|signal01_total_valid|signal02_total|signal02_without_NaN|signal02_total_valid|signal03_total|signal03_without_NaN|signal03_total_valid|load_timestamp              |
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|050  |ident01   |25            |23                  |20                  |45            |43                  |40                  |66            |60                  |55                  |2021-08-10T16:58:30.054+0000|
|051  |ident01   |78            |70                  |68                  |15            |14                  |14                  |10            |10                  |9                   |2021-08-10T16:58:30.054+0000|
|052  |ident01   |88            |88                  |86                  |75            |73                  |70                  |16            |13                  |13                  |2021-08-10T16:58:30.054+0000|
|050  |ident01   |10            |8                   |7                   |15            |15                  |14                  |38            |38                  |37                  |2021-08-10T16:58:30.054+0000|
|051  |ident01   |10            |10                  |9                   |16            |15                  |15                  |30            |30                  |30                  |2021-08-10T16:58:30.054+0000|
|052  |ident01   |26            |24                  |24                  |24            |23                  |23                  |40            |38                  |36                  |2021-08-10T16:58:30.054+0000|
|053  |ident01   |25            |24                  |23                  |20            |19                  |19                  |25            |25                  |24                  |2021-08-10T16:58:30.054+0000|
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+

But the output I expect is shown below.

+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|id_no|identifier|signal01_total|signal01_without_NaN|signal01_total_valid|signal02_total|signal02_without_NaN|signal02_total_valid|signal03_total|signal03_without_NaN|signal03_total_valid|load_timestamp              |
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|050  |ident01   |35            |31                  |27                  |60            |58                  |54                  |38            |38                  |37                  |2021-08-10T16:58:30.054+0000|
|051  |ident01   |88            |80                  |77                  |31            |29                  |19                  |30            |30                  |30                  |2021-08-10T16:58:30.054+0000|
|052  |ident01   |114           |102                 |110                 |99            |96                  |93                  |40            |38                  |36                  |2021-08-10T16:58:30.054+0000|
|053  |ident01   |25            |24                  |23                  |20            |19                  |19                  |25            |25                  |24                  |2021-08-10T16:58:30.054+0000|
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+

I got a hint to use the Delta merge (upsert) command, as shown below.

import io.delta.tables.DeltaTable

val updatesDF = ...  // define the updates DataFrame[id_no, identifier, sig01_total, sig01_NaN, sig01_final, sig02_total, .......]

DeltaTable.forPath(spark, "/data/events/")
  .as("events")
  .merge(
    updatesDF.as("updates"),
    // the merge condition must be a single SQL expression string
    "events.id_no = updates.id_no AND events.identifier = updates.identifier")
  .whenMatched
  .updateExpr(
    Map(
      "sig01_total" -> "updates.sig01_total"
      // ........ one "column" -> "expression" entry per signal column
    ))
  .whenNotMatched
  .insertExpr(
    Map(
      "id_no" -> "updates.id_no",
      "identifier" -> "updates.identifier",
      "sig01_total" -> "updates.sig01_total"
      // ........ one entry per remaining signal column
    ))
  .execute()

But in my case the number of columns may vary on each run: if a new signal is added for an id, its columns have to be added as well. And if one of the signals for an existing id is not available in the current week's run, that signal's values alone should stay the same while the rest are updated.

Is there any way to achieve this requirement using Delta table merge, by updating the above code, or by any other means? Any leads appreciated!


1 Answer

The use case mentioned in the question needs an upsert operation.

You can follow the Databricks documentation for the upsert (merge) operation, which shows how to write the logic to perform it.

You can control when to insert and when to update based on an expression, as sketched below.
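For example, here is a minimal sketch of that control, reusing updatesDF and statsDestFolderPath from the question; the timestamp condition is only an illustrative assumption. Both whenMatched and whenNotMatched accept an optional condition expression:

import io.delta.tables.DeltaTable

DeltaTable.forPath(spark, statsDestFolderPath)
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.id_no = updates.id_no AND events.identifier = updates.identifier")
  // update matched rows only when the incoming batch is at least as new
  .whenMatched("updates.load_timestamp >= events.load_timestamp")
  .updateAll()
  .whenNotMatched
  .insertAll()
  .execute()

Note that updateAll() replaces the stored values; summing the counts, as this question requires, needs updateExpr instead (see the comments below).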

Reference: https://docs.databricks.com/delta/delta-update.html#upsert-into-a-table-using-merge

  • Yes, this should work, but for me all columns except the first two are not static; the number of columns varies for each run. How should we implement this logic then? – Antony Aug 12 '21 at 08:12
  • You need to prepare the maps dynamically by looking into the schema of updatesDF and the existing Delta table, and generate the maps based on the columns; see the sketch below the comments. – Ramdev Sharma Aug 12 '21 at 08:35
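For reference, here is a minimal sketch of that idea. It assumes the key columns are id_no and identifier, that updatesDF carries all signal columns for the current run, and that any brand-new signal columns already exist in the target table (for example by enabling spark.databricks.delta.schema.autoMerge.enabled before the merge); the helper name upsertStats is hypothetical:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: builds the updateExpr/insertExpr maps from whatever
// columns the current run's DataFrame contains, so the merge adapts when
// signals are added or missing.
def upsertStats(spark: SparkSession, updatesDF: DataFrame, tablePath: String): Unit = {
  val keyCols = Seq("id_no", "identifier")
  val signalCols = updatesDF.columns
    .filterNot(c => keyCols.contains(c) || c == "load_timestamp")

  // When matched: add this week's counts to the stored counts. coalesce
  // guards against NULLs in columns that were only recently added.
  // Signal columns missing from this run stay untouched on matched rows,
  // because they simply do not appear in this map.
  val updateMap = signalCols.map { c =>
    c -> s"coalesce(events.$c, 0) + coalesce(updates.$c, 0)"
  }.toMap + ("load_timestamp" -> "updates.load_timestamp")

  // When not matched: insert every column the updates DataFrame carries.
  val insertMap = updatesDF.columns.map(c => c -> s"updates.$c").toMap

  DeltaTable.forPath(spark, tablePath)
    .as("events")
    .merge(
      updatesDF.as("updates"),
      "events.id_no = updates.id_no AND events.identifier = updates.identifier")
    .whenMatched
    .updateExpr(updateMap)
    .whenNotMatched
    .insertExpr(insertMap)
    .execute()
}

It would be called as upsertStats(spark, statisticsDf, statsDestFolderPath) in place of the plain append from the question.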