
I'm trying to apply some updates to an Iceberg table using PySpark. The original data in the table is:

+-------------------+---+---+
|               time|  A|  B|
+-------------------+---+---+
|2022-12-01 00:00:00|  1|  6|
|2022-12-02 00:00:00|  2|  7|
|2022-12-03 00:00:00|  3|  8|
|2022-12-04 00:00:00|  4|  9|
|2022-12-05 00:00:00|  5| 10|
+-------------------+---+---+

And the update (stored as a temporary view) is:

+-------------------+---+---+
|               time|  A|  C|
+-------------------+---+---+
|2022-12-04 00:00:00| 40| 90|
|2022-12-05 00:00:00| 50|100|
+-------------------+---+---+

I'd like to end up with:

+-------------------+----+---+----+
|               time|   A|  B|   C|
+-------------------+----+---+----+
|2022-12-01 00:00:00|   1|  6| NaN|
|2022-12-02 00:00:00|   2|  7| NaN|
|2022-12-03 00:00:00|   3|  8| NaN|
|2022-12-04 00:00:00|  40|  9|  90|
|2022-12-05 00:00:00|  50| 10| 100|
+-------------------+----+---+----+

As per the docs, I've tried the query:

spark.sql("MERGE INTO db.data d USING update u ON d.time = u.time"
          " WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *")

but it fails because the update doesn't contain column B. Also, even if the update did contain column B, column C wouldn't get added in the result, because it isn't in the original table. Is there anything I can do to get the behaviour I'm after?
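In case it helps, the closest I've got is a two-step workaround (I'm not sure it's the intended way): first adding column C to the table's schema, then listing the columns explicitly in the MERGE instead of using SET * / INSERT *. I've assumed C is an int here:

# Add the new column to the Iceberg table first (null for existing rows).
# I'm assuming int is the right type for C.
spark.sql("ALTER TABLE db.data ADD COLUMNS (C int)")

# Then merge, naming only the columns that exist in the update view,
# so B is left untouched on matched rows.
spark.sql("""
    MERGE INTO db.data d USING update u ON d.time = u.time
    WHEN MATCHED THEN UPDATE SET d.A = u.A, d.C = u.C
    WHEN NOT MATCHED THEN INSERT (time, A, C) VALUES (u.time, u.A, u.C)
""")

That appears to give the table above (with null rather than NaN in C for the first three rows), but it means maintaining the column lists by hand, so I'd still like to know whether there's a way to do this with a single schema-evolving MERGE.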

Thanks for any help.

Jack
