I'm trying to apply some updates to an Iceberg table using pyspark. The original data in the table is:
+-------------------+---+---+
|               time|  A|  B|
+-------------------+---+---+
|2022-12-01 00:00:00|  1|  6|
|2022-12-02 00:00:00|  2|  7|
|2022-12-03 00:00:00|  3|  8|
|2022-12-04 00:00:00|  4|  9|
|2022-12-05 00:00:00|  5| 10|
+-------------------+---+---+
And the update (stored as a temporary view) is:
+-------------------+---+---+
|               time|  A|  C|
+-------------------+---+---+
|2022-12-04 00:00:00| 40| 90|
|2022-12-05 00:00:00| 50|100|
+-------------------+---+---+
I'd like to end up with:
+-------------------+----+---+----+
|               time|   A|  B|   C|
+-------------------+----+---+----+
|2022-12-01 00:00:00|   1|  6|null|
|2022-12-02 00:00:00|   2|  7|null|
|2022-12-03 00:00:00|   3|  8|null|
|2022-12-04 00:00:00|  40|  9|  90|
|2022-12-05 00:00:00|  50| 10| 100|
+-------------------+----+---+----+
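In case it helps, the setup looks roughly like this (a minimal sketch, assuming an Iceberg catalog is already configured and the db namespace exists; the real data comes from elsewhere):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog already configured

# Original Iceberg table
spark.sql("CREATE TABLE db.data (time timestamp, A int, B int) USING iceberg")
spark.sql("""
    INSERT INTO db.data VALUES
    (timestamp '2022-12-01 00:00:00', 1, 6),
    (timestamp '2022-12-02 00:00:00', 2, 7),
    (timestamp '2022-12-03 00:00:00', 3, 8),
    (timestamp '2022-12-04 00:00:00', 4, 9),
    (timestamp '2022-12-05 00:00:00', 5, 10)
""")

# The updates, registered as the temporary view referenced in the MERGE below
update = spark.createDataFrame(
    [("2022-12-04 00:00:00", 40, 90), ("2022-12-05 00:00:00", 50, 100)],
    "time string, A int, C int",
).withColumn("time", F.to_timestamp("time"))
update.createOrReplaceTempView("update")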
As per the docs, I've tried the query:
spark.sql("""
    MERGE INTO db.data d USING update u ON d.time = u.time
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
but it fails because the update doesn't contain column B. And even if the update did contain column B, column C still wouldn't appear in the result, because it isn't part of the original table's schema. Is there anything I can do to get the behaviour I'm after?
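Presumably I could evolve the schema by hand and then spell out every column in the MERGE, along the lines of the sketch below, but that's exactly the boilerplate I was hoping SET * / INSERT * would save me:

# Manual workaround sketch: add the new column first, then list every column explicitly.
spark.sql("ALTER TABLE db.data ADD COLUMNS (C int)")
spark.sql("""
    MERGE INTO db.data d USING update u ON d.time = u.time
    WHEN MATCHED THEN UPDATE SET d.A = u.A, d.C = u.C
    WHEN NOT MATCHED THEN INSERT (time, A, C) VALUES (u.time, u.A, u.C)
""")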
Thanks for any help.