I am trying to make up for PySpark's lack of 'WHEN NOT MATCHED BY SOURCE' in MERGE statements. I want to delete rows from a table when there are no matching rows in another table. However, my table doesn't have a primary key, so I'm struggling to specify which rows to delete. I can select the rows I want to delete with the following statement:
SELECT *
FROM global_transations AS target
LEFT ANTI JOIN latest_transactions AS source
ON source.Period = target.Period
AND source.Region = target.Region
WHERE target.Period IN (SELECT DISTINCT Period FROM latest_transactions)
However, I don't know how to delete these rows from the global_transations table.
I have tried the following:
DELETE FROM global_transations WHERE EXISTS(
SELECT *
FROM global_transations AS target
LEFT ANTI JOIN latest_transactions AS source
ON source.Period = target.Period
AND source.Region = target.Region
WHERE source.Period IN ('Q4-2021', 'Q4-2022')
But this deletes ALL rows in the global_transactions table rather than just the specified rows. Databricks also prevents me from using the 'SELECT DISTINCT Period FROM latest_transactions' in the last line, as this is a sub-query too many for Databricks. This feels like it should be a simple process--what am I missing?