PySpark Delete Specific Rows Without Primary Key

Question

I am trying to make up for PySpark's lack of 'WHEN NOT MATCHED BY SOURCE' in MERGE statements. I want to delete rows from a table when there are no matching rows in another table. However, my table doesn't have a primary key, so I'm struggling to specify which rows to delete. I can select the rows I want to delete with the following statement:

SELECT *
FROM global_transations AS target
LEFT ANTI JOIN latest_transactions AS source
  ON source.Period = target.Period
  AND source.Region = target.Region
WHERE target.Period IN (SELECT DISTINCT Period FROM latest_transactions)

However, I don't know how to delete these rows from the global_transations table.

I have tried the following:

DELETE FROM global_transations WHERE EXISTS(
SELECT *
FROM global_transations AS target
LEFT ANTI JOIN latest_transactions AS source
  ON source.Period = target.Period
  AND source.Region = target.Region
WHERE source.Period IN ('Q4-2021', 'Q4-2022')

But this deletes ALL rows in the global_transactions table rather than just the specified rows. Databricks also prevents me from using the 'SELECT DISTINCT Period FROM latest_transactions' in the last line, as this is a sub-query too many for Databricks. This feels like it should be a simple process--what am I missing?

Daniel · Answer 1 · 2022-11-12T13:05:17.393

I figured out a two-step solution which works:

Save the ANTI JOIN query to a temp table anti_join
Perform a second merge on global_transactions which deletes rows when matched with the anti_join table.

The ANTI JOIN saved as temp table:

anti_join = spark.sql("""
  SELECT *
  FROM global_transactions AS target
  LEFT ANTI JOIN latest_transactions AS source
    ON source.Period = target.Period
    AND source.Region = target.Region
  WHERE target.Period IN (SELECT DISTINCT Period FROM latest_transactions)
""")

anti_join.createOrReplaceTempView('anti_join')

The second merge:

%sql
MERGE INTO global_transations AS target
  USING anti_join AS source
      ON target.Period = source.Period
      AND target.Region = source.Region
  WHEN MATCHED THEN DELETE

PySpark Delete Specific Rows Without Primary Key

1 Answers1