
I have a PySpark Structured Streaming job that drops duplicate events by session id, with a 30-minute watermark. Snippet:

from pyspark.sql.functions import current_timestamp

# withWatermark must be applied before dropDuplicates, takes a string delay,
# and dropDuplicates expects a list of column names
unique_df = (df.withColumn("timestamp", current_timestamp())
               .withWatermark("timestamp", "30 minutes")
               .dropDuplicates(["session_id"]))

I have two asks. First, I want to capture all the events dropped by the code above. I tried exceptAll and joins, but they did not work because of the streaming nature of the job:

dropped_df = df.exceptAll(unique_df)

And

dropped_events = df.join(unique_df, on="session_id", how="left_anti")
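
To show the kind of split I am after, here is a rough sketch of a per-micro-batch workaround I have been considering with foreachBatch (the sink paths and checkpoint location are made up, and this only deduplicates within each micro-batch, not across batches with watermark state):

from pyspark.sql import functions as F

def split_batch(batch_df, batch_id):
    # Inside foreachBatch the micro-batch is a static DataFrame, so batch
    # operations like anti-joins are allowed. Tag each row with a unique id
    # so the rows discarded by dropDuplicates can be recovered.
    tagged = batch_df.withColumn("_row_id", F.monotonically_increasing_id())
    kept = tagged.dropDuplicates(["session_id"])
    dropped = tagged.join(kept.select("_row_id"), on="_row_id", how="left_anti")
    kept.drop("_row_id").write.mode("append").parquet("/sinks/unique_events")
    dropped.drop("_row_id").write.mode("append").parquet("/sinks/dropped_events")

query = (df.writeStream
           .foreachBatch(split_batch)
           .option("checkpointLocation", "/chk/dedup_split")
           .start())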

Second, suppose session_id=abc123 arrives at 9:00 and, due to some glitch, the same session_id=abc123 appears again at 11:00. Because it falls outside the watermark window, the deduplication state for it will already have been evicted and it will pass through as a new event; I want to flag it as a duplicate and capture it separately, as above.
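
One direction I have been considering for this is keeping my own long-lived registry of already-seen session ids and checking every micro-batch against it. A rough sketch (the /tables/seen_sessions registry and the sink path are made up; a real version would want a transactional format such as Delta rather than plain parquet, and assumes the registry has been seeded):

from pyspark.sql import functions as F

def flag_late_duplicates(batch_df, batch_id):
    spark = batch_df.sparkSession
    # Registry of ids seen so far, persisted across micro-batches
    seen = spark.read.parquet("/tables/seen_sessions")  # column: session_id
    # Repeats of ids whose dedup state has long expired
    late_dups = batch_df.join(seen, on="session_id", how="inner")
    fresh = batch_df.join(seen, on="session_id", how="left_anti")
    late_dups.write.mode("append").parquet("/sinks/late_duplicates")
    # Record the newly seen ids for future batches
    fresh.select("session_id").distinct() \
         .write.mode("append").parquet("/tables/seen_sessions")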

Can someone advise how I can achieve this in a streaming environment?

boring-coder
  • How about creating a window function, with a row_number partitioned by session_id? That way, at the end, you can filter by row_number: row_number > 1 would be your dups and row_number == 1 would be the ones to keep – blade Jul 09 '23 at 00:09
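
A minimal sketch of what this comment proposes, assuming events can be ordered by the timestamp column (note that row_number over a plain Window is not supported on streaming DataFrames, so in the streaming job it would have to run inside foreachBatch as above):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank events per session_id by arrival time: rn == 1 is the row to keep,
# rn > 1 are the duplicates to capture
w = Window.partitionBy("session_id").orderBy("timestamp")
ranked = df.withColumn("rn", F.row_number().over(w))
unique_df = ranked.filter(F.col("rn") == 1).drop("rn")
dropped_df = ranked.filter(F.col("rn") > 1).drop("rn")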

0 Answers