I have a PySpark Structured Streaming job that drops duplicate events by session id, with a 30-minute watermark. Snippet:
from pyspark.sql.functions import current_timestamp

unique_df = (
    df.withColumn("timestamp", current_timestamp())  # processing time, not event time
      .withWatermark("timestamp", "30 minutes")      # watermark must come before the stateful op
      .dropDuplicates(["session_id"])
)
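For what it's worth, my understanding from the docs is that plain dropDuplicates only gets its state cleaned up when the event-time column is part of the dedup subset, so deduping on session_id alone grows state indefinitely even with a watermark. That is why I am also looking at dropDuplicatesWithinWatermark (Spark 3.5+). A minimal sketch of what I mean, with a rate source standing in for my real input:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

# Toy stream with columns (timestamp, value); session_id is synthesized.
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .withColumn("session_id", (col("value") % 100).cast("string"))
)

unique_df = (
    events.withWatermark("timestamp", "30 minutes")
    # Dedup on session_id only; per-key state is evicted once the watermark
    # moves past the 30-minute window, bounding the state store.
    .dropDuplicatesWithinWatermark(["session_id"])
)

query = unique_df.writeStream.format("console").outputMode("append").start()

This still silently drops the duplicates, though, which brings me to my asks.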
I have two asks. First, I want to capture all the events that the code above drops. I tried exceptAll and joins, but neither worked because of the streaming nature of the job:
dropped_df = df.exceptAll(unique_df)
And
dropped_events = df.join(unique_df, on="session_id", how="left_anti")
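The direction I am exploring instead is to replace dropDuplicates with a stateful operator that tags every event rather than dropping it, so duplicates stay in the stream and can be routed to a separate sink. A rough sketch with applyInPandasWithState (Spark 3.4+; tag_duplicates, the schemas, and the is_duplicate column are my own names, and I have not battle-tested this):

import pandas as pd
from typing import Iterator, Tuple

from pyspark.sql.streaming.state import GroupState, GroupStateTimeout
from pyspark.sql.types import (
    BooleanType, StringType, StructField, StructType, TimestampType,
)

output_schema = StructType([
    StructField("session_id", StringType()),
    StructField("timestamp", TimestampType()),
    StructField("is_duplicate", BooleanType()),
])
state_schema = StructType([StructField("seen", BooleanType())])

def tag_duplicates(
    key: Tuple[str], pdfs: Iterator[pd.DataFrame], state: GroupState
) -> Iterator[pd.DataFrame]:
    if state.hasTimedOut:
        # No events for this key within the timeout: forget it, so a much
        # later arrival would start over as unique.
        state.remove()
        return
    seen = state.exists
    for pdf in pdfs:
        pdf = pdf.sort_values("timestamp")
        flags = []
        for _ in range(len(pdf)):
            flags.append(seen)  # first-ever event for the key: False; repeats: True
            seen = True
        yield pdf.assign(is_duplicate=flags)[
            ["session_id", "timestamp", "is_duplicate"]
        ]
    state.update((True,))
    state.setTimeoutDuration(30 * 60 * 1000)  # evict after 30 min idle

tagged_df = (
    df.groupBy("session_id")
      .applyInPandasWithState(
          tag_duplicates,
          outputStructType=output_schema,
          stateStructType=state_schema,
          outputMode="append",
          timeoutConf=GroupStateTimeout.ProcessingTimeTimeout,
      )
)

Does this look like a reasonable substitute for dropDuplicates, or am I reinventing something the engine already provides?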
Second, suppose session_id=abc123 arrives at 9:00 and, due to some glitch, session_id=abc123 shows up again at 11:00. Since the repeat falls outside the watermark window, I want to mark it as a duplicate and capture it separately, just like above.
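If the tagging sketch above holds up, both asks seem to reduce to a timeout knob plus routing inside foreachBatch: raise setTimeoutDuration above the two-hour gap so the 11:00 repeat still comes out with is_duplicate=True, then split each micro-batch on the flag (paths are placeholders):

def route_batch(batch_df, batch_id):
    # The stateful pipeline runs once per micro-batch; split its output here.
    batch_df.persist()
    batch_df.filter("is_duplicate").write.mode("append").parquet("/tmp/dropped_events")
    batch_df.filter("NOT is_duplicate").write.mode("append").parquet("/tmp/unique_events")
    batch_df.unpersist()

query = (
    tagged_df.writeStream
    .foreachBatch(route_batch)
    .option("checkpointLocation", "/tmp/dedup_checkpoint")
    .start()
)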
Can someone advise how I can achieve this in a streaming environment?