I have a Structured Streaming job in Databricks that streams data into a Delta Lake table, and I'm trying to drop duplicates while streaming so that my Delta data ends up with no duplicates. Here's what I have so far:
from pyspark.sql.types import StructType

inputPath = "my_input_path"
schema = StructType("some_schema")  # placeholder for my actual schema

eventsDF = (
    spark
    .readStream
    .schema(schema)
    .option("header", "true")
    .option("maxFilesPerTrigger", 1)  # process one new file per micro-batch
    .csv(inputPath)
)
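For context: I know Structured Streaming can deduplicate within the stream itself using dropDuplicates, roughly like this (just a sketch on my part; deviceId is my real key, but eventTime is a hypothetical timestamp column from my schema):

dedupedDF = (
    eventsDF
    .withWatermark("eventTime", "10 minutes")   # bound how long dedup state is kept
    .dropDuplicates(["deviceId", "eventTime"])  # dedup within the stream's own state
)

As I understand it, though, that only removes duplicates against the stream's own state, not against rows already written to the Delta table, which is why I'm trying a MERGE in foreachBatch instead: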
def upsertToDelta(eventsDF, batchId):
    # Expose the micro-batch DataFrame to SQL as a temp view
    eventsDF.createOrReplaceTempView("updates")
    # Insert only rows whose deviceId isn't already in the target
    eventsDF._jdf.sparkSession().sql("""
        MERGE INTO eventsDF t
        USING updates s
        ON s.deviceId = t.deviceId
        WHEN NOT MATCHED THEN INSERT *
    """)
writePath = "my_write_path"
checkpointPath = writePath + "/_checkpoint"  # streaming progress is tracked here

deltaStreamingQuery = (
    eventsDF
    .writeStream
    .format("delta")
    .foreachBatch(upsertToDelta)  # each micro-batch goes through the merge above
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .queryName("test")
    .start(writePath)
)
I'm getting the error: py4j.protocol.Py4JJavaError: An error occurred while calling o398.sql.
: org.apache.spark.sql.AnalysisException: Table or view not found: eventsDF; line 2 pos 4
But I've only just started streaming this data and haven't created any table yet.
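Is the problem that MERGE INTO eventsDF is being parsed as a reference to a table literally named eventsDF (which doesn't exist), rather than to my DataFrame? If so, I'm guessing I need to materialize the target Delta table first and then merge into it by path with delta.`...`, which I believe Delta SQL supports. Something like this sketch (the empty initial write to create the table is my own assumption):

# One-time setup: create an empty Delta table at the target path with my schema
spark.createDataFrame([], schema).write.format("delta").save(writePath)

def upsertToDelta(microBatchDF, batchId):
    microBatchDF.createOrReplaceTempView("updates")
    # Merge into the Delta table at writePath rather than a (nonexistent) table name
    microBatchDF._jdf.sparkSession().sql(f"""
        MERGE INTO delta.`{writePath}` t
        USING updates s
        ON s.deviceId = t.deviceId
        WHEN NOT MATCHED THEN INSERT *
    """)

Would that be the right approach, or is there a cleaner pattern for deduplicating a stream into Delta?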