I have a Structured Streaming job in Databricks that streams data into a Delta Lake table, and I'm trying to drop duplicates while streaming so that my Delta data ends up with no duplicates. Here's what I have so far:
from pyspark.sql.types import StructType

inputPath = "my_input_path"
schema = StructType("some_schema")  # placeholder for my actual schema

eventsDF = (
    spark
    .readStream
    .schema(schema)
    .option("header", "true")
    .option("maxFilesPerTrigger", 1)  # process one new file per micro-batch
    .csv(inputPath)
)
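For context: I know Structured Streaming can deduplicate within the stream itself using dropDuplicates, roughly like this (just a sketch on my part; deviceId is my real key, but eventTime is a hypothetical timestamp column from my schema):

dedupedDF = (
    eventsDF
    .withWatermark("eventTime", "10 minutes")   # bound how long dedup state is kept
    .dropDuplicates(["deviceId", "eventTime"])  # dedup within the stream's own state
)

As I understand it, though, that only removes duplicates against the stream's own state, not against rows already written to the Delta table, which is why I'm trying a MERGE in foreachBatch instead: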
def upsertToDelta(eventsDF, batchId):
    # Expose the micro-batch DataFrame to SQL as a temp view
    eventsDF.createOrReplaceTempView("updates")
    # Insert only rows whose deviceId isn't already in the target
    eventsDF._jdf.sparkSession().sql("""
        MERGE INTO eventsDF t
        USING updates s
        ON s.deviceId = t.deviceId
        WHEN NOT MATCHED THEN INSERT *
    """)
writePath = "my_write_path"
checkpointPath = writePath + "/_checkpoint"  # streaming progress is tracked here

deltaStreamingQuery = (
    eventsDF
    .writeStream
    .format("delta")
    .foreachBatch(upsertToDelta)  # each micro-batch goes through the merge above
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .queryName("test")
    .start(writePath)
)
I'm getting the error: py4j.protocol.Py4JJavaError: An error occurred while calling o398.sql.
: org.apache.spark.sql.AnalysisException: Table or view not found: eventsDF; line 2 pos 4
But I've only just started streaming this data and haven't created any table yet.
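Is the problem that MERGE INTO eventsDF is being parsed as a reference to a table literally named eventsDF (which doesn't exist), rather than to my DataFrame? If so, I'm guessing I need to materialize the target Delta table first and then merge into it by path with delta.`...`, which I believe Delta SQL supports. Something like this sketch (the empty initial write to create the table is my own assumption):

# One-time setup: create an empty Delta table at the target path with my schema
spark.createDataFrame([], schema).write.format("delta").save(writePath)

def upsertToDelta(microBatchDF, batchId):
    microBatchDF.createOrReplaceTempView("updates")
    # Merge into the Delta table at writePath rather than a (nonexistent) table name
    microBatchDF._jdf.sparkSession().sql(f"""
        MERGE INTO delta.`{writePath}` t
        USING updates s
        ON s.deviceId = t.deviceId
        WHEN NOT MATCHED THEN INSERT *
    """)

Would that be the right approach, or is there a cleaner pattern for deduplicating a stream into Delta?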