
I have a streaming job that streams data into Delta Lake on Databricks Spark, and I'm trying to drop duplicates while streaming so that my Delta data has no duplicates. Here's what I have so far:

from pyspark.sql.types import StructType

inputPath = "my_input_path"

schema = StructType("some_schema")

eventsDF = (
  spark
    .readStream
    .schema(schema)
    .option("header", "true")
    .option("maxFilesPerTrigger", 1)
    .csv(inputPath)
)

def upsertToDelta(eventsDF, batchId): 
  eventsDF.createOrReplaceTempView("updates")

  eventsDF._jdf.sparkSession().sql("""
    MERGE INTO eventsDF t
    USING updates s
    ON s.deviceId = t.deviceId
    WHEN NOT MATCHED THEN INSERT *
  """)


writePath = "my_write_path"
checkpointPath = writePath + "/_checkpoint"

deltaStreamingQuery = (eventsDF
  .writeStream
  .format("delta")
  .foreachBatch(upsertToDelta)
  .option("checkpointLocation", checkpointPath)
  .outputMode("append")
  .queryName("test")
  .start(writePath)
)

I'm getting the error: py4j.protocol.Py4JJavaError: An error occurred while calling o398.sql. : org.apache.spark.sql.AnalysisException: Table or view not found: eventsDF; line 2 pos 4

But I just started to stream this data and haven't created any table yet.


1 Answer


I see two issues with your code:

  1. Since you are calling a MERGE statement inside the upsertToDelta function, Spark expects a target table with which the "updates" temp view can be merged.

In the code:

MERGE INTO eventsDF t
USING updates s
ON s.deviceId = t.deviceId
WHEN NOT MATCHED THEN INSERT *

Here, eventsDF is supposed to be the name of an existing target table. Since no such table exists yet, Spark throws the "Table or view not found: eventsDF" error; you need to create the target Delta table before starting the stream (see the sketch after this list).

  2. The MERGE statement itself joins the temp view with the target table and inserts into the target table when not matched.
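To fix the first issue, create the target Delta table before the stream starts and reference its name in the MERGE. A minimal sketch, assuming a hypothetical table named events (only deviceId is shown, since the rest of the schema is not given in the question):

spark.sql("""
  CREATE TABLE IF NOT EXISTS events (
    deviceId STRING
    -- ...the rest of your columns...
  ) USING DELTA
""")

The merge statement then becomes MERGE INTO events t USING updates s ....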

Therefore, passing writePath to start() is not required: foreachBatch performs the write itself, so you can drop .start(writePath) and call .start() with no path.
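The query definition then looks like this (a sketch; .format("delta") is dropped as well, because foreachBatch supplies the sink):

deltaStreamingQuery = (eventsDF
  .writeStream
  .foreachBatch(upsertToDelta)
  .option("checkpointLocation", checkpointPath)
  .outputMode("update")  # update mode is what the Delta docs use for merge-based foreachBatch sinks
  .queryName("test")
  .start()
)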

Please note: you can also add a WHEN MATCHED THEN UPDATE clause to your merge statement, if you need existing rows to be updated as well.
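For example (a sketch, using the hypothetical events table from above):

MERGE INTO events t
USING updates s
ON s.deviceId = t.deviceId
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *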