
I am trying to upsert events from Kafka into a Delta Lake table with the code below. While the script is running, new events come in fine and values in the Delta table are updated according to the merge condition. But when I stop execution and then rerun the upsert script, Delta Lake does not seem to upsert each row of my streaming DataFrame in the same sequence as the events were processed while the script was already running. The merge function no longer recognizes a match between the key of an update event and the key that should already exist in the Delta table; it just inserts every row, even though the key should already have been inserted by a previous event.

Can anyone explain whether it is possible to replay events from the beginning as upserts into a Delta table? If so, how would you do it?

What I want:

  1. Event with key a comes in, key and value of a are inserted as a new row
  2. Event with key b comes in, key and value of b are inserted as a new row
  3. Event with key a comes in, the value of a is updated

What happens when I restart reading the stream and the Delta Lake merge function runs:

  1. Event with key a comes in, key and value of a are inserted as a new row
  2. Event with key b comes in, key and value of b are inserted as a new row
  3. Event with key a comes in, key and value of a are inserted again as a new row instead of updating the existing one

What I hoped would happen: writeStream with foreachBatch starts with the earliest micro-batch from Kafka, and upsertToDelta then applies the rows of each micro-batch sequentially from the start, so that later events for a key match the rows that were already merged.
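
To make that concrete, here is a rough batch sketch (not my actual job, just the behaviour I am after) that applies the three events above one at a time against the same mydeltable table and my_key/value columns used in my code below:

from delta.tables import DeltaTable
from pyspark.sql import Row

target = DeltaTable.forName(spark, "mydeltable")

# apply the events one by one; the third merge should hit whenMatchedUpdate
for my_key, value in [("a", "first a"), ("b", "first b"), ("a", "second a")]:
    update = spark.createDataFrame([Row(my_key=my_key, value=value)])
    target.alias("events") \
        .merge(update.alias("updates"), "events.my_key = updates.my_key") \
        .whenMatchedUpdate(set={"value": "updates.value"}) \
        .whenNotMatchedInsert(values={"my_key": "updates.my_key", "value": "updates.value"}) \
        .execute()
# expected end state: two rows, with key a holding the value from the third event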

My code:

from delta.tables import *
from pyspark.sql.functions import *  # includes col, expr and udf
import json

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "[HOST]") \
    .option("subscribe", "[topic]") \
    .option("includeHeaders", "true") \
    .option("startingOffsets", "earliest") \
    [...].load()

dfgrouped = df.selectExpr("CAST(value AS STRING)")
records = dfgrouped.withColumn("my_key", my_udf(dfgrouped["value"]))
# my_udf is a custom function that derives a key from the row value

deltaTable = DeltaTable.forName(spark, "mydeltable")

@udf
def get_updated_value(my_key, update_value, events_value):
    [...]
    return blob

@udf
def get_new_value(my_key, update_value):
    [...]
    return blob
    
def upsertToDelta(updatesDF, batch_id):
    # called by foreachBatch with each micro-batch DataFrame and its batch id
    deltaTable.alias("events") \
    .merge(
        source = updatesDF.alias("updates"),
        condition = expr("events.my_key = updates.my_key") # after restarting, this never matches and every row is inserted
    ) \
    .whenMatchedUpdate(set =
        {
        "value": get_updated_value(col("updates.my_key"), col("updates.value"), col("events.value"))
        }
    ) \
    .whenNotMatchedInsert(values =
        {
        "my_key": col("updates.my_key"),
        "value": get_new_value(col("updates.my_key"), col("updates.value"))
        }
    ) \
    .execute()
    
records.writeStream \
    .format("kafka") \
    .foreachBatch(upsertToDelta) \
    .outputMode("update") \
    .option("checkpointLocation", "/delta/events/_checkpoints/[CHECKPOINT]") \
    .option("kafka.bootstrap.servers", "[HOST]") \
    .option("topic", "[SINK_TOPIC]") \
    [...].start()    

1 Answer

Delta Lake only supports the append and complete output modes when acting as a streaming sink, i.e. you can either append all records to the table or replace the whole table: https://docs.delta.io/latest/delta-streaming.html#delta-table-as-a-sink
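
Writing the stream straight into the table in append mode would look roughly like this (the table path and checkpoint location are placeholders):

records.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/delta/events/_checkpoints/") \
    .start("/delta/events")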

Alex Michel