
We are processing a stream of web logs: activities that users perform on a website. For each activity type, we have a separate activity Delta table.

We are exploring the best way to do streaming ingest. We have a Kafka stream set up where all the activities arrive in the following format, but depending on the activity, we need to route the event to a different target table.

{
  "activity_name": "Purchased",
  "data": {
    "product": "Soap",
    "amount": 1200
  }
}
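To make the routing question concrete, here is a minimal sketch of the decision in plain Python (the table names and the `route_event` helper are hypothetical): parse the event and map `activity_name` to a target table.

```python
import json

# Hypothetical mapping from activity name to target Delta table
TARGET_TABLES = {
    "Purchased": "activity_purchased",
    "PageView": "activity_pageview",
}

def route_event(raw: str):
    """Parse a raw JSON event and return (target_table, payload)."""
    event = json.loads(raw)
    table = TARGET_TABLES[event["activity_name"]]
    return table, event["data"]

table, payload = route_event(
    '{"activity_name": "Purchased", "data": {"product": "Soap", "amount": 1200}}'
)
```

In a real pipeline this lookup happens per microbatch on the streaming DataFrame rather than per record in Python.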

Can you help with what is the best way to handle this scenario?

Alex Ott
Ravi Patel
1 Answer


This is called multiplexing. Usually the solution is to use Structured Streaming with the .foreachBatch function as a sink, and inside that function write the data for each of the possible values of activity_name.

Something like this (for example, as shown in this blog post):

activity_names = ["Purchased", ...]
app_name = "my_app_name"

def write_tables(df, epoch):
  # Cache the microbatch so it isn't recomputed for every filter below
  df.cache()
  for n in activity_names:
    df.filter(f"activity_name = '{n}'") \
      .write.format("delta") \
      .mode("append") \
      .option("txnVersion", epoch) \
      .option("txnAppId", app_name) \
      .save(...)
  df.unpersist()


stream_df.writeStream \
  .foreachBatch(write_tables) \
  .option("checkpointLocation", "some_path") \
  .start()

Please note that we're using idempotent writes for Delta tables (the txnVersion/txnAppId options) to avoid duplicates if a microbatch is restarted in the middle of execution.
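The idempotency mechanism can be sketched conceptually in plain Python (this is not Delta's actual implementation, just the idea): each table remembers the last committed txnVersion per txnAppId, and a write whose version is not greater than that is silently skipped.

```python
# Conceptual sketch of Delta's idempotent-write behavior (not the real
# implementation): track the last committed version per application id,
# and skip any write that replays an already-committed version.
committed = {}  # txnAppId -> last committed txnVersion

def try_write(app_id, version, write_fn):
    """Apply write_fn only if (app_id, version) has not committed yet."""
    if committed.get(app_id, -1) >= version:
        return False  # replay of a finished microbatch: skipped
    write_fn()
    committed[app_id] = version
    return True

rows = []
try_write("my_app_name", 0, lambda: rows.append("batch-0"))   # written
try_write("my_app_name", 0, lambda: rows.append("batch-0"))   # skipped
```

This is why a restarted microbatch with the same epoch id does not duplicate data: the replayed write is recognized by its (txnAppId, txnVersion) pair and dropped.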

Alex Ott
  • Thank you for sharing this. Quick question, can the "txnVersion" and "txnAppId" come from the dataframe columns being appended? – bda Feb 09 '23 at 22:47
  • Theoretically yes, but you will need to do loops, etc. – Alex Ott Feb 09 '23 at 23:33
  • I see, I am guessing there is no simple way to take the "txnVersion" and "txnAppId" directly from the designated columns in the dataframe that is being appended at the time of the append? P.S. I like the purpose the idempotent writes feature serves, but it's difficult to use it when appending a large batch of rows. – bda Feb 10 '23 at 02:56
  • One batch is one transaction with the same txnVersion; it either succeeds or fails. Why do you need it applied at the row level? – Alex Ott Feb 10 '23 at 03:19
  • Good question. It prompts me to think more now. Thank you. – bda Feb 10 '23 at 18:28
  • is it possible to use "txnVersion"/"txnAppId" with standard dataframe batch .write, not only with .writeStream? In my use case I am not writing from a streaming source, but from a set of files read once... – bda Feb 14 '23 at 16:19
  • Why not? Although I'm not sure about the use case. – Alex Ott Feb 14 '23 at 18:26
  • My use case is to avoid appending the same data more than once (from the input files) in case the prior append completed partially and failed for whatever reason. – bda Feb 14 '23 at 20:42
  • You can try to use the COPY INTO command – Alex Ott Feb 14 '23 at 21:18
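For the batch-from-files case in the comments, COPY INTO gives file-level idempotency: it tracks which source files have already been loaded and skips them on re-runs. A hedged sketch (table and path names are hypothetical):

```sql
-- COPY INTO loads each source file at most once, so re-running after a
-- partial failure does not duplicate data (names here are illustrative)
COPY INTO my_schema.activity_purchased
FROM '/mnt/raw/activities/'
FILEFORMAT = JSON
```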