
We have Kafka topics that are processed every day as a batch process. When the pipeline is triggered, we first load the Kafka data as-is to an ADLS location (Landing).

The data received from Kafka is all CDC data, with KEY, VALUE, OFFSET, TIMESTAMP, and TOPIC columns. Delta tables are created for this Kafka data in the Landing layer. In the RAW layer we then process these tables and explode the VALUE column to create, or append to, tables in raw_db.

But what I am seeing is that every day the data is overwritten, and I am not clear on how to append to the existing tables instead. I see some references to dlt.read_stream() and spark.readStream(), but it is not clear how to apply them in this scenario where the source is another database.
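For example, I am not sure whether something along these lines (just a sketch with placeholder names, not code I have working) would be the right direction when the source is a table in another database rather than a file location:

import dlt

@dlt.table(name="some_raw_table")  # placeholder name
def some_raw_table():
    # Would a streaming read against a landing table in another database
    # append new records to the target instead of overwriting it?
    return spark.readStream.table("landing_db.some_landing_table")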

I would appreciate any help, suggestions, or sample code.

Here is the code I am currently using:

import json
import dlt
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

def generate_table(table_name, df):
    @dlt.table(
        name=table_name,
        comment="Raw data for " + str(table_name),
        path=f"{raw_file_path}/{table_name}"
    )
    def create_live_table():
        return df
    return

# table_column_schema is a dictionary containing TABLENAME as key and its schema as values
for tbl_name in table_column_schema:
    df = spark.sql(f'select value, offset, timestamp from {db}.{tbl_name}')
    tbl_columns = table_column_schema[tbl_name]

    # Peek at the first record to determine the CDC operation type
    jsonkeys = df.first()[0]
    json_keys = json.loads(jsonkeys)

    data = []
    for i in tbl_columns:
        data.append(StructField(f'{i}', StringType(), True))
    schema = StructType(data)  # currently unused; schema2 below is inferred instead

    # Infer the schema of the JSON payload in the VALUE column
    schema2 = spark.read.json(df.rdd.map(lambda row: row.value)).schema

    if json_keys["op"] == 'd':
        df2 = df.withColumn("value", from_json("value", schema2)).select(col('value.before.*'), col('value.op'))
    else:
        df2 = df.withColumn("value", from_json("value", schema2)).select(col('value.after.*'), col('value.op'))

    generate_table(tbl_name, df2)
Yuva
  • Where exactly are you writing the data? – JayashankarGS Jun 21 '23 at 05:32
  • It is CDC data that we receive via Kafka, and I am trying to append it to the existing table. The data contains an operation flag, primary key columns, etc. The CDC would be applied in another notebook using DLT apply_changes(). – Yuva Jun 21 '23 at 06:09

1 Answer


Yuva, in the code above you only seem to read the data, not write it.

You describe the append logic, but to implement it you also have to write the data to the data lake using a Spark write mode.
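For a plain batch job that would look something like this (a sketch; the target table name is a placeholder):

# Append the processed batch to an existing Delta table in raw_db
(df2.write
    .format("delta")
    .mode("append")
    .saveAsTable("raw_db.some_table"))  # placeholder target table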

That said, there is a simpler way in Spark called Structured Streaming, which handles streaming data and can work with the CDC feed as well.

Here is reference code showing how you can read and write using an append/incremental load.

Reference Code:

READ:

df = spark.readStream.table("Input_Table")

WRITE:

(df.writeStream
    .outputMode("append")
    .option("checkpointLocation", "/path")
    .toTable("Output_Table"))
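And if you are doing this in a DLT pipeline and want the CDC operations applied for you, dlt.apply_changes() handles the merge. A rough sketch, assuming the VALUE column has already been parsed into columns and using placeholder names for the landing table, key, and ordering columns:

import dlt
from pyspark.sql.functions import col, expr

@dlt.view(name="some_table_cdc")
def some_table_cdc():
    # Streaming read of the parsed CDC records (placeholder table name)
    return spark.readStream.table("landing_db.some_landing_table")

dlt.create_streaming_table("raw_some_table")

dlt.apply_changes(
    target="raw_some_table",
    source="some_table_cdc",
    keys=["id"],                        # primary key column(s) - placeholder
    sequence_by=col("timestamp"),       # ordering column for late/duplicate events
    apply_as_deletes=expr("op = 'd'"),  # rows flagged as deletes in the CDC feed
    except_column_list=["op"],          # do not carry the operation flag into the target
)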
Mobeen
  • Can this be used in DLT pipelines? Also, can you help me with the above code I have shared, please? – Yuva Jun 23 '23 at 09:28
  • Hi Yuva, you can follow this documentation; it is very much self-explanatory. If you need further assistance, let me know. https://docs.databricks.com/delta-live-tables/cdc.html – Mobeen Jun 23 '23 at 10:50