We have Kafka topics that are processed every day as a batch process. When the pipeline is triggered, we first load the Kafka data as-is to an ADLS location (Landing).
The data received on Kafka is all CDC data, with KEY, VALUE, OFFSET, TIMESTAMP, and TOPICS columns. Delta tables are created for this Kafka data in the landing layer. In the RAW layer, we process these tables and explode the VALUE column to create or append to existing tables in raw_db.
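For context, each VALUE is a JSON document that looks roughly like the simplified example below (this is only illustrative, based on the before/after/op keys my code relies on; the exact structure depends on the CDC connector):

# Illustrative only -- a simplified VALUE payload as I understand it
sample_value = {
    "before": None,                          # row image before the change (null for inserts)
    "after": {"id": "101", "name": "abc"},   # row image after the change
    "op": "c"                                # operation code, e.g. c=insert, u=update, d=delete
}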
But what I am seeing is that every day the data is being overwritten, and I am not clear how to append the data to the existing tables. I have seen references to dlt.read_stream() and spark.readStream(), but it is not clear how to apply them in this scenario where the source is another database.
I would appreciate any help, suggestions, or sample code.
Here is the code I am currently using:
import json
import dlt
import time
from pyspark.sql.functions import from_json, to_json, col, lit, coalesce, date_format
from pyspark.sql.types import StructType, StructField, StringType, LongType
# Registers a DLT table that materializes the given DataFrame
def generate_table(table_name, df):
    @dlt.table(
        name=table_name,
        comment="Raw data for " + str(table_name),
        path=f"{raw_file_path}/{table_name}"
    )
    def create_live_table():
        return df
    return
# table_column_schema is a dictionary containing TABLENAME as key and its column schema as value
for tbl_name in table_column_schema:
    # Batch read (not a stream) of the landing table
    df = spark.sql(f'select value, offset, timestamp from {db}.{tbl_name}')
    tbl_columns = table_column_schema[tbl_name]
    # Look at only the first record to decide between the before/after image
    jsonkeys = df.collect()[0][0]
    json_keys = json.loads(jsonkeys)
    data = []
    for i in tbl_columns:
        data.append(StructField(i, StringType(), True))
    schema = StructType(data)   # built from table_column_schema, currently unused
    # Infer the schema of the JSON payload from the data itself
    schema2 = spark.read.json(df.rdd.map(lambda row: row.value)).schema
    if json_keys["op"] == 'd':
        df2 = df.withColumn("value", from_json("value", schema2)).select(col('value.before.*'), col('value.op'))
    else:
        df2 = df.withColumn("value", from_json("value", schema2)).select(col('value.after.*'), col('value.op'))
    generate_table(tbl_name, df2)
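For completeness, here is a rough sketch of what I think the streaming version might look like, based on the dlt.read_stream() / spark.readStream references I have seen. I have not verified this: I am assuming the landing Delta tables can be read incrementally with spark.readStream.table(), and value_schema is a placeholder for a pre-defined schema, since I assume the schema inference I do above would not work on a streaming read.

# Sketch only -- not verified
def generate_streaming_table(table_name, value_schema):
    @dlt.table(
        name=table_name,
        comment="Raw data for " + str(table_name),
        path=f"{raw_file_path}/{table_name}"
    )
    def create_live_table():
        # Read the landing Delta table incrementally so each run appends
        # only the new records instead of overwriting the table
        return (
            spark.readStream.table(f"{db}.{table_name}")
                 .withColumn("value", from_json(col("value"), value_schema))
                 .select(col("value.after.*"), col("value.op"))
        )

Is this the right direction, and if so, how would I handle the before/after logic for deletes in a streaming context?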