
I am ingesting data from the RAW layer (ADLS Gen2) into the Bronze layer in Databricks using Auto Loader. This is not real-time data but batch data: every day new files arrive in the raw path via ADF. For one of the datasets I am doing a historical load (2014-2023) as the initial load, and from then on the loads are incremental; in both Bronze and Silver the load type is append.

While ingesting into Bronze I add an extra column, "lh_load_date", to the dataframe. It holds the load date/time and tells me when the data was loaded into the lakehouse. Because of the data volume (around 1,176 million records), the initial load into Bronze is processed in multiple batches, and as a result "lh_load_date" ends up with different timestamps. I expected it to be the same across all records of the initial load. Below is an example of how I am writing to Bronze:

[screenshot: Bronze write code]
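
In simplified form it looks roughly like the following (the table name, raw path, checkpoint path and source format are placeholders, and the exact timestamp expression may differ slightly):

from pyspark.sql import functions as F

# Called once per micro-batch, so the timestamp is evaluated per batch
def write_to_bronze(micro_df, batch_id):
  (micro_df
    .withColumn("lh_load_date", F.current_timestamp())
    .write
    .format("delta")
    .mode("append")
    .saveAsTable("bronze.my_dataset"))                           # placeholder table name

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "parquet")                        # placeholder source format
  .load("abfss://raw@<storage>.dfs.core.windows.net/<path>")     # placeholder raw path
  .writeStream
  .foreachBatch(write_to_bronze)
  .option("checkpointLocation", "/mnt/checkpoints/my_dataset")   # placeholder checkpoint path
  .trigger(availableNow=True)
  .start())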

And this is an example of the "lh_load_date" values:

[screenshot: lh_load_date values]

As you can see, there is an interval of about 2 hours between them. The problem is that in Silver I drop duplicates based on all the columns, including "lh_load_date", and because the timestamps differ, these two records are not treated as duplicates.
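
For reference, the Silver dedup is essentially the following (bronze_df and silver_df are placeholder names):

# With no column list, dropDuplicates() compares ALL columns, so two otherwise
# identical rows whose lh_load_date values differ are both kept
silver_df = bronze_df.dropDuplicates()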

Can anyone suggest how to avoid this, so that "lh_load_date" has the same load date/time across all rows even when the data volume is very large?

sayan nandi

1 Answer


You are computing the value of lh_load_date in every micro-batch, hence different micro-batches end up with different timestamps.

If you are running this as a batch job that terminates after completion, you might consider defining the timestamp once at the beginning of the job and then adding it to the dataframe as a literal value.


from datetime import datetime
from pyspark.sql import functions as F

batch_ts = datetime.now()   # evaluated once, when the job starts

# ...

def overwrite_microbatch(raw_df, batchId):
  # every micro-batch reuses the same literal timestamp
  raw_df.withColumn("lh_load_date", F.lit(batch_ts)) \
    .write.format("delta")....

Also, you do not need .foreachBatch in your example, since Databricks supports Delta Lake as a streaming sink and you are only writing to one destination. You can simplify the code to:


from datetime import datetime
from pyspark.sql import functions as F

batch_ts = datetime.now()

raw_df = spark.readStream.format("cloudFiles")... # AutoLoader configuration

(
  raw_df
    .withColumn("lh_load_date", F.lit(batch_ts))   # same literal timestamp on every row
    .writeStream
    .format("delta")
    .option("mergeSchema", "true")
    .option("delta.columnMapping.mode", "name")
    .option("checkpointLocation", checkpointPath)
    # .option(...)                                 # any other options you need
    .trigger(availableNow=True)
    .toTable(f"{catalog_name}.{bronzeDatabaseName}.{tableName}")   # toTable() starts the query; no separate .start() needed
)
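
With .trigger(availableNow=True) the query processes everything that is currently available and then stops, so batch_ts is evaluated once per run and every row written in that run gets the same lh_load_date. The next daily run will compute a new batch_ts, which should be fine for the incremental append loads.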
Zach King