1

I am using Azure Data Factory to copy source data into a landing zone (ADLS Gen2) and then using Auto Loader to load it into bronze Delta tables. Everything works perfectly, except that I am not able to get pipeline_name, runid and trigger_time through as derived columns in the Parquet files along with the input source table data.

The schema here is a StructType built from the actual source SQL Server table schema; it does not cover the additional derived columns from ADF.
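Roughly, it looks like the sketch below (the column names are just placeholders for the real SQL Server columns); the point is that pipeline_name, runid and trigger_time are not part of it:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

# Placeholder columns; the real schema mirrors the source SQL Server table.
# pipeline_name, runid and trigger_time are NOT included here.
schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("modified_date", TimestampType(), True),
])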

from pyspark.sql.functions import input_file_name, current_timestamp

sourceFilePath = 'wasbs://landing-zone@dlslandingzonedev.blob.core.windows.net/' \
   + domain_name + '/' + database_name + '/' \
   + schema_table_name.replace(database_name+'.','') + '/'

df = (spark
     .readStream
     .format("cloudFiles")
     .options(**cloudFile)
     .schema(schema)
     .option("rescuedDataColumn", "_rescued_data")
     .load(sourceFilePath)
     )

# Traceability columns
df = (
  df.withColumn("audit_fileName", input_file_name())
    .withColumn("audit_createdTimestamp", current_timestamp())
)

Here is the writeStream query:

streamQuery = (df
           .writeStream
           .format("delta")
           .outputMode("append")
           .trigger(once=True)
           .queryName(queryName)
           .option("checkpointLocation", checkpointLocation)
           .option("mergeSchema", "true")
           .start(tablePath)
          )

With mergeSchema set to true, I was expecting the stream to detect the 3 additional columns from Data Factory while writing into Delta format. Is this a limitation of Parquet? Do I have to read the data as CSV/JSON instead, or do I have to add the derived columns to the schema definition?

Alex Ott

2 Answers

0

You can add the audit fields within the readStream command:

from pyspark.sql import functions as F

sourceFilePath = 'wasbs://landing-zone@dlslandingzonedev.blob.core.windows.net/' \
   + domain_name + '/' + database_name + '/' \
   + schema_table_name.replace(database_name+'.','') + '/'

df = (spark
     .readStream
     .format("cloudFiles")
     .options(**cloudFile)
     .schema(schema)
     .option("rescuedDataColumn", "_rescued_data")
     .load(sourceFilePath)
     .withColumn("audit_fileName", F.input_file_name())
     .withColumn("audit_createdTimestamp", F.current_timestamp())
)
# Just displaying the dataframe with the audit columns:
df.display()
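If the three ADF-derived columns (pipeline_name, runid, trigger_time) are actually present in the landed Parquet files, they also need to be declared in the schema passed to .schema(), since with a user-supplied schema Auto Loader only surfaces the declared columns; anything extra typically ends up in _rescued_data rather than as a top-level column. A minimal sketch, assuming those exact column names and types:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Extend the source-table schema with the columns ADF derives.
# Names and types are assumed; use whatever the copy activity actually emits.
adf_columns = [
    StructField("pipeline_name", StringType(), True),
    StructField("runid", StringType(), True),
    StructField("trigger_time", TimestampType(), True),
]
schema_with_adf = StructType(schema.fields + adf_columns)

df = (spark
     .readStream
     .format("cloudFiles")
     .options(**cloudFile)
     .schema(schema_with_adf)   # now includes the ADF-derived columns
     .option("rescuedDataColumn", "_rescued_data")
     .load(sourceFilePath)
     .withColumn("audit_fileName", F.input_file_name())
     .withColumn("audit_createdTimestamp", F.current_timestamp())
)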
0

My experience is that schema evolution has already completed by the time you add the columns in the readStream. The practice I use is to seed the table with these columns and let schema evolution work on the incoming data.

CREATE TABLE IF NOT EXISTS db.tbl
(
  file_path               STRING,
  created_timestamp       TIMESTAMP,
  last_modified_timestamp TIMESTAMP,
  created_userid          STRING,
  _rescued_data           STRING
)
USING DELTA...
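For reference, a rough PySpark equivalent of that seeding, assuming the Delta Lake Python API (delta-spark) is available and db.tbl is the target bronze table:

from delta.tables import DeltaTable

# Seed the table with the audit columns up front; the Auto Loader stream
# with mergeSchema then evolves the remaining source columns onto it.
(DeltaTable.createIfNotExists(spark)
    .tableName("db.tbl")
    .addColumn("file_path", "STRING")
    .addColumn("created_timestamp", "TIMESTAMP")
    .addColumn("last_modified_timestamp", "TIMESTAMP")
    .addColumn("created_userid", "STRING")
    .addColumn("_rescued_data", "STRING")
    .execute())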
Chris de Groot