I am using Azure Data Factory to copy source data into a landing zone (ADLS Gen2) and then Auto Loader to load it into bronze Delta tables. Everything works, except that the columns derived in the ADF copy (pipeline_name, runid and trigger_time) do not come through from the parquet files alongside the source table columns.
The schema here is a StructType built from the actual source SQL Server table schema; it does not cover the additional columns derived in ADF.
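For illustration, the StructType only contains the source table's columns (the column names below are placeholders, not the real ones) and nothing for pipeline_name, runid or trigger_time:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType

# Placeholder columns standing in for the real SQL Server table schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("modified_date", TimestampType(), True),
    # no pipeline_name / runid / trigger_time here - those are added by ADF
])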
# Landing-zone folder for this table: <domain>/<database>/<schema.table>/
sourceFilePath = 'wasbs://landing-zone@dlslandingzonedev.blob.core.windows.net/' \
    + domain_name + '/' + database_name + '/' \
    + schema_table_name.replace(database_name + '.', '') + '/'
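cloudFile is the Auto Loader options dict; simplified, it is along these lines (the values here are placeholders rather than my exact settings):

cloudFile = {
    "cloudFiles.format": "parquet",          # ADF copy lands parquet files
    "cloudFiles.useNotifications": "false",  # directory-listing mode (placeholder)
}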
df = (spark
      .readStream
      .format("cloudFiles")                           # Auto Loader
      .options(**cloudFile)
      .schema(schema)                                 # explicit schema (SQL Server columns only)
      .option("rescuedDataColumn", "_rescued_data")   # capture data that does not fit the schema
      .load(sourceFilePath)
      )
# Traceability columns
from pyspark.sql.functions import input_file_name, current_timestamp

df = (
    df.withColumn("audit_fileName", input_file_name())            # source file each row came from
      .withColumn("audit_createdTimestamp", current_timestamp())  # ingestion timestamp
)
Here is the writeStream:
streamQuery = (df
               .writeStream
               .format("delta")
               .outputMode("append")
               .trigger(once=True)                                 # process available files, then stop
               .queryName(queryName)
               .option("checkpointLocation", checkpointLocation)
               .option("mergeSchema", "true")                      # allow new columns in the target Delta table
               .start(tablePath)
               )
With mergeSchema set to true, I was expecting the stream to detect the three additional columns from Data Factory while writing to Delta. Is this a limitation of parquet? Do I have to read the data as CSV/JSON instead, or do I have to add the derived columns to the schema definition?
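By "add the derived columns to the schema definition" I mean explicitly appending the ADF columns to the read schema, roughly like this (names and types are my assumption of what ADF writes):

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Assumed names/types for the columns ADF derives during the copy;
# trigger_time may actually land as a string depending on the ADF mapping
adf_columns = [
    StructField("pipeline_name", StringType(), True),
    StructField("runid", StringType(), True),
    StructField("trigger_time", TimestampType(), True),
]
schema_with_adf = StructType(schema.fields + adf_columns)

df = (spark
      .readStream
      .format("cloudFiles")
      .options(**cloudFile)
      .schema(schema_with_adf)                        # read schema now covers the ADF columns
      .option("rescuedDataColumn", "_rescued_data")
      .load(sourceFilePath)
      )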