1

I am using Azure Data Factory to copy source data into a landing zone (ADLS Gen2) and then using Auto Loader to load it into bronze Delta tables. Everything works perfectly, except that I am not able to get pipeline_name, runid and trigger_time through as derived columns in the Parquet files along with the input source table data.

The schema here is a StructType built from the actual source SQL Server table schema; it does not cover the additional derived columns from ADF.
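Roughly, it looks like the sketch below (the column names are just placeholders for the real SQL Server columns); the point is that pipeline_name, runid and trigger_time are not part of it:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

# Placeholder columns; the real schema mirrors the source SQL Server table.
# pipeline_name, runid and trigger_time are NOT included here.
schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("modified_date", TimestampType(), True),
])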

from pyspark.sql.functions import input_file_name, current_timestamp

sourceFilePath = 'wasbs://landing-zone@dlslandingzonedev.blob.core.windows.net/' \
   + domain_name + '/' + database_name + '/' \
   + schema_table_name.replace(database_name+'.','') + '/'

df = (spark
     .readStream
     .format("cloudFiles")
     .options(**cloudFile)
     .schema(schema)
     .option("rescuedDataColumn", "_rescued_data")
     .load(sourceFilePath)
     )

# Traceability columns
df = (
  df.withColumn("audit_fileName", input_file_name())
    .withColumn("audit_createdTimestamp", current_timestamp())
)

Here is the writeStream query:

streamQuery = (df
           .writeStream
           .format("delta")
           .outputMode("append")
           .trigger(once=True)
           .queryName(queryName)
           .option("checkpointLocation", checkpointLocation)
           .option("mergeSchema", "true")
           .start(tablePath)
          )

With mergeSchema set to true, I was expecting the stream to detect the 3 additional columns from Data Factory while writing into Delta format. Is this a limitation of Parquet? Do I have to read the data as CSV/JSON instead, or do I have to add the derived columns to the schema definition?

Alex Ott

2 Answers

0

You can add the audit fields within the readStream command:

from pyspark.sql import functions as F

sourceFilePath = 'wasbs://landing-zone@dlslandingzonedev.blob.core.windows.net/' \
   + domain_name + '/' + database_name + '/' \
   + schema_table_name.replace(database_name+'.','') + '/'

df = (spark
     .readStream
     .format("cloudFiles")
     .options(**cloudFile)
     .schema(schema)
     .option("rescuedDataColumn", "_rescued_data")
     .load(sourceFilePath)
     .withColumn("audit_fileName", F.input_file_name())
     .withColumn("audit_createdTimestamp", F.current_timestamp())
)
# Just displaying the dataframe with the audit columns:
df.display()
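If the three ADF-derived columns (pipeline_name, runid, trigger_time) are actually present in the landed Parquet files, they also need to be declared in the schema passed to .schema(), since with a user-supplied schema Auto Loader only surfaces the declared columns; anything extra typically ends up in _rescued_data rather than as a top-level column. A minimal sketch, assuming those exact column names and types:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Extend the source-table schema with the columns ADF derives.
# Names and types are assumed; use whatever the copy activity actually emits.
adf_columns = [
    StructField("pipeline_name", StringType(), True),
    StructField("runid", StringType(), True),
    StructField("trigger_time", TimestampType(), True),
]
schema_with_adf = StructType(schema.fields + adf_columns)

df = (spark
     .readStream
     .format("cloudFiles")
     .options(**cloudFile)
     .schema(schema_with_adf)   # now includes the ADF-derived columns
     .option("rescuedDataColumn", "_rescued_data")
     .load(sourceFilePath)
     .withColumn("audit_fileName", F.input_file_name())
     .withColumn("audit_createdTimestamp", F.current_timestamp())
)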
0

My experience is that schema evolution has already completed by the time you add the columns in the readStream. The practice I use is to seed the table with these columns and let schema evolution work on the incoming data.

CREATE TABLE IF NOT EXISTS db.tbl
(
  file_path               STRING,
  created_timestamp       TIMESTAMP,
  last_modified_timestamp TIMESTAMP,
  created_userid          STRING,
  _rescued_data           STRING
)
USING DELTA...
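For reference, a rough PySpark equivalent of that seeding, assuming the Delta Lake Python API (delta-spark) is available and db.tbl is the target bronze table:

from delta.tables import DeltaTable

# Seed the table with the audit columns up front; the Auto Loader stream
# with mergeSchema then evolves the remaining source columns onto it.
(DeltaTable.createIfNotExists(spark)
    .tableName("db.tbl")
    .addColumn("file_path", "STRING")
    .addColumn("created_timestamp", "TIMESTAMP")
    .addColumn("last_modified_timestamp", "TIMESTAMP")
    .addColumn("created_userid", "STRING")
    .addColumn("_rescued_data", "STRING")
    .execute())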
Chris de Groot