0

I'm implementing a streaming pipeline that reads from one dataset and writes to another using Databricks Auto Loader.

How can I apply some custom modification code to the read data before writing? E.g. something like this:

def my_modification(df):
    """Project *df* onto the column list produced by prepare_column_list.

    The incoming DataFrame's schema is passed to prepare_column_list
    (defined elsewhere), and the resulting list of columns/expressions
    is applied via select(). Returns the transformed DataFrame.
    """
    return df.select(prepare_column_list(df.schema))

# Stream JSON files with Auto Loader, enrich with file metadata, apply the
# custom modification, and write the result to a Delta table.
#
# Fix: a plain Python function cannot be invoked as a DataFrame method
# (`.my_modification()` would raise AttributeError). The idiomatic way to
# chain a user-defined df -> df function into the plan is
# DataFrame.transform(func), which calls func(df) and returns its result —
# so the modification runs before the stream is written and before the
# source/target schemas are compared.
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .load(file_path)
  .select("*", col("_metadata.file_path").alias("source_file"), current_timestamp().alias("processing_time"))
  .transform(my_modification)  # equivalent to my_modification(df), kept chainable
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)
  .toTable(table_name))
archjkeee
  • 13
  • 4

0 Answers