I have a Spark Structured Streaming job that reads data from Azure Data Lake Storage, applies some transformations, and then writes to Azure Synapse (DW). I want to log some metrics for each processed batch, but without duplicating the logs on every batch. Is there a way to log only once per export_interval instead?
Example:
from pyspark.sql.functions import input_file_name

# Auto Loader source: stream new parquet files and capture the source file path
autoloader_df = (
    spark.readStream.format("cloudFiles")
    .options(**stream_config["cloud_files"])
    .option("recursiveFileLookup", True)
    .option("maxFilesPerTrigger", sdid_workload.max_files_agg)
    .option("pathGlobFilter", "*_new.parquet")
    .schema(stream_config["schema"])
    .load(stream_config["read_path"])
    .withColumn(stream_config["file_path_column"], input_file_name())
)
stream_query = (
    autoloader_df.writeStream.format("delta")
    .trigger(availableNow=True)
    .option("checkpointLocation", stream_config["checkpoint_location"])
    .foreachBatch(
        lambda df_batch, batch_id: ingestion_process(
            df_batch, batch_id, sdid_workload, stream_config, logger=logger
        )
    )
    .start()
)
Where ingestion_process is as follows:
def ingestion_process(df_batch, batch_id, sdid_workload, stream_config, **kwargs):
    logger: AzureLogger = kwargs.get("logger")
    iteration_start_time = datetime.utcnow()
    sdid_workload.ingestion_iteration += 1
    general_transformations(sdid_workload)
    log_custom_metrics(sdid_workload)
In log_custom_metrics I'm using:
exporter = metrics_exporter.new_metrics_exporter(connection_string=appKey, export_interval=12)
view_manager.register_exporter(exporter)
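
For context, log_custom_metrics is the function called from ingestion_process above, so on every micro-batch it effectively does something like this (a simplified sketch, not my exact code; appKey holds the connection string and is set elsewhere):

from opencensus.ext.azure import metrics_exporter
from opencensus.stats import stats as stats_module

view_manager = stats_module.stats.view_manager
appKey = "<connection string>"  # placeholder, defined elsewhere in my code

def log_custom_metrics(sdid_workload):
    # a brand-new exporter is created and registered on every micro-batch
    exporter = metrics_exporter.new_metrics_exporter(
        connection_string=appKey, export_interval=12
    )
    view_manager.register_exporter(exporter)
    # ... record this batch's measurements ...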
That means a new exporter gets created and registered for every single batch, which I assume is why the metrics show up as duplicated logs. Is there a way to register the exporter once and let it export on its own export_interval, instead of effectively logging from every batch?
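
Concretely, would something along these lines be the right approach: register the views and the exporter a single time on the driver, and only record measurements from inside foreachBatch? A rough sketch of what I mean, with placeholder measure/view names (not my actual metrics):

from opencensus.ext.azure import metrics_exporter
from opencensus.stats import aggregation as aggregation_module
from opencensus.stats import measure as measure_module
from opencensus.stats import stats as stats_module
from opencensus.stats import view as view_module
from opencensus.tags import tag_map as tag_map_module

stats = stats_module.stats
view_manager = stats.view_manager
stats_recorder = stats.stats_recorder

# placeholder measure/view: rows ingested per micro-batch
rows_measure = measure_module.MeasureInt("rows_ingested", "Rows ingested per micro-batch", "rows")
rows_view = view_module.View(
    "rows_ingested_view", "Rows ingested per micro-batch", [], rows_measure,
    aggregation_module.SumAggregation(),
)

_metrics_initialized = False

def init_metrics(connection_string):
    # called once per driver process, before the stream starts
    global _metrics_initialized
    if _metrics_initialized:
        return
    view_manager.register_view(rows_view)
    exporter = metrics_exporter.new_metrics_exporter(
        connection_string=connection_string, export_interval=12
    )
    view_manager.register_exporter(exporter)
    _metrics_initialized = True

def log_custom_metrics(rows_in_batch):
    # placeholder signature (my real function takes sdid_workload); the point is
    # that it only records a measurement and never registers another exporter
    mmap = stats_recorder.new_measurement_map()
    mmap.measure_int_put(rows_measure, rows_in_batch)
    mmap.record(tag_map_module.TagMap())

If I understand the exporter correctly, it would then push the aggregated values on its own export_interval regardless of how many batches ran. Is that the recommended pattern for foreachBatch, or is there a better way to do this with opencensus-ext-azure?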