
I am new to Databricks and PySpark, and I am debugging some code.

I am trying to debug a pipeline that uses Auto Loader. I expect 10 files to arrive in a storage account every 2 hours, and I have confirmed that they do arrive.

I would like to know which files were processed last, and the max(timestamp) of the records processed, to determine whether Auto Loader is actually reading those files. The records in the source files have a timestamp column ('UTC Time Stamp').

connection_string = "*****************"
file_location = "*****************"
file_type = "csv"
# Batch read of the source location (separate from the streaming read below)
df = spark.read.format(file_type).option("header", "true").option("inferSchema", "true").load(file_location)

# Auto Loader (cloudFiles) streaming read using file notification mode
df_stream_in = spark.readStream.format("cloudFiles")\
            .option("cloudFiles.format", "csv")\
            .option("cloudFiles.useNotifications", True)\
            .option("cloudFiles.connectionString", connection_string)\
            .option("cloudFiles.resourceGroup", "*****************")\
            .option("cloudFiles.subscriptionId", "*****************")\
            .option("cloudFiles.tenantId", "*****************")\
            .option("cloudFiles.clientId", "*****************")\
            .option("cloudFiles.clientSecret", "*****************")\
            .option("cloudFiles.region", "*****************")\
            .option("cloudFiles.includeExistingFiles", True)\
            .option("header", "true")\
            .schema(dataset_schema)\
            .load(file_location)

df_stream_in then undergoes transformations (df_stream_transformed = df_stream_in.withColumn(...)) and is finally written to a Delta table. The Delta table doesn't have any records from 2023, so I want to confirm that Auto Loader is reading the latest files before I dig further.
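For reference, this is the kind of check I have in mind for the latest timestamp that has landed in the Delta table (the table name below is just a placeholder for my real table):

from pyspark.sql import functions as F

# Placeholder table name; shows the most recent 'UTC Time Stamp' that actually landed
delta_df = spark.read.table("my_delta_table")
delta_df.agg(F.max("UTC Time Stamp").alias("latest_utc_timestamp")).show()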

I saw in Querying Files Discovered by Auto Loader that the following SQL can be used, but as you can see, the code above doesn't set a checkpoint option anywhere.

SELECT * FROM cloud_files_state('path/to/checkpoint');

Ultimately, I want to confirm that the files are being read by Auto Loader, but I am not sure how to get the file names.
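Would something like the following be a reasonable way to tag each record with its source file? This is just a sketch; the column name is arbitrary:

from pyspark.sql import functions as F

# Add the source file path to every record read by Auto Loader,
# so the Delta table shows which files were actually ingested
df_with_file = df_stream_in.withColumn("source_file", F.input_file_name())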

learner

1 Answer


The checkpoint must be set when you write your stream out somewhere, for example:

df_stream_in.writeStream \
  .format("delta") \
  .option("checkpointLocation", "/path/to/checkpoint") \
  .start("/output/path")

This is because a single input stream can be used to produce multiple outputs, and each output tracks its own state (including which files it has processed) separately from the others.
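Once that checkpoint location exists, the cloud_files_state query from the question can be pointed at it to list the files Auto Loader has discovered, e.g. (using the same placeholder path as above):

# Lists the files Auto Loader has recorded in this stream's checkpoint,
# including the path of each discovered file
spark.sql("SELECT * FROM cloud_files_state('/path/to/checkpoint')").show(truncate=False)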

Alex Ott
  • Could you please elaborate a bit more on the last statement? What do you mean by a single input being used by multiple outputs? – learner Mar 12 '23 at 11:22
  • Yes. For example, if you split the stream into two with something like `good = df.filter(...); good.writeStream....` and `bad = df.filter(...); bad.writeStream....`, then they will require two independent checkpoints. – Alex Ott Mar 12 '23 at 11:51
  • https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#starting-streaming-queries – Alex Ott Mar 12 '23 at 11:52
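For reference, the two-output split mentioned in the comments would look roughly like this (the filters, output paths, and checkpoint locations are placeholders):

# One input stream, two outputs, each with its own independent checkpoint
good = df_stream_in.filter("status = 'good'")
bad = df_stream_in.filter("status != 'good'")

good.writeStream \
  .format("delta") \
  .option("checkpointLocation", "/checkpoints/good") \
  .start("/output/good")

bad.writeStream \
  .format("delta") \
  .option("checkpointLocation", "/checkpoints/bad") \
  .start("/output/bad")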