I am new to Databricks and PySpark, and I am debugging a pipeline that uses Auto Loader. I expect 10 files to arrive in a storage account every 2 hours, and I have confirmed that they do arrive.
I would like to know which files were processed last, and the max(timestamp) of the records processed, to determine whether Auto Loader is actually reading those files. The records in the source files have a timestamp column ('UTC Time Stamp').
connection_string = "*****************"
file_location = "*****************"
file_type = "csv"

# Plain batch read of the same location (I use this separately to inspect the files)
df = (spark.read.format(file_type)
      .option("header", "true")
      .option("inferSchema", "true")
      .load(file_location))

# Auto Loader stream in file-notification mode
# (dataset_schema is defined earlier in the notebook)
df_stream_in = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.useNotifications", True)
    .option("cloudFiles.connectionString", connection_string)
    .option("cloudFiles.resourceGroup", "*****************")
    .option("cloudFiles.subscriptionId", "*****************")
    .option("cloudFiles.tenantId", "*****************")
    .option("cloudFiles.clientId", "*****************")
    .option("cloudFiles.clientSecret", "*****************")
    .option("cloudFiles.region", "*****************")
    .option("cloudFiles.includeExistingFiles", True)  # also pick up files already in the location
    .option("header", "true")
    .schema(dataset_schema)
    .load(file_location))
df_stream_in then goes through a series of transformations (df_stream_transformed = df_stream_in.withColumn(...)) and is finally written to a Delta table. The Delta table doesn't have any records from 2023, so I want to check whether Auto Loader is reading the latest files before I look further.
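For context, the write step is roughly of this shape (simplified; the table name and checkpoint path below are placeholders, not my real values):

# Simplified sketch of the write step; names and paths are placeholders
(df_stream_transformed.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/target_table")  # placeholder path
    .outputMode("append")
    .table("target_table"))  # placeholder table name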
I saw from "Querying Files Discovered by AutoLoader" that the following SQL can be used, but as you can see, the readStream code above doesn't have that checkpoint option.
SELECT * FROM cloud_files_state('path/to/checkpoint');
Ultimately, I want to confirm that the files are being read by Auto Loader; I am just not sure how to get the filenames.
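One idea I'm considering (I'm not sure whether this is the right approach) is to tag each record with its source file and a processing timestamp during the transformations, so the target Delta table itself shows which file each row came from. input_file_name() is standard PySpark here, not something already in my code:

from pyspark.sql import functions as F

# Idea (untested): record the source file path and a processing timestamp
# on every row, so the target Delta table shows which files were read
df_stream_tagged = (df_stream_in
    .withColumn("source_file", F.input_file_name())
    .withColumn("processed_at", F.current_timestamp()))

With that in place, I could group the target table by source_file to see the latest files and the max of 'UTC Time Stamp' per file. Is that a reasonable way to confirm the files are being read, or is cloud_files_state the right tool here?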