I feel like I am going crazy with this. I have tested a data pipeline on my standard compute cluster, loading new files as a batch from a Google Cloud Storage bucket. Auto Loader works exactly as expected when I run the notebook on that cluster. I then used the same notebook as the first task in a workflow running on a new job cluster. To test the pipeline as a workflow, I first removed all checkpoint files and directories before starting the run, using this command:
dbutils.fs.rm(checkpoint_path, True)
For some reason, the code works perfectly when I test it interactively, but when it runs as a workflow I get "streaming stopped" and no data from Auto Loader. Here is my Auto Loader config:
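from pyspark.sql.functions import input_file_name

file_path = "gs://raw_zone_twitter"
table_name = "twitter_data_autoloader"
checkpoint_path = "/tmp/_checkpoint/twitter_checkpoint"

# Start from a clean table on every run
spark.sql(f"DROP TABLE IF EXISTS {table_name}")

# Read new files with Auto Loader and write them to the table in a single batch
query = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "text")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(file_path)
    .withColumn("filePath", input_file_name())  # record the source file of each row
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(once=True)  # process everything available, then stop
    .toTable(table_name))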
When running this as a workflow I see that the checkpoint directory is created, but there is no data inside.
The code is exactly the same in both cases (it is the same notebook), whether I run it on my compute cluster or as the task in my workflow, so I really have no idea why Auto Loader is not working within the workflow.
To confirm, I am using the same IAM service account for both my compute cluster and my job cluster.
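One thing I have not ruled out yet: `toTable` starts the query and returns a `StreamingQuery` without waiting for it to finish, so in a job the notebook may reach its end before the batch has been processed. This is only a guess on my part. Below is a minimal sketch of how I could block on the query and check how many rows the last batch read. `wait_for_stream` and `timeout_s` are hypothetical names of my own, and the function assumes it is given the `StreamingQuery` returned by `toTable`.

```python
def wait_for_stream(query, timeout_s=600):
    """Block until the streaming query stops and return the row count
    of its last micro-batch (0 if no batch ran).

    `query` is assumed to behave like pyspark's StreamingQuery.
    """
    # With a timeout, awaitTermination returns True if the query
    # stopped within that time and False otherwise.
    finished = query.awaitTermination(timeout_s)
    if not finished:
        raise TimeoutError(f"stream still running after {timeout_s}s")

    # lastProgress is None when the query never reported a micro-batch.
    progress = query.lastProgress
    if progress is None:
        return 0
    return progress["numInputRows"]
```

In the notebook this would be called as `rows = wait_for_stream(query)` right after the `toTable` call, so that the task does not finish until the batch has been written, and a result of 0 would tell me that Auto Loader found no new files.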