My source is Azure Data Factory, which is copying files to containerA --> FolderA, FolderB, FolderC. I am using the syntax below with Auto Loader, and I need to read the files as they arrive in any one of these folders.
I have mounted only up to the container level of the storage account:
dbutils.fs.mount(
  source = "abfss://containerA@storageaccount.dfs.core.windows.net/",
  mount_point = "/mnt/containerA/",
  extra_configs = configs)
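For context, configs is the standard OAuth service-principal dictionary for ADLS Gen2 (the application id, secret scope/key, and tenant id below are placeholders for my actual values):

configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<secret-key>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}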
Streaming code:
from pyspark.sql.functions import input_file_name

df1 = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.useNotifications", "true") \
    .option("cloudFiles.subscriptionId", "xxxx-xxxx-xxxx-xxxx-xxx") \
    .option("cloudFiles.tenantId", "xxxx-1cxxxx98-xxxx-xxxx-xxxx") \
    .option("cloudFiles.clientId", "xxxx-xx-46d8-xx-xxx") \
    .option("cloudFiles.clientSecret", "xxxxxxxxxx") \
    .option("cloudFiles.resourceGroup", "xxxx-xxx") \
    .schema(Userdefineschema) \
    .load("/mnt/containerA/") \
    .withColumn("rawFilePath", input_file_name())
The above code always creates a new queue. Is there any way to give the queue a name?
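From what I have read, there may be a cloudFiles.queueName option to point Auto Loader at an existing queue instead of creating a new one each run. Would adding something like this be the right approach (the queue name here is just a placeholder)?

    .option("cloudFiles.queueName", "adf-autoloader-queue") \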
Issue: when I start my stream and ADF copies data to FolderA, streaming runs fine. But when ADF then starts copying data to FolderB, the streaming query does not fetch the records present in FolderB within the same streaming session. When I stop the streaming cell and start it again, it picks up the data from both FolderA and FolderB. My objective is for Auto Loader to process files automatically as they arrive in any of the folders.
Kindly advise; I am new to Spark streaming.
Thanks,
Anuj Gupta