My source is Azure Data Factory, which is copying files to containerA --> FolderA, FolderB, FolderC. I am using the syntax below with Auto Loader, and I need to read the files as they arrive in any one of these folders.
I have mounted only up to the container level of the storage account:
dbutils.fs.mount(
  source = "abfss://containerA@storageaccount.dfs.core.windows.net/",
  mount_point = "/mnt/containerA/",
  extra_configs = configs)
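For context, configs is the standard OAuth service-principal dictionary for ADLS Gen2 (the application id, secret scope/key, and tenant id below are placeholders for my actual values):

configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<secret-key>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}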
Streaming code:
from pyspark.sql.functions import input_file_name

df1 = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.useNotifications", "true") \
    .option("cloudFiles.subscriptionId", "xxxx-xxxx-xxxx-xxxx-xxx") \
    .option("cloudFiles.tenantId", "xxxx-1cxxxx98-xxxx-xxxx-xxxx") \
    .option("cloudFiles.clientId", "xxxx-xx-46d8-xx-xxx") \
    .option("cloudFiles.clientSecret", "xxxxxxxxxx") \
    .option("cloudFiles.resourceGroup", "xxxx-xxx") \
    .schema(Userdefineschema) \
    .load("/mnt/containerA/") \
    .withColumn("rawFilePath", input_file_name())
The above code always creates a new queue. Is there any way to give the queue a name?
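From what I have read, there may be a cloudFiles.queueName option to point Auto Loader at an existing queue instead of creating a new one each run. Would adding something like this be the right approach (the queue name here is just a placeholder)?

    .option("cloudFiles.queueName", "adf-autoloader-queue") \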
Issue: when I start my stream and ADF copies data to FolderA, streaming runs fine. But when ADF then starts copying data to FolderB, the streaming query does not fetch the records present in FolderB within the same streaming session. When I stop the streaming cell and start it again, it picks up the data from both FolderA and FolderB. My objective is for Auto Loader to process files automatically as they arrive in any of the folders.
Kindly advise; I am new to Spark streaming.
Thanks,
Anuj Gupta