I only have access to signed HTTPS URLs (SAS URLs) for the CSV files, a separate URL for each file, e.g.:
https://<storage_account_name>.blob.core.windows.net/<container_name>/<folder_name>/<file_name>.csv?sig=****&st=****&se=****&sv=****&sp=r&sr=b
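For illustration, each URL splits cleanly into a plain blob URL plus a SAS token (a minimal sketch using urllib.parse; the account, container, and file names here are made up):

from urllib.parse import urlsplit

blob_url = "https://myaccount.blob.core.windows.net/mycontainer/myfolder/data.csv?sv=***&sp=r&sr=b&sig=***"
parts = urlsplit(blob_url)
base_url = parts.scheme + "://" + parts.netloc + parts.path  # URL without the token
sas_token = parts.query                                      # everything after the '?'
account_name = parts.netloc.split(".")[0]                    # storage account name
container_name = parts.path.lstrip("/").split("/")[0]        # container name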
Below is the code I am using:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()
storage_account_name = '***'
container_name = '***'

for blob_url in paths:
    # Strip the SAS token off the signed URL
    url = blob_url.split("?")[0]
    sas_token = '?' + blob_url.split("?")[1]  # tried without the leading '?' as well
    # Register the SAS token for this container/account with the WASB driver
    conf_key = "fs.azure.sas." + container_name + "." + storage_account_name + ".blob.core.windows.net"
    spark.conf.set(conf_key, sas_token)
    # Rebuild the same blob as a wasbs:// path
    blob_path = ("wasbs://" + container_name + "@" + storage_account_name +
                 ".blob.core.windows.net/" + url.split(".net/")[1])
    df = spark.read.csv(blob_path, header=False, inferSchema=True)
    df.show()
The first file is read successfully; every subsequent read fails. Even if I change the order of the files, only the first one succeeds. I have tried stopping the Spark session on every iteration of the loop, and I have tried giving the Spark session a different name every time. Nothing seems to work.
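For reference, this is roughly the stop-and-recreate variant I tried (unique app name per file, session stopped at the end of each iteration); it fails in exactly the same way:

for i, blob_url in enumerate(paths):
    # Fresh session with a unique app name for every file
    spark = SparkSession.builder.appName("test_" + str(i)).getOrCreate()
    url = blob_url.split("?")[0]
    sas_token = '?' + blob_url.split("?")[1]
    conf_key = "fs.azure.sas." + container_name + "." + storage_account_name + ".blob.core.windows.net"
    spark.conf.set(conf_key, sas_token)
    blob_path = ("wasbs://" + container_name + "@" + storage_account_name +
                 ".blob.core.windows.net/" + url.split(".net/")[1])
    spark.read.csv(blob_path, header=False, inferSchema=True).show()
    spark.stop()  # tear the session down before moving to the next file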
The same code works on Databricks but does not work on Dataproc.
I want to read the files in sequence and persist them somewhere, but I am not able to do so.
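The persistence step itself is straightforward once a read succeeds; a minimal sketch of what I am aiming for (the gs:// output path is a placeholder, since the cluster runs on Dataproc):

def persist_csv(spark, blob_path, out_path):
    # Read one CSV over wasbs:// and append it to a Parquet dataset
    df = spark.read.csv(blob_path, header=False, inferSchema=True)
    df.write.mode("append").parquet(out_path)

# e.g. persist_csv(spark, blob_path, "gs://<output_bucket>/output/")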
Error: py4j.protocol.Py4JJavaError: An error occurred while calling o68.csv.
: org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.