I only have access to signed HTTPS URLs (SAS URLs) for the CSV files, a separate URL for each file, e.g.:
https://<storage_account_name>.blob.core.windows.net/<container_name>/<folder_name>/<file_name>.csv?sig=****&st=****&se=****&sv=****&sp=r&sr=b
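For illustration, each URL splits cleanly into a plain blob URL plus a SAS token (a minimal sketch using urllib.parse; the account, container, and file names here are made up):

from urllib.parse import urlsplit

blob_url = "https://myaccount.blob.core.windows.net/mycontainer/myfolder/data.csv?sv=***&sp=r&sr=b&sig=***"
parts = urlsplit(blob_url)
base_url = parts.scheme + "://" + parts.netloc + parts.path  # URL without the token
sas_token = parts.query                                      # everything after the '?'
account_name = parts.netloc.split(".")[0]                    # storage account name
container_name = parts.path.lstrip("/").split("/")[0]        # container name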
Below is the code I am using:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()
storage_account_name = '***'
container_name = '***'

for blob_url in paths:
    # Strip the SAS token off the signed URL
    url = blob_url.split("?")[0]
    sas_token = '?' + blob_url.split("?")[1]  # tried without the leading '?' as well
    # Register the SAS token for this container/account with the WASB driver
    conf_key = "fs.azure.sas." + container_name + "." + storage_account_name + ".blob.core.windows.net"
    spark.conf.set(conf_key, sas_token)
    # Rebuild the same blob as a wasbs:// path
    blob_path = ("wasbs://" + container_name + "@" + storage_account_name +
                 ".blob.core.windows.net/" + url.split(".net/")[1])
    df = spark.read.csv(blob_path, header=False, inferSchema=True)
    df.show()
The first file is read successfully; every subsequent read fails. Even if I change the order of the files, only the first one succeeds. I have tried stopping the Spark session on every iteration of the loop, and I have tried giving the Spark session a different name every time. Nothing seems to work.
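For reference, this is roughly the stop-and-recreate variant I tried (unique app name per file, session stopped at the end of each iteration); it fails in exactly the same way:

for i, blob_url in enumerate(paths):
    # Fresh session with a unique app name for every file
    spark = SparkSession.builder.appName("test_" + str(i)).getOrCreate()
    url = blob_url.split("?")[0]
    sas_token = '?' + blob_url.split("?")[1]
    conf_key = "fs.azure.sas." + container_name + "." + storage_account_name + ".blob.core.windows.net"
    spark.conf.set(conf_key, sas_token)
    blob_path = ("wasbs://" + container_name + "@" + storage_account_name +
                 ".blob.core.windows.net/" + url.split(".net/")[1])
    spark.read.csv(blob_path, header=False, inferSchema=True).show()
    spark.stop()  # tear the session down before moving to the next file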
The same code works on Databricks but does not work on Dataproc.
I want to read the files in sequence and persist them somewhere, but I am not able to do so.
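The persistence step itself is straightforward once a read succeeds; a minimal sketch of what I am aiming for (the gs:// output path is a placeholder, since the cluster runs on Dataproc):

def persist_csv(spark, blob_path, out_path):
    # Read one CSV over wasbs:// and append it to a Parquet dataset
    df = spark.read.csv(blob_path, header=False, inferSchema=True)
    df.write.mode("append").parquet(out_path)

# e.g. persist_csv(spark, blob_path, "gs://<output_bucket>/output/")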
Error: py4j.protocol.Py4JJavaError: An error occurred while calling o68.csv.
: org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.