
I only have access to signed (SAS) HTTPS URLs for the CSV files, a separate URL for each file, e.g.:

https://<storage_account_name>.blob.core.windows.net/<container_name>/<folder_name>/<file_name>.csv?sig=****&st=****&se=****&sv=****&sp=r&sr=b

Below is the code I am using:

from pyspark.sql import SparkSession

# One session is enough; getOrCreate() reuses the active session anyway,
# so creating it inside the loop has no effect.
spark = SparkSession.builder.appName("test").getOrCreate()
storage_account_name = '***'
container_name = '***'

for blob_url in paths:
    # Split the signed URL into the bare blob URL and the SAS token.
    url, sas_token = blob_url.split("?", 1)
    access_key = '?' + sas_token  # tried without the leading '?' as well
    conf_path = ("fs.azure.sas." + container_name + "."
                 + storage_account_name + ".blob.core.windows.net")
    spark.conf.set(conf_path, access_key)
    blob_path = ("wasbs://" + container_name + "@" + storage_account_name
                 + ".blob.core.windows.net/" + url.split(".net/")[1])
    df = spark.read.csv(blob_path, header=False, inferSchema=True)
    df.show()
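For clarity, the URL-splitting logic inside the loop can be isolated and checked on its own. This is plain Python with no Spark involved; the signed URL below is a made-up placeholder, not a real SAS URL:

```python
def split_signed_url(blob_url):
    """Split a signed blob URL into (bare blob URL, SAS token without '?')."""
    bare_url, _, sas_token = blob_url.partition("?")
    return bare_url, sas_token

bare, sas = split_signed_url(
    "https://myaccount.blob.core.windows.net/mycontainer/folder/file.csv"
    "?sig=abc&st=2024-01-01&se=2024-01-02&sv=2021-06-08&sp=r&sr=b"
)
# bare -> "https://myaccount.blob.core.windows.net/mycontainer/folder/file.csv"
# sas  -> "sig=abc&st=2024-01-01&se=2024-01-02&sv=2021-06-08&sp=r&sr=b"

# The container-relative path used to build the wasbs:// URI:
relative_path = bare.split(".net/")[1]
# relative_path -> "mycontainer/folder/file.csv"
```

This confirms the token and path are being extracted as expected, so the failure looks like a configuration/caching issue rather than a parsing one.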

The first file reads successfully; every subsequent read fails. Even if I change the order of the files, only the first one succeeds. I have tried stopping the Spark session on every iteration, and I have tried giving the session a different app name each time. Nothing seems to work.

The same code works in Databricks but does not work in Dataproc.

I want to read the files in sequence and persist them somewhere, but I am not able to do so.
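As a sanity check, I can fetch each file outside Spark by downloading it directly over its signed URL. A minimal sketch of that workaround (the helper name is my own; `signed_url` is assumed to be one of the full SAS URLs from `paths`):

```python
import csv
import io
from urllib.request import urlopen

def fetch_csv_rows(signed_url):
    """Download a CSV over its signed URL and return its rows as lists of strings."""
    with urlopen(signed_url) as resp:
        text = resp.read().decode("utf-8")
    return list(csv.reader(io.StringIO(text)))

# rows = fetch_csv_rows(signed_url)  # each row is a list of column values
```

So the SAS URLs themselves are valid; the problem only appears when going through the wasbs:// connector in the loop.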

Error: py4j.protocol.Py4JJavaError: An error occurred while calling o68.csv.
: org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
