
I am trying to read a file mydir/mycsv.csv from Azure Data Lake Storage Gen1 in a Databricks notebook, using the following syntax (inspired by the documentation):

configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": "123abc-1e42-31415-9265-12345678",
           "dfs.adls.oauth2.credential": dbutils.secrets.get(scope = "adla", key = "adlamaywork"),
           "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/123456abc-2718-aaaa-9999-42424242abc/oauth2/token"}

dbutils.fs.mount(
  source = "adl://myadls.azuredatalakestore.net/mydir",
  mount_point = "/mnt/adls",
  extra_configs = configs)

post_processed = spark.read.csv("/mnt/adls/mycsv.csv").toPandas()

post_processed.head(10).to_csv("/dbfs/processed.csv")

dbutils.fs.unmount("/mnt/adls")

My service principal with client ID 123abc-1e42-31415-9265-12345678 has access to the Data Lake Storage account myadls, and I have created the secret with

databricks secrets put --scope adla --key adlamaywork

When I execute the PySpark code above in the Databricks notebook, accessing the csv file with spark.read.csv fails with

com.microsoft.azure.datalake.store.ADLException: Error getting info for file /mydir/mycsv.csv

When navigating DBFS with dbfs ls dbfs:/mnt/adls, the parent mount point seems to be there, but I get

Error: b'{"error_code":"IO_ERROR","message":"Error fetching access token\nLast encountered exception thrown after 1 tries [HTTP0(null)]"}'

What am I doing wrong?

Davide Fiocco
  • is this transient/intermittent, or reliable? – MartinJaffer-MSFT Aug 28 '19 at 17:09
  • I was getting this error consistently, to the point I think I am doing something fundamentally wrong in some of the things I try above. – Davide Fiocco Aug 28 '19 at 20:16
  • Did you ever get around this? I'm having exactly the same issue. @MartinJaffer-MSFT did you ever raise this to product? – Umar.H Nov 27 '19 at 10:02
  • @Datanovice I dropped this eventually, and didn't try any further :( Else I would have answered myself! ;) – Davide Fiocco Nov 27 '19 at 12:31
  • It's very annoying; I've had to write several custom functions to work with the Gen1 lake which would be made totally redundant if I could use any of the Python libraries with the file system. I wonder if this issue is across the board..! Thanks for the update nonetheless! – Umar.H Nov 27 '19 at 12:36

1 Answer


If you do not necessarily need to mount the directory into DBFS, you could try to read directly from ADLS, like this:

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.access.token.provider", "org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider")
spark.conf.set("dfs.adls.oauth2.client.id", "123abc-1e42-31415-9265-12345678")
spark.conf.set("dfs.adls.oauth2.credential", dbutils.secrets.get(scope = "adla", key = "adlamaywork"))
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/123456abc-2718-aaaa-9999-42424242abc/oauth2/token")

csvFile = "adl://myadls.azuredatalakestore.net/mydir/mycsv.csv"

df = spark.read.format('csv').options(header='true', inferschema='true').load(csvFile)
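
If you still want to write the first few rows back out to DBFS as in the question, a minimal follow-up sketch (assuming the read above succeeds and pandas is available on the cluster) could be:

# take the first 10 rows of the Spark DataFrame and save them as a csv under /dbfs
df.limit(10).toPandas().to_csv("/dbfs/processed.csv", index=False)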
Riccardo Bucco