
I am using a Python notebook to mount ADLS storage onto DBFS. Now I want to add this to an init script so the mount is created when the job cluster starts.

This is the Python code I am using. How can I make it run as the init script?

environment = "development"
scopeCredentials = "test-" + environment

# Secrets
# ADLS
app_id = dbutils.secrets.get(scope=scopeCredentials, key="app_id")
key = dbutils.secrets.get(scope=scopeCredentials, key="key")
adls_name = dbutils.secrets.get(scope=scopeCredentials, key="adls-name")

# Configs
# ADLS
adls_configs = {
  "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
  "dfs.adls.oauth2.client.id": app_id, #id is the AppId of the service principal
  "dfs.adls.oauth2.credential": key,
  "dfs.adls.oauth2.refresh.url": "url"
}

mount_point="mount_point"
if any(mount.mountPoint == mount_point for mount in dbutils.fs.mounts()):
  print("Storage: " + mount_point + " already mounted")
else:
  try:
    dbutils.fs.mount(
      source = "source",
      mount_point = "mount_point",
      extra_configs = adls_configs)
    print("Storage: " + mount_point + " successfully mounted")
  except Exception as e:
    print("Storage: " + mount_point + " not mounted: " + str(e))

Any idea how to change this into a bash init script?

scalacode

1 Answer


Mounting of the storage needs to be done only once, or when you change the credentials of the service principal. Unmounting & remounting during job execution may lead to problems when somebody else is using that mount from another cluster.

If you really want to access the storage only from that cluster, then you need to configure these properties in the cluster's Spark conf and access the data directly using abfss://... URIs (see the docs for details). Mounting the storage just for the duration of the cluster's run doesn't make sense from a security perspective, because during that time anyone in the workspace can access the mounted data, as a mount is global, not local to a cluster.
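
As a minimal sketch, assuming ADLS Gen2 (which the abfss:// scheme implies) and the same service principal and secret scope as in your question, the cluster's Spark conf could contain entries like the following; the storage account, tenant ID, container, and path are placeholders you need to fill in (for Gen1 you would set the corresponding dfs.adls.oauth2.* properties from your notebook instead):

fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net {{secrets/test-development/app_id}}
fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/test-development/key}}
fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<tenant-id>/oauth2/token

With that in place, any notebook or job running on the cluster can read the data directly, without a mount:

# hypothetical container and path, shown only to illustrate the abfss URI form
df = spark.read.parquet("abfss://<container>@<storage-account>.dfs.core.windows.net/<path>")

The {{secrets/<scope>/<key>}} syntax references Databricks secrets from the Spark conf, so the client ID and secret never appear in plain text in the cluster configuration.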

Alex Ott