
I am using the code snippet below to walk through the folders and files in DBFS using Python:

import os
import re
import pandas as pd

# Walk the DBFS folder and append each matching CSV to the running DataFrames
for subdir, dirs, files in os.walk("/dbfs/data"):
    for file in files:
        if re.search(contrast, file):
            df = pd.read_csv(os.path.join(subdir, file))
            tot_contrast_sh = tot_contrast_sh.append(df, sort=False)
        elif re.search(contrast_rolled, file):
            df = pd.read_csv(os.path.join(subdir, file))
            tot_contrast_rolled_sh = tot_contrast_rolled_sh.append(df, sort=False)

I want to implement the same functionality with Python and pandas, but the folder is located in ADLS. How should I proceed? Is there a way to implement this?

user19930511

2 Answers


To walk through the folders of ADLS in Databricks, first you need to mount the ADLS container to Databricks.

Mount the ADLS to Databricks using a service principal.

To do that, create an app registration in Azure.

Go to Azure Active Directory -> App registrations -> New registration and create one.

App registration overview (screenshot omitted).

Now create a Secret in your App registration.

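Rather than pasting the secret value straight into the notebook, it could be stored in a Databricks secret scope and read with dbutils.secrets.get. A minimal sketch, where the scope and key names are placeholders:

# Hypothetical: read the client secret from a secret scope instead of hard-coding it
client_secret = dbutils.secrets.get(scope="< scope name >", key="< secret key name >")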

Code for mounting:

configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "< your client id >",
           "fs.azure.account.oauth2.client.secret": "< Secret value >",
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/< Directory (tenant) id >/oauth2/token"}

dbutils.fs.mount(
    source = "abfss://< container >@< Storage account >.dfs.core.windows.net/",
    mount_point = "/mnt/< mountpoint >",
    extra_configs = configs)
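
If the cluster may already have this mount, you can check dbutils.fs.mounts() before mounting and skip the call. A minimal sketch, reusing the same placeholder names as above:

# Mount only if the mount point is not already present
if not any(m.mountPoint == "/mnt/< mountpoint >" for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source = "abfss://< container >@< Storage account >.dfs.core.windows.net/",
        mount_point = "/mnt/< mountpoint >",
        extra_configs = configs)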

My mounting (screenshot omitted).

Now you can access the ADLS folders and files with the path /dbfs/mnt/< mountpoint > in your code.
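
For example, the os.walk loop from the question should work unchanged once it points at the mount. A minimal sketch, assuming the data folder sits at the root of the mounted container and reusing the variable names from the question:

import os
import re
import pandas as pd

# Same traversal as in the question, now over the mounted ADLS path
for subdir, dirs, files in os.walk("/dbfs/mnt/< mountpoint >/data"):
    for file in files:
        if re.search(contrast, file):
            df = pd.read_csv(os.path.join(subdir, file))
            tot_contrast_sh = tot_contrast_sh.append(df, sort=False)
        elif re.search(contrast_rolled, file):
            df = pd.read_csv(os.path.join(subdir, file))
            tot_contrast_rolled_sh = tot_contrast_rolled_sh.append(df, sort=False)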

My files in ADLS and the same files as seen from Databricks (screenshots omitted).

Rakesh Govindula
  • Due to security concerns, mounting the ADLS data onto Databricks is not allowed: other use cases share the same Databricks workspace, and if we mount the ADLS to DBFS, those use cases would also have access to that mount point. – user19930511 Sep 22 '22 at 13:30
  • The traversal has to be done on ADLS with an abfss:// path. – user19930511 Sep 22 '22 at 13:31

I developed the code below to achieve this directly on ADLS, without mounting it to DBFS:

import pandas as pd

# List the ADLS folder with dbutils.fs.ls (no mount needed) and read matching CSVs
# with pandas, passing the ADLS credentials through storage_options.
files = dbutils.fs.ls("abfss://data")  # abfss:// path to the data folder
while files:
    file = files.pop(0)
    if file.path.endswith("/"):
        # Directory: queue its contents for traversal
        files.extend(dbutils.fs.ls(file.path))
    elif file.path.endswith(".csv"):
        if saida_import in file.path:
            df = pd.read_csv(file.path, storage_options=adls_cred)
            tot_sh = tot_sh.append(df, sort=False)
        elif tableau_import in file.path:
            df = pd.read_csv(file.path, storage_options=adls_cred)
            tot_tableau_sh = tot_tableau_sh.append(df, sort=False)
        elif funnel_import in file.path:
            df = pd.read_csv(file.path, storage_options=adls_cred)
            tot_funnel_sh = tot_funnel_sh.append(df, sort=False)
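
For completeness, adls_cred above is the fsspec storage_options dictionary that pandas hands to adlfs when it reads abfss:// paths. A minimal sketch of how it could be built for the same service principal; the key names follow adlfs, the values are placeholders, and the adlfs package must be installed on the cluster:

# Hypothetical storage_options dictionary for pandas (requires the adlfs package)
adls_cred = {
    "account_name": "< Storage account >",
    "tenant_id": "< Directory (tenant) id >",
    "client_id": "< your client id >",
    "client_secret": "< Secret value >",
}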
user19930511