
I am using the code snippet below to walk through the folders and files in DBFS using Python:

import os
import re
import pandas as pd

# Walk the DBFS folder and append each matching CSV to the running DataFrames
for subdir, dirs, files in os.walk("/dbfs/data"):
    for file in files:
        if re.search(contrast, file):
            df = pd.read_csv(os.path.join(subdir, file))
            tot_contrast_sh = tot_contrast_sh.append(df, sort=False)
        elif re.search(contrast_rolled, file):
            df = pd.read_csv(os.path.join(subdir, file))
            tot_contrast_rolled_sh = tot_contrast_rolled_sh.append(df, sort=False)

I want to implement the same functionality with Python and pandas, but the folder is located in ADLS. How should I proceed? Is there a way to implement this?

user19930511

2 Answers


To walk through the folders of ADLS in Databricks, first you need to mount the ADLS container to Databricks.

Mount the ADLS to Databricks using a service principal.

To do that, create an app registration in Azure.

Go to Azure Active Directory -> App registrations -> New registration and create one.

App registration overview (screenshot omitted).

Now create a Secret in your App registration.

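Rather than pasting the secret value straight into the notebook, it could be stored in a Databricks secret scope and read with dbutils.secrets.get. A minimal sketch, where the scope and key names are placeholders:

# Hypothetical: read the client secret from a secret scope instead of hard-coding it
client_secret = dbutils.secrets.get(scope="< scope name >", key="< secret key name >")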

Code for mounting:

configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "< your client id >",
           "fs.azure.account.oauth2.client.secret": "< Secret value >",
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/< Directory (tenant) id >/oauth2/token"}

dbutils.fs.mount(
    source = "abfss://< container >@< Storage account >.dfs.core.windows.net/",
    mount_point = "/mnt/< mountpoint >",
    extra_configs = configs)
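
If the cluster may already have this mount, you can check dbutils.fs.mounts() before mounting and skip the call. A minimal sketch, reusing the same placeholder names as above:

# Mount only if the mount point is not already present
if not any(m.mountPoint == "/mnt/< mountpoint >" for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source = "abfss://< container >@< Storage account >.dfs.core.windows.net/",
        mount_point = "/mnt/< mountpoint >",
        extra_configs = configs)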

My mounting (screenshot omitted).

Now you can access the ADLS folders and files with the path /dbfs/mnt/< mountpoint > in your code.
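
For example, the os.walk loop from the question should work unchanged once it points at the mount. A minimal sketch, assuming the data folder sits at the root of the mounted container and reusing the variable names from the question:

import os
import re
import pandas as pd

# Same traversal as in the question, now over the mounted ADLS path
for subdir, dirs, files in os.walk("/dbfs/mnt/< mountpoint >/data"):
    for file in files:
        if re.search(contrast, file):
            df = pd.read_csv(os.path.join(subdir, file))
            tot_contrast_sh = tot_contrast_sh.append(df, sort=False)
        elif re.search(contrast_rolled, file):
            df = pd.read_csv(os.path.join(subdir, file))
            tot_contrast_rolled_sh = tot_contrast_rolled_sh.append(df, sort=False)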

My files in ADLS and the same files as seen from Databricks (screenshots omitted).

Rakesh Govindula
  • Due to security concerns, mounting the ADLS data onto Databricks is not allowed: other use cases share the same Databricks workspace, and if we mount the ADLS to DBFS, those use cases would also have access to that mount point. – user19930511 Sep 22 '22 at 13:30
  • The traversal has to be done on ADLS with an abfss:// path. – user19930511 Sep 22 '22 at 13:31

I developed the code below to achieve this directly on ADLS, without mounting it to DBFS:

import pandas as pd

# List the ADLS folder with dbutils.fs.ls (no mount needed) and read matching CSVs
# with pandas, passing the ADLS credentials through storage_options.
files = dbutils.fs.ls("abfss://data")  # abfss:// path to the data folder
while files:
    file = files.pop(0)
    if file.path.endswith("/"):
        # Directory: queue its contents for traversal
        files.extend(dbutils.fs.ls(file.path))
    elif file.path.endswith(".csv"):
        if saida_import in file.path:
            df = pd.read_csv(file.path, storage_options=adls_cred)
            tot_sh = tot_sh.append(df, sort=False)
        elif tableau_import in file.path:
            df = pd.read_csv(file.path, storage_options=adls_cred)
            tot_tableau_sh = tot_tableau_sh.append(df, sort=False)
        elif funnel_import in file.path:
            df = pd.read_csv(file.path, storage_options=adls_cred)
            tot_funnel_sh = tot_funnel_sh.append(df, sort=False)
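
For completeness, adls_cred above is the fsspec storage_options dictionary that pandas hands to adlfs when it reads abfss:// paths. A minimal sketch of how it could be built for the same service principal; the key names follow adlfs, the values are placeholders, and the adlfs package must be installed on the cluster:

# Hypothetical storage_options dictionary for pandas (requires the adlfs package)
adls_cred = {
    "account_name": "< Storage account >",
    "tenant_id": "< Directory (tenant) id >",
    "client_id": "< your client id >",
    "client_secret": "< Secret value >",
}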
user19930511