
I'm able to establish a connection to my Databricks FileStore (DBFS) and access the filestore.

Reading, writing, and transforming data with PySpark is possible, but when I try to use a local Python API such as pathlib or the os module, I am unable to get past the first level of the DBFS file system.

I can use a magic command:

%fs ls dbfs:/mnt/my_fs/... which works perfectly and lists all the child directories.

but if I do os.listdir('/dbfs/mnt/my_fs/') it returns ['mount.err'].
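To make the comparison concrete, this is the minimal repro I run in a notebook cell (same mount path as above; dbutils and display are the notebook built-ins):

import os

# Listing through dbutils / the DBFS API works and shows the child directories
display(dbutils.fs.ls("dbfs:/mnt/my_fs/"))

# Listing the same mount through the local /dbfs view only returns ['mount.err']
print(os.listdir("/dbfs/mnt/my_fs/"))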

I've tested this on a new cluster and the result is the same

I'm using Python on Databricks Runtime 6.1 with Apache Spark 2.4.4.

Is anyone able to advise?

Edit:

Connection script:

I've used the Databricks CLI library to store my credentials, which are formatted according to the Databricks documentation:

def initialise_connection(secrets_func):
    configs = secrets_func()

    # Check if the mount exists
    bMountExists = False
    for item in dbutils.fs.ls("/mnt/"):
        if str(item.name) == r"WFM/":
            bMountExists = True

    # Drop the mount if it exists, to refresh credentials
    if bMountExists:
        dbutils.fs.unmount("/mnt/WFM")
        bMountExists = False

    # Mount the drive
    if not bMountExists:
        dbutils.fs.mount(
            source="adl://test.azuredatalakestore.net/WFM",
            mount_point="/mnt/WFM",
            extra_configs=configs
        )
        print("Drive mounted")
    else:
        print("Drive already mounted")
Umar.H

3 Answers


We experienced this issue when the same container was mounted to two different paths in the workspace. Unmounting all and remounting resolved our issue. We were using Databricks version 6.2 (Spark 2.4.4, Scala 2.11). Our blob store container config:

  • Performance/Access tier: Standard/Hot
  • Replication: Read-access geo-redundant storage (RA-GRS)
  • Account kind: StorageV2 (general purpose v2)

Notebook script to run to unmount all mounts in /mnt:

# Iterate through all mounts and unmount those beginning with /mnt/
print('Unmounting all mounts beginning with /mnt/')
for mount in dbutils.fs.mounts():
    if mount.mountPoint.startswith('/mnt/'):
        dbutils.fs.unmount(mount.mountPoint)

# Re-list all mount points
print('Re-listing all mounts')
display(dbutils.fs.mounts())
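After everything is unmounted, remount the container. A rough remount sketch for a blob store container like the one described above (storage account, container, and secret scope names are all placeholders):

# Remount the blob container (all names below are placeholders)
storage_account = "<storage-account>"
container = "<container>"

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point=f"/mnt/{container}",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope>", key="<storage-account-key>")
    }
)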

Minimal job to test on an automated job cluster

Assuming you have a separate process to create the mounts, create a job definition (job.json) to run a Python script on an automated cluster:

{
  "name": "Minimal Job",
  "new_cluster": {
    "spark_version": "6.2.x-scala2.11",
    "spark_conf": {},
    "node_type_id": "Standard_F8s",
    "driver_node_type_id": "Standard_F8s",
    "num_workers": 2,
    "enable_elastic_disk": true,
    "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    }
  },
  "timeout_seconds": 14400,
  "max_retries": 0,
  "spark_python_task": {
    "python_file": "dbfs:/minimal/job.py"
  }
}

Python file (job.py) to print out mounts:

import os

path_mounts = '/dbfs/mnt/'
print(f"Listing contents of {path_mounts}:")
print(os.listdir(path_mounts))

path_mount = path_mounts + 'YOURCONTAINERNAME'
print(f"Listing contents of {path_mount }:")
print(os.listdir(path_mount))

Run the Databricks CLI commands below to run the job. View the Spark driver logs for the output, confirming that mount.err does not appear.

databricks fs mkdirs dbfs:/minimal
databricks fs cp job.py dbfs:/minimal/job.py --overwrite
databricks jobs create --json-file job.json
databricks jobs run-now --job-id <JOBID FROM LAST COMMAND>
danialk
  • Thanks, for us this was due to something being changed in the Databricks API from 5.5 to 6.0; that said, I got around it using `dbutils`, but it wasn't fun. I don't have this issue on Gen 2. – Umar.H Jan 20 '20 at 14:53
  • Hi @danialk, after using your script ("Notebook script to run to unmount all mounts in /mnt:"), when I run `display(dbutils.fs.ls("/mnt/MLRExtract/excel_v1.xlsx"))` my output comes as `wasbs://paycnt@sdvstr01.blob.core.windows.net/mnt/MLRExtract/excel_v1.xlsx`, not as earlier: `dbfs://mnt/MLRExtract/excel_v1.xlsx`. Please suggest. – kanishk kashyap May 04 '22 at 12:17

We have experienced the same issue when connecting to an Azure Generation 2 storage account (without hierarchical namespaces).

The error seems to occur when switching the Databricks Runtime from 5.5 to 6.x. However, we have not been able to pinpoint the exact reason; we assume some functionality might have been deprecated.
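If you want to confirm which runtime a given cluster or job actually reports before digging further, one option is a quick check of the DATABRICKS_RUNTIME_VERSION environment variable (assuming the runtime sets it, which ours did):

import os

# Print the runtime version the cluster reports (e.g. '5.5' or '6.1');
# DATABRICKS_RUNTIME_VERSION is assumed to be set by the runtime
print(os.environ.get("DATABRICKS_RUNTIME_VERSION", "unknown"))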

bramb

Updated answer: With Azure Data Lake Gen1 storage accounts, dbutils has access to the ADLS Gen1 tokens/access credentials, so the file listing within the mount point works, whereas standard Python API calls do not have access to those credentials or the Spark configuration. The first call that you see is listing folders, and it does not make any calls to the ADLS APIs.
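If that is the cause, a practical workaround is to do the listing through dbutils (which does carry the credentials) and only map to local /dbfs paths when a file actually needs to be read. A rough sketch using the mount path from the question:

# List through dbutils instead of os.listdir, since dbutils carries the
# ADLS Gen1 credentials that the local /dbfs view does not have
entries = dbutils.fs.ls("dbfs:/mnt/my_fs/")

# Convert DBFS URIs to local /dbfs/... paths for any local-file API that needs them
local_paths = [e.path.replace("dbfs:/", "/dbfs/", 1) for e in entries]
print(local_paths)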

I have tested this in Databricks Runtime 6.1 (includes Apache Spark 2.4.4, Scala 2.11).

The commands work as expected without any error message.

[screenshot: the commands and their output]

Update: [screenshot: output for the folders inside the mount]

Hope this helps. Could you please try it and let us know?

CHEEKATLAPRADEEP
  • Could you please add a screenshot with the complete error message to the question? And could you also share the mount point source location and the DBFS API command which works? – CHEEKATLAPRADEEP Nov 25 '19 at 09:54
  • Thanks for the update, I will look into this shortly. – CHEEKATLAPRADEEP Nov 25 '19 at 10:19
  • I'm able to retrieve the files located inside the folder. This issue looks strange. – CHEEKATLAPRADEEP Nov 25 '19 at 10:43
  • I will try the same with an ADLS Gen1 storage mount and check if the issue persists. – CHEEKATLAPRADEEP Nov 25 '19 at 10:49
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/203015/discussion-between-cheekatlapradeep-msft-and-datanovice). – CHEEKATLAPRADEEP Nov 25 '19 at 10:55
  • Sorry for the late response. I have tried from my end, but it throws a different error message [OSError: [Errno 14] Bad address: '/dbfs/mnt/adlsgen1'] than yours. I will try out different possibilities and try to reach out to the product team for more information. This could take some time. If you need immediate assistance, please open a support ticket. – CHEEKATLAPRADEEP Nov 26 '19 at 10:36
  • Sure, I will definitely update once I have an answer for this issue. – CHEEKATLAPRADEEP Nov 26 '19 at 10:39
  • Got a response from PG: dbutils has access to the ADLS Gen1 tokens/access creds, hence the file listing within the mount point works, whereas standard Python API calls do not have access to the creds/Spark conf; the first call that you see is listing folders and is not making any calls to the ADLS APIs. – CHEEKATLAPRADEEP Dec 03 '19 at 04:49
  • As per the conversation with PG, they have confirmed that ADLS Gen1 works with dbutils.fs.ls and doesn't work with os.listdir because Python API calls do not have access to the creds/Spark configuration. Hope this helps. – CHEEKATLAPRADEEP Dec 03 '19 at 06:53
  • Downvoting because this does appear to be broken on >5.5; mounted Gen1s are listable on 5.5 at least. If this is a bug, it should be fixed; if it is a feature, it's unwanted. Thanks. – reim Mar 03 '20 at 10:54
  • Additionally, listing fails on 5.5 for a lake where we don't have root dir rwx, while this works on >5.5 - what a Databricks disaster. – reim Mar 03 '20 at 11:17
  • On 6.4, Python listing on the Gen1 seems to be working; thanks. – reim Mar 03 '20 at 11:42
  • @reim I have to work on a Gen1 project in a week, so that's great to know, thanks! – Umar.H Jul 10 '20 at 11:32