
The goal is to read a file as a byte string within Databricks from an ADLS mount point.

Confirming the ADLS mount point

First, running dbutils.fs.mounts() confirms that the following mount exists:

... MountInfo(mountPoint='/mnt/ftd', source='abfss://ftd@omitted.dfs.core.windows.net/', encryptionType=''), ...
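For reference, a small programmatic check along these lines (a sketch; dbutils is only available inside a Databricks notebook or job) can confirm the mount is present before any reads are attempted:

mounts = dbutils.fs.mounts()
# Fail fast if /mnt/ftd is not among the configured mount points.
assert any(m.mountPoint == '/mnt/ftd' for m in mounts), '/mnt/ftd is not mounted'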

Confirming the existence of the file

The file in question is titled TruthTable.csv, and its location has been confirmed using the following command:

dbutils.fs.ls('/mnt/ftd/TruthTable.csv')

which returns:

[FileInfo(path='dbfs:/mnt/ftd/TruthTable.csv', name='TruthTable.csv', size=156)]

Confirming the readability of the file

To confirm that the file can be read, we can run the following snippet:

filePath = '/mnt/ftd/TruthTable.csv'
spark.read.format('csv').option('header','true').load(filePath)

which successfully returns

DataFrame[p: string, q: string, r: string, s: string]

The problem

As the goal is to read the file as a byte string, the following snippet should succeed; however, it does not.

filePath = '/mnt/ftd/TruthTable.csv'
with open(filePath, 'rb') as fin:
  contents = fin.read()
  print(contents)

Executing the snippet above outputs:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/ftd/TruthTable.csv'
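As an extra sanity check (a small sketch, not part of the original post), the standard library confirms that the local file APIs simply do not see the DBFS mount path, even though dbutils.fs.ls() does:

import os

# Expected to print False, consistent with the FileNotFoundError above:
# '/mnt/ftd' is a DBFS path, not a directory on the driver's local filesystem.
print(os.path.exists('/mnt/ftd/TruthTable.csv'))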

The documentation provided by the Databricks team at https://docs.databricks.com/data/databricks-file-system.html#local-file-apis works only for files found in the /tmp/ folder; however, the requirement is to read a file directly from the mount point.

Filip Markoski

2 Answers

0

Please add the /dbfs prefix:

filePath = '/dbfs/mnt/ftd/TruthTable.csv'
with open(filePath, 'rb') as fin:
  contents = fin.read()
  print(contents)

For native Databricks functions (like dbutils), DBFS is used as the default location. When you access the file system directly, you need to add the /dbfs prefix, which is the default mount directory. Alternatively, you can use 'dbfs:/mnt/ftd/TruthTable.csv'. If you use the free Community Edition it will not work at all, as there is no access to the underlying file system; for the Azure, AWS, and Google editions it should work.
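To make the distinction concrete, here is a small sketch (using the paths from the question) of the two path forms: Spark and dbutils resolve DBFS paths directly, while Python's built-in open() goes through the local /dbfs FUSE mount.

spark_path = '/mnt/ftd/TruthTable.csv'        # DBFS path for Spark / dbutils (also 'dbfs:/mnt/ftd/TruthTable.csv')
local_path = '/dbfs/mnt/ftd/TruthTable.csv'   # same file, seen through the local /dbfs mount

# Spark reads the DBFS path directly.
df = spark.read.format('csv').option('header', 'true').load(spark_path)

# Local file APIs such as open() need the /dbfs prefix.
with open(local_path, 'rb') as fin:
    contents = fin.read()
print(contents)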

Hubert Dudek
  • I have in fact tried this approach; however, there is an exception. For instance, `dbutils.fs.ls('/mnt/ftd/MyServerType/MyServerName/LandingExperiments/dbo.SampleTsv/SampleTsv.txt')` confirms the existence of the file, but having `filePath = '/dbfs/mnt/ftd/MyServerType/MyServerName/LandingExperiments/dbo.SampleTsv/SampleTsv.txt'` results in a `FileNotFoundError`. It seems that the mount point cannot access the contents as easily when multiple nested folders are involved in the path. Also, I have access to the paid version of Azure Databricks. – Filip Markoski Nov 17 '21 at 13:44
  • please try the %sh magic command to validate the folder structure step by step with the shell, e.g. %sh ls /, ls /dbfs/mnt/ftd/MyServerType/MyServerName/LandingExperiments/dbo.SampleTsv/ etc. – Hubert Dudek Nov 17 '21 at 13:55
  • After executing multiple ls commands, traversing folder by folder, starting from the root, the Python open-file code executes successfully. However, this is an approach that I have discovered previously. I found an error in my code that simulates this `ls`-based traversal before executing the open-file Python code. I will continue testing the reliability of this approach, because it does not seem 100% reliable. – Filip Markoski Nov 17 '21 at 14:22
  • can you add the final working code as answer to your own question? I have the same issue! – Laurens Koppenol Jan 27 '22 at 08:42
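For readers looking for the workaround discussed in the comments above, a hypothetical sketch of the `ls`-based traversal could look like the following (the nested path is the one from the comments, and per those comments this approach is not guaranteed to be 100% reliable):

# Validate each folder level with dbutils.fs.ls() before using the local file API.
nested = '/mnt/ftd/MyServerType/MyServerName/LandingExperiments/dbo.SampleTsv'
current = ''
for part in nested.strip('/').split('/'):
    current += '/' + part
    dbutils.fs.ls(current)  # raises an exception if this folder does not exist

# Only then fall back to the local /dbfs path for open().
with open('/dbfs' + nested + '/SampleTsv.txt', 'rb') as fin:
    contents = fin.read()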
0

I was able to read the file by replacing the s3a:// bucket prefix with the corresponding /dbfs/mnt/ one.

s3a://s3-bucket/lake/output/dept/2022/09/16/20220916_1643_764250.csv
/dbfs/mnt/output/dept/2022/09/16/20220916_1643_764250.csv

I used this:

_path = _path.replace('s3a://s3-bucket/lake', '/dbfs/mnt')
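Putting the swap and the read together (a sketch using the bucket and mount names from this answer, which will differ in other workspaces):

_path = 's3a://s3-bucket/lake/output/dept/2022/09/16/20220916_1643_764250.csv'
# Swap the S3 prefix for the local /dbfs mount path shown above.
_path = _path.replace('s3a://s3-bucket/lake', '/dbfs/mnt')

with open(_path, 'rb') as fin:
    contents = fin.read()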

Hope it helps.

-ed

Edgardo_SA