
I just ran this:

dbutils.fs.ls("dbfs:/FileStore/")

I see this result:

[FileInfo(path='dbfs:/FileStore/import-stage/', name='import-stage/', size=0),
 FileInfo(path='dbfs:/FileStore/jars/', name='jars/', size=0),
 FileInfo(path='dbfs:/FileStore/job-jars/', name='job-jars/', size=0),
 FileInfo(path='dbfs:/FileStore/plots/', name='plots/', size=0),
 FileInfo(path='dbfs:/FileStore/tables/', name='tables/', size=0)]

Shouldn't there be something in FileStore? I have hundreds of GB of data in a lake. I am having all kinds of problems getting Databricks to find these files. When I use Azure Data Factory, everything works perfectly fine. It's starting to drive me crazy!

For instance, when I run this:

dbutils.fs.ls("/mnt/rawdata/2019/06/28/parent/")

I get this message:

java.io.FileNotFoundException: File/6199764716474501/mnt/rawdata/2019/06/28/parent does not exist.

I have tens of thousands of files in my lake! I can't understand why I can't get a listing of these files!

ASH

1 Answer


In Azure Databricks, this is expected behaviour.

  • For files, it displays the actual file size.
  • For directories, it displays size=0.

Example: In dbfs:/FileStore/ I have three files (shown in white) and three folders (shown in blue). Check the file sizes using the Databricks CLI:

dbfs ls -l dbfs:/FileStore/

[Screenshot: dbfs ls -l output listing the actual size for each file and 0 for each directory]

When you check the same listing using dbutils:

dbutils.fs.ls("dbfs:/FileStore/")

[Screenshot: dbutils.fs.ls output showing size=0 for each directory]
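To see the same distinction programmatically, here is a minimal sketch (it assumes a Databricks notebook, where dbutils is available): directories come back with a trailing '/' and size=0, while files report their real size in bytes.

# FileInfo entries returned by dbutils.fs.ls() expose path, name and size;
# directory names end with '/' and always report size=0.
for entry in dbutils.fs.ls("dbfs:/FileStore/"):
    kind = "dir " if entry.name.endswith("/") else "file"
    print(kind, entry.size, entry.path)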

Important points to remember when working with files larger than 2 GB:

  • Local file I/O APIs support only files smaller than 2 GB. If you use local file I/O APIs to read or write files larger than 2 GB, you might see corrupted files. Instead, access files larger than 2 GB using the DBFS CLI, dbutils.fs, or Spark APIs, or use the /dbfs/ml folder described in Local file APIs for deep learning.
  • If you write a file using the local file I/O APIs and then immediately try to access it with the DBFS CLI, dbutils.fs, or Spark APIs, you might encounter a FileNotFoundException, a file of size 0, or stale file contents. That is expected because the OS caches writes by default. To force those writes to be flushed to persistent storage (in our case DBFS), use the standard Unix system call sync (see the sketch right after this list).
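A minimal sketch of that second point (the path /dbfs/FileStore/tmp/example.txt is purely hypothetical, used only for illustration; the code assumes a Databricks notebook, where dbutils and the /dbfs FUSE mount are available):

import os

# Make sure the target folder exists, then write through the local /dbfs FUSE
# mount -- this is a "local file I/O API" write. The path is a hypothetical example.
os.makedirs("/dbfs/FileStore/tmp", exist_ok=True)
with open("/dbfs/FileStore/tmp/example.txt", "w") as f:
    f.write("written via local file I/O\n")

# Flush cached writes to persistent storage (DBFS) before reading the file back
# through dbutils, the DBFS CLI, or Spark; otherwise you may see a
# FileNotFoundException, a 0-byte file, or stale contents.
os.sync()

print(dbutils.fs.ls("dbfs:/FileStore/tmp/example.txt"))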

There are multiple ways to solve this issue. You may check out a similar SO thread answered by me.
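For the FileNotFoundException on /mnt/rawdata specifically, a quick first check (a sketch, assuming a notebook attached to a running cluster) is to confirm that the mount point actually exists on that cluster:

# List every mount point visible to this cluster. If /mnt/rawdata is not in the
# list, the mount was never created here, so dbutils.fs.ls("/mnt/rawdata/...")
# cannot find anything even though the data exists in the lake.
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)

If /mnt/rawdata is missing from that list, the storage container has to be mounted (or read directly through its wasbs:// or abfss:// URL) before the listing will work.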

Hope this helps.

CHEEKATLAPRADEEP