I'm getting a ClassCastException when trying to traverse the directories in a mounted Databricks volume.

    java.lang.ClassCastException: com.databricks.backend.daemon.dbutils.FileInfo cannot be cast to com.databricks.service.FileInfo
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
    at scala.collection.TraversableLike.map(TraversableLike.scala:238)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at com.mycompany.functions.UnifiedFrameworkFunctions$.getAllFiles(UnifiedFrameworkFunctions.scala:287)

where the getAllFiles function looks like:

    import com.databricks.service.{DBUtils, FileInfo}
    ...
    def getAllFiles(path: String): Seq[String] = {
        val files = DBUtils.fs.ls(path)
        if (files.isEmpty)
            List()
        else
            files.map(file => { // line where exception is raised
                val path: String = file.path
                if (DBUtils.fs.dbfs.getFileStatus(new org.apache.hadoop.fs.Path(path)).isDirectory) getAllFiles(path)
                else List(path)
            }).reduce(_ ++ _)
    }

Locally it runs fine with Databricks Connect, but when the source code is packaged as a JAR and run on a Databricks cluster, the above exception is raised.

Since the Databricks documentation suggests using com.databricks.service.DBUtils, and DBUtils.fs.ls(path) returns FileInfo from that same service package, is this a bug, or should the API be used in some other way?

I'm using Databricks Connect and Databricks Runtime version 8.1.

marknorkin

2 Answers

As a workaround, I performed the following steps to get the list of file names from the mounted directory.

I have stored 3 files at the "/mnt/Sales/" location.

Step 1: Use the display(dbutils.fs.ls("/mnt/Sales/")) command.

Step 2: Assign the file location to a variable.

Step 3: Load the variable into a DataFrame and get the names of the files, as sketched below.

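Since the original screenshots are no longer available, here is a minimal Scala sketch of the three steps, assuming a Databricks notebook (where display, dbutils, and spark are predefined), CSV files mounted under /mnt/Sales/, and illustrative names such as fileLocation:

    import org.apache.spark.sql.functions.input_file_name

    // Step 1: list the contents of the mounted directory
    display(dbutils.fs.ls("/mnt/Sales/"))

    // Step 2: assign the file location to a variable (path and CSV format are assumptions)
    val fileLocation = "/mnt/Sales/"

    // Step 3: load the files into a DataFrame and extract the distinct file names
    val df = spark.read.format("csv").option("header", "true").load(fileLocation)
    display(df.select(input_file_name().as("file_name")).distinct())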

Abhishek K

You could convert the directory listing (of type Seq[com.databricks.service.FileInfo]) to a string, split the string, and use pattern matching to extract the file names as you traverse the new Array[String]. Using Scala:

    // Join the FileInfo entries into a single string, then split it back into an Array[String]
    val files = dbutils.fs.ls(path).mkString(";").split(";")
    // Capture the file name (e.g. Sales_Record1.csv) from each FileInfo string
    val pattern = "dbfs.*/(Sales_Record[0-9]+.csv)/.*".r
    files.map(file => { val pattern(res) = file; res })

Or you could try

    val pattern = "dbfs.*/(.*.csv)/.*".r

to get all file names ending in .csv. The pattern can be constructed to suit your needs.
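
If some entries might not match the pattern, collect is a safer variant of the map above, since non-matching strings are skipped rather than raising a scala.MatchError. A small sketch, reusing the files array from the first snippet:

    // Skip non-matching entries instead of failing on them
    val csvPattern = "dbfs.*/(.*.csv)/.*".r
    val csvNames = files.collect { case csvPattern(name) => name }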

vando026