
I am working on some code on my local machine in PyCharm. Execution happens on a Databricks cluster, while the data is stored in Azure Data Lake.

Basically, I need to list the files in an Azure Data Lake directory and then apply some reading logic to those files. For this I am using the code below:

sc = spark.sparkContext
hadoop = sc._jvm.org.apache.hadoop

# Hadoop FileSystem API, accessed through the py4j JVM gateway
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()

# list the files under the Data Lake directory
path = hadoop.fs.Path('adl://<Account>.azuredatalakestore.net/<path>')
for f in fs.get(conf).listStatus(path):
    print(f.getPath(), f.getLen())

The above code runs fine in Databricks notebooks, but when I try to run the same code from PyCharm using databricks-connect I get the following error:

"Wrong FS expected: file:///....."

After some digging it turns out that the code is looking on my local drive to resolve the path. I had a similar issue with the Python libraries (os, pathlib).
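
For what it's worth, even resolving the FileSystem from the path's own URI (a minimal sketch below, using the standard Hadoop Path.getFileSystem call; untested on my side under databricks-connect) would presumably still be served by my local JVM's Hadoop configuration rather than by the cluster:

# Sketch only: bind the FileSystem to the path's own URI instead of the
# default (local) filesystem. Under databricks-connect this call still
# runs in the local JVM, so it would need the adl:// driver and
# credentials configured locally.
path = hadoop.fs.Path('adl://<Account>.azuredatalakestore.net/<path>')
fs_for_path = path.getFileSystem(conf)
for f in fs_for_path.listStatus(path):
    print(f.getPath(), f.getLen())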

I have no issues running other code on the cluster.

I need help figuring out how to run this so that it searches the Data Lake and not my local machine.

Also, the azure-datalake-store client is not an option due to certain restrictions.


1 Answer


You may use the following (Scala), which drives the listing through Spark's InMemoryFileIndex:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, FileSystem}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI

def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)

  // Ensure each path fragment starts with a slash before merging
  def validated(path: String): Path = {
    if (path.startsWith("/")) new Path(path)
    else new Path("/" + path)
  }

  // Expand the glob against the base path and list all leaf files
  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark)

  fileCatalog.flatMap(_._2.map(_.path))
}

val root = "/mnt/{path to your file directory}"
val globp = "[^_]*"

val files = listFiles(root, globp)
files.toDF("path").show()
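
The glob pattern [^_]* skips files whose names start with an underscore (for example _SUCCESS markers). Note that InMemoryFileIndex.bulkListLeafFiles is an internal Spark API, so its exact signature can vary between Spark versions; when the number of paths exceeds Spark's parallel partition discovery threshold, it should perform the listing as a Spark job on the cluster's executors rather than on the client.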