
I have set up databricks-connect version 5.5.0. This runtime includes Scala 2.11 and Spark 2.4.3. All the Spark code I have written so far has executed correctly and without any issues, until I tried calling sparkContext.wholeTextFiles. The error that I get is the following:

Exception in thread "main" java.lang.NoClassDefFoundError: shaded/databricks/v20180920_b33d810/com/google/common/base/Preconditions
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.ensureAuthority(AzureBlobFileSystem.java:775)
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:500)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:469)
    at org.apache.spark.SparkContext$$anonfun$wholeTextFiles$1.apply(SparkContext.scala:997)
    at org.apache.spark.SparkContext$$anonfun$wholeTextFiles$1.apply(SparkContext.scala:992)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:820)
    at org.apache.spark.SparkContext.wholeTextFiles(SparkContext.scala:992)
    ...
Caused by: java.lang.ClassNotFoundException: shaded.databricks.v20180920_b33d810.com.google.common.base.Preconditions
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    ... 20 more
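
For reference, this is essentially the call that triggers it. This is only a minimal sketch: the SparkSession setup follows the standard databricks-connect pattern, and the abfss:// path is a placeholder in the same form as the one that appears in the error further down.

    import org.apache.spark.sql.SparkSession

    // With databricks-connect, the builder picks up the remote cluster settings
    // produced by `databricks-connect configure`.
    val spark = SparkSession.builder().getOrCreate()

    // This is the call that fails with the NoClassDefFoundError shown above.
    // The path is a placeholder mirroring the one from the error message.
    val files = spark.sparkContext.wholeTextFiles(
      "abfss://abc-fs@cloud.dfs.core.windows.net/paths/something")

    files.keys.take(5).foreach(println)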

One attempt at solving the problem was to move to the latest Databricks runtime, which at the time of this writing is 6.5. That didn't help. I then worked my way back through earlier versions (6.4 and 6.3), since they use different Spark versions, but to no avail.

Another thing that I tried was adding "com.google.guava" % "guava" % "23.0" as a dependency to my build.sbt (a sketch of that change is shown after the error below). That in itself results in errors like:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: abfss://abc-fs@cloud.dfs.core.windows.net/paths/something, expected: file:///

I feel that going down the road of satisfying each and every dependency that is somehow not included in the jar may not be the best option.

I wonder if someone has had a similar experience and, if so, how they solved it.

I am willing to give more context if that is necessary.

Thank you!

zaxme
  • Does this answer your question? [Can't connect to Azure Data Lake Gen2 using PySpark and Databricks Connect](https://stackoverflow.com/questions/56702280/cant-connect-to-azure-data-lake-gen2-using-pyspark-and-databricks-connect) – tomconte Sep 25 '20 at 08:27
  • Unfortunately not. I have spoken to Databricks developers and they admit that databricks-connect in its current state is rather flawed. Hence I have moved away from it. – zaxme Sep 25 '20 at 09:22

0 Answers