I ran into this issue for a reason similar to this user:
http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-NoClassDefFoundError-is-this-a-bug-td18972.html
I was calling a method on an object that had a few variables defined on the object itself, including spark and a logger, like this:
val spark = SparkSession
  .builder()
  .getOrCreate()
val logger = LoggerFactory.getLogger(this.getClass.getName)
The function I was calling called another function on the object, which called another, which in turn called yet another function on the object inside a flatMap call on an RDD.
I was getting the NoClassDefFoundError in a stack trace where the previous two calls were functions on the very class Spark was telling me did not exist.
Based on the conversation linked above, my hypothesis was that the global spark reference had not been initialized by the time the function that used it was called (the one that resulted in the NoClassDefFoundError).
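As a rough illustration of how a class that clearly exists can still be reported with NoClassDefFoundError, here is a minimal sketch without Spark. It assumes the failure comes from an object whose initializer throws; BrokenGlobals, InitFailureDemo and probe are hypothetical names, not anything from my actual code. On the JVM, the first access to such an object fails with ExceptionInInitializerError, and every later access fails with NoClassDefFoundError.

```scala
// Hypothetical stand-in for an object whose field initializer fails,
// for example because a resource is not available where the code runs.
object BrokenGlobals {
  val resource: String =
    throw new IllegalStateException("resource not available here")
}

object InitFailureDemo {
  // Touch BrokenGlobals and report which Throwable the JVM raises.
  def probe(): String =
    try s"ok: ${BrokenGlobals.resource}"
    catch { case t: Throwable => t.getClass.getSimpleName }

  def main(args: Array[String]): Unit = {
    println(probe()) // first access: the initializer itself fails
    println(probe()) // later accesses: the class is already marked unusable
  }
}
```

If the first error is swallowed or scrolls out of the logs, only the NoClassDefFoundError remains visible, which may be why the stack trace looks so confusing.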
After quite a few experiments, I found that this pattern resolved the problem:
// Move global definitions here
object MyClassGlobalDef {
  val spark = SparkSession
    .builder()
    .getOrCreate()
  val logger = LoggerFactory.getLogger(this.getClass.getName)
}
// Bring the globals into scope (the import alone does not run the initializer; the object is initialized on first access)
import MyClassGlobalDef._
object MyClass {
  // Functions here
}
It's kind of ugly, but Spark seems to like it.
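To make the pattern concrete, here is a sketch of what the functions inside MyClass might look like. splitWords and countWords are hypothetical examples, not my real functions, and the code assumes Spark and SLF4J are on the classpath and that the master is supplied by spark-submit. My reading of why this helps (an assumption, not something I have verified in Spark's source) is that the function called inside flatMap now lives on an object whose initializer no longer touches spark or logger, so loading MyClass on an executor no longer tries to build a SparkSession there.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory

// Globals live in their own object, as in the pattern above.
object MyClassGlobalDef {
  val spark = SparkSession
    .builder()
    .getOrCreate()
  val logger = LoggerFactory.getLogger(this.getClass.getName)
}

import MyClassGlobalDef._

object MyClass {
  // Hypothetical helper that runs on the executors inside flatMap.
  // It uses neither spark nor logger.
  def splitWords(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).toSeq

  // Hypothetical driver-side function: spark and logger are only used here.
  def countWords(lines: Seq[String]): Long = {
    val rdd: RDD[String] = spark.sparkContext.parallelize(lines)
    val words = rdd.flatMap(line => splitWords(line))
    val n = words.count()
    logger.info(s"counted $n words")
    n
  }
}
```

Note that if a function running inside flatMap did use spark or logger, MyClassGlobalDef would be initialized on the executor and the same kind of failure could come back.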