7

In our project we're using com.typesafe:config in version 1.3.4. According to the latest release notes, this dependency is already provided by Databricks on the cluster, but in a very old version (1.2.1). How can I overwrite the provided dependency with our own version?

We use Maven; in our dependencies I have:

<dependency>
    <groupId>com.typesafe</groupId>
    <artifactId>config</artifactId>
    <version>1.3.4</version>
</dependency>

The jar file we build should therefore contain the newer version.

I created a Job by uploading the jar file. The Job fails because it can't find a method that was added after version 1.2.1, so it looks like the library we provided is shadowed by the older version on the cluster.

pgruetter
  • Without knowing better: is the Databricks dependency also defined in pom.xml? Or is it a dependency provided by the deployment environment? – pirho Dec 19 '19 at 19:43
  • No, it's not defined in our pom.xml. A lot of libraries are pre-installed on the deployment environment according to which version of the Databricks runtime version you choose. – pgruetter Dec 20 '19 at 07:09
  • @pgruetter did you ever fix this? If so, how? Thanks! – Oscar Bonilla May 19 '20 at 17:47
  • @OscarBonilla: Yes, forgot to update. We did fix it, see my new answer. Hope that helps. – pgruetter May 27 '20 at 06:04

3 Answers

3

In the end, we fixed this by shading the relevant classes, adding the following to our build.sbt:

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.typesafe.config.**" -> "shadedSparkConfigForSpark.@1").inAll
)
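
The rename rule relocates the com.typesafe.config classes (and, because of inAll, the references to them in our own compiled classes) into a new package inside the assembly jar, so they no longer collide with the older version provided on the cluster. As a quick sanity check, here is a minimal sketch (not part of our actual setup; the names simply follow the rename rule above) that logs which copy of the library the job actually resolved at runtime:

import com.typesafe.config.ConfigFactory

object ShadingCheck {
  def main(args: Array[String]): Unit = {
    // In the shaded assembly this reference is rewritten to the relocated package,
    // so the printed location should be our own jar, not the cluster's copy.
    val cfgClass = classOf[ConfigFactory]
    println(s"Loaded ${cfgClass.getName} from ${cfgClass.getProtectionDomain.getCodeSource.getLocation}")
  }
}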
Oscar Bonilla
3

We solved it in the end by utilizing Spark's ChildFirstURLClassLoader. The project is open source, so you can check it out yourself here, and the usage of the method here.

But for reference, here is the method in its entirety. You need to provide a Seq of the jars that you want to override with your own; in our case it's the Typesafe config jar.

import java.net.{URL, URLClassLoader}
import scala.annotation.tailrec
import org.apache.spark.util.ChildFirstURLClassLoader  // Spark's child-first classloader

// Environment, logger and ConfigurationException are helpers from our own project.
def getChildFirstClassLoader(jars: Seq[String]): ChildFirstURLClassLoader = {
  val initialLoader = getClass.getClassLoader.asInstanceOf[URLClassLoader]

  @tailrec
  def collectUrls(clazz: ClassLoader, acc: Map[String, URL]): Map[String, URL] = {

    // add urls on this level to accumulator
    val urlsAcc: Map[String, URL] = acc ++
      clazz.asInstanceOf[URLClassLoader].getURLs
        .map(url => (url.getFile.split(Environment.defaultPathSeparator).last, url))
        .filter { case (name, url) => jars.contains(name) }
        .toMap

    // check if any jars without URL are left
    val jarMissing = jars.exists(jar => urlsAcc.get(jar).isEmpty)
    // return accumulated if there is no parent left or no jars are missing anymore
    if (clazz.getParent == null || !jarMissing) urlsAcc else collectUrls(clazz.getParent, urlsAcc)
  }

  // search classpath hierarchy until all jars are found or we have reached the top
  val urlsMap = collectUrls(initialLoader, Map())

  // check if everything found
  val jarsNotFound = jars.filter(jar => urlsMap.get(jar).isEmpty)
  if (jarsNotFound.nonEmpty) {
    logger.info(s"""available jars are ${initialLoader.getURLs.mkString(", ")} (not including parent classpaths)""")
    throw ConfigurationException(s"""jars ${jarsNotFound.mkString(", ")} not found in parent class loaders classpath. Cannot initialize ChildFirstURLClassLoader.""")
  }
  // create child-first classloader
  new ChildFirstURLClassLoader(urlsMap.values.toArray, initialLoader)
}

As you can see, it also contains some logic to abort if the jar files you specified cannot be found on the classpath.
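
For illustration, a minimal usage sketch (the jar file name and the reflective call are assumptions for this example, not taken from the project):

// Build a loader that prefers the bundled Typesafe config jar over the cluster's copy.
// "config-1.3.4.jar" is an assumed file name; it must match the jar on the classpath.
val childFirstLoader = getChildFirstClassLoader(Seq("config-1.3.4.jar"))

// Classes loaded through this loader are resolved against version 1.3.4 first.
val configFactoryClass = childFirstLoader.loadClass("com.typesafe.config.ConfigFactory")
val config = configFactoryClass.getMethod("load").invoke(null)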

pgruetter
0

Databricks supports initialization scripts (cluster-scoped or global) that let you install or remove any dependency. The details are at https://docs.databricks.com/clusters/init-scripts.html.

In your initialization script, you can remove the default jar file located on the Databricks driver/executor classpath under /databricks/jars/ and add the expected version there.

nathluu