I'm having a problem using spark.ml.util.SchemaUtils on Spark v1.6.0. I get the following error:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.ml.util.SchemaUtils$.appendColumn(Lorg/apache/spark/sql/types/StructType;Ljava/lang/String;Lorg/apache/spark/sql/types/DataType;)Lorg/apache/spark/sql/types/StructType;
        at org.apache.spark.ml.SchemaTest$.main(SchemaTest.scala:17)

when running this minimal example on my cluster (inspired by the library I ultimately want to use):

package org.apache.spark.ml

import org.apache.spark.ml.util.SchemaUtils
import org.apache.spark.sql.types._
import org.apache.spark.mllib.linalg.VectorUDT

object SchemaTest {

  def main(args: Array[String]): Unit = {
    val schema: StructType =
      StructType(
        StructField("a", IntegerType, true) :: StructField("b", LongType, false) :: Nil
      )

    val transformed = SchemaUtils.appendColumn(schema, "test", new VectorUDT()) // this call throws the NoSuchMethodError on the cluster

  }
}

However, the same example launched locally on my desktop runs without problems.

From what I saw online (for example here), this kind of error message is often linked to a version mismatch between the compilation and runtime environments, but my program, my local Spark distribution, and my cluster distribution all use the same Spark & MLlib version (1.6.0), the same Scala version (2.10.6), and the same Java version (7).

I checked the Spark 1.6.0 source code, and appendColumn does exist in org.apache.spark.ml.util.SchemaUtils with the right signature (although SchemaUtils is not mentioned in the org.apache.spark.ml.util API documentation).
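
A small runtime check along these lines (the SchemaUtilsCheck object name is just for illustration) can print which jar the class is actually loaded from and which appendColumn overloads are visible:

object SchemaUtilsCheck {
  def main(args: Array[String]): Unit = {
    // Scala compiles a singleton object to a class named with a trailing "$".
    val clazz = Class.forName("org.apache.spark.ml.util.SchemaUtils$")
    // Where the class was loaded from, i.e. the jar on the effective classpath.
    Option(clazz.getProtectionDomain.getCodeSource).foreach(cs => println(cs.getLocation))
    // The appendColumn overloads visible through reflection.
    clazz.getMethods.filter(_.getName == "appendColumn").foreach(println)
  }
}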

ETA: Extract from my pom.xml file:

<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.10.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.6.0</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>1.6.0</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.10</artifactId>
        <version>1.6.0</version>
        <scope>provided</scope>
    </dependency>

</dependencies>

1 Answer


You need to check all the places where you can change the classpath for the jobs running on the cluster. For example, these properties:

  • spark.driver.extraClassPath
  • spark.driver.userClassPathFirst
  • spark.executor.extraClassPath
  • spark.executor.userClassPathFirst
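
A quick way to see what the running application actually picked up for these settings is to print them from the driver; a minimal sketch (the ConfCheck object and app name are just illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object ConfCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("conf-check"))
    // Print the classpath-related settings the driver actually sees.
    Seq("spark.driver.extraClassPath",
        "spark.driver.userClassPathFirst",
        "spark.executor.extraClassPath",
        "spark.executor.userClassPathFirst")
      .foreach(k => println(s"$k = ${sc.getConf.getOption(k).getOrElse("<not set>")}"))
    sc.stop()
  }
}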

You should also examine your assembly process to ensure that the dependencies you think you're packaging are in fact what you're packaging.

– Vidya
  • The conf files on my local Spark are empty, and the Environment tab of the Spark UI on the cluster does not show any classpath parameters, so I assume both versions use the default values. My packaging also seems to have the correct versions for everything (I added the pom dependencies to my OP). – datasock Mar 17 '17 at 09:40