
I am attempting to produce a Spark DataFrame from within a Spark session that has been initialised via Apache Livy.

I first noticed the issue on this more complicated HBase call:

    import spark.implicits._

    ...

    spark.sparkContext
      .newAPIHadoopRDD(
        conf,
        classOf[TableInputFormat],
        classOf[ImmutableBytesWritable],
        classOf[Result]
      )
      .toDF()

But I found I could trigger the same error with a simple:

    import spark.implicits._

    ...

    val filtersDf = filters.toDF()

where `filters` is just a sequence of case classes.

The common factor is the `.toDF()` call; however, the error also occurs with `.toDS()`, which makes me think that the implicit resolution triggered by `import spark.implicits._` is not working. The underlying objects to be converted to DataFrames do contain data.
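For reference, here is a minimal standalone sketch of the failing pattern. The `Filter` case class and the sample data are made up, and in my real setup the session comes from Livy rather than being built directly:

    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

    // Hypothetical case class standing in for my real one; defined at the
    // top level (not inside a method) so encoder derivation can see it.
    case class Filter(name: String, value: String)

    object ToDFRepro {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("toDF-repro").getOrCreate()
        import spark.implicits._

        val filters = Seq(Filter("country", "NZ"), Filter("year", "2019"))

        // Both calls go through the same encoder machinery that relies on
        // Scala runtime reflection, and both throw for me under Livy:
        val filtersDf: DataFrame = filters.toDF()
        val filtersDs: Dataset[Filter] = filters.toDS()

        filtersDf.show()
        spark.stop()
      }
    }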

The stack trace suggests the failure happens during runtime implicit resolution via Scala runtime reflection.

Note that I have checked that both Spark and my compiled code use the same version of Scala (2.11).
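(A quick way to sanity-check that at runtime, printed from inside the job itself:)

    // Prints the Scala version the job is actually running on, e.g. "2.11.8".
    println(scala.util.Properties.versionNumberString)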

The exception I get is:

java.lang.RuntimeException: java.util.NoSuchElementException: head of empty list
scala.collection.immutable.Nil$.head(List.scala:420)
scala.collection.immutable.Nil$.head(List.scala:417)
scala.collection.immutable.List.map(List.scala:277)
scala.reflect.internal.Symbols$Symbol.parentSymbols(Symbols.scala:2117)
scala.reflect.internal.SymbolTable.openPackageModule(SymbolTable.scala:301)
scala.reflect.internal.SymbolTable.openPackageModule(SymbolTable.scala:341)
scala.reflect.runtime.SymbolLoaders$LazyPackageType$$anonfun$complete$2.apply$mcV$sp(SymbolLoaders.scala:74)
scala.reflect.runtime.SymbolLoaders$LazyPackageType$$anonfun$complete$2.apply(SymbolLoaders.scala:71)
scala.reflect.runtime.SymbolLoaders$LazyPackageType$$anonfun$complete$2.apply(SymbolLoaders.scala:71)
scala.reflect.internal.SymbolTable.slowButSafeEnteringPhaseNotLaterThan(SymbolTable.scala:263)
scala.reflect.runtime.SymbolLoaders$LazyPackageType.complete(SymbolLoaders.scala:71)
scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514)
scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1.scala$reflect$runtime$SynchronizedSymbols$SynchronizedSymbol$$super$info(SynchronizedSymbols.scala:174)
scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anonfun$info$1.apply(SynchronizedSymbols.scala:127)
scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anonfun$info$1.apply(SynchronizedSymbols.scala:127)
scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19)
scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16)
scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$class.gilSynchronizedIfNotThreadsafe(SynchronizedSymbols.scala:123)
scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1.gilSynchronizedIfNotThreadsafe(SynchronizedSymbols.scala:174)
scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$class.info(SynchronizedSymbols.scala:127)
scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1.info(SynchronizedSymbols.scala:174)
scala.reflect.internal.Types$TypeRef.thisInfo(Types.scala:2194)
scala.reflect.internal.Types$TypeRef.baseClasses(Types.scala:2199)
scala.reflect.internal.tpe.FindMembers$FindMemberBase.<init>(FindMembers.scala:17)
scala.reflect.internal.tpe.FindMembers$FindMember.<init>(FindMembers.scala:219)
scala.reflect.internal.Types$Type.scala$reflect$internal$Types$Type$$findMemberInternal$1(Types.scala:1014)
scala.reflect.internal.Types$Type.findMember(Types.scala:1016)
scala.reflect.internal.Types$Type.memberBasedOnName(Types.scala:631)
scala.reflect.internal.Types$Type.member(Types.scala:600)
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:66)
scala.reflect.internal.Mirrors$RootsBase.staticPackage(Mirrors.scala:204)
scala.reflect.runtime.JavaMirrors$JavaMirror.staticPackage(JavaMirrors.scala:82)
scala.reflect.internal.Mirrors$RootsBase.init(Mirrors.scala:263)
scala.reflect.runtime.JavaMirrors$class.scala$reflect$runtime$JavaMirrors$$createMirror(JavaMirrors.scala:32)
scala.reflect.runtime.JavaMirrors$$anonfun$runtimeMirror$1.apply(JavaMirrors.scala:49)
scala.reflect.runtime.JavaMirrors$$anonfun$runtimeMirror$1.apply(JavaMirrors.scala:47)
scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19)
scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16)
scala.reflect.runtime.JavaMirrors$class.runtimeMirror(JavaMirrors.scala:46)
scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:16)
scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:16)

My working assumption is that I am missing a dependency or import and that this is some kind of Scala-ism.

I have yet to find any other references to this issue. Ultimately I think it probably comes down to imports/dependencies, but so far I can't see what is missing. Any help is greatly appreciated. I'm keen to know either how to fix the issue or, alternatively, how to create DataFrames via less magical approaches than `toDF()`.
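For example, I assume that building the DataFrame with an explicit schema via `spark.createDataFrame` would sidestep the implicit encoders entirely, something along these lines (untested under Livy; the field names come from the hypothetical `Filter` above):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val spark = SparkSession.builder().appName("explicit-schema").getOrCreate()

    // An explicit schema means no TypeTag and no runtime reflection.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("value", StringType, nullable = false)
    ))

    // Rows built by hand from the same sample data.
    val rows = Seq(Row("country", "NZ"), Row("year", "2019"))

    val filtersDf = spark.createDataFrame(
      spark.sparkContext.parallelize(rows),
      schema
    )

    filtersDf.show()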

Spark info:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2.0-mapr-1901
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
  • Can you check the count of the RDD before toDF() and check the results? Like spark.sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result]).count() – Apurba Pandey Mar 25 '19 at 05:46
  • I have checked and the RDD has data. But I don't think it is the RDD itself; it is more likely related to the `toDF` call, which comes via an implicit. If you look closely at the error, it is a Scala reflection error. – ZenMasterZed Mar 25 '19 at 07:12

1 Answer


I ran into this error on the same version of Spark; however, I was seeing it while reading a CSV from HDFS. This is an example of what I was doing:

    val csv: DataFrame = ss
      .read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .csv(filePath)
    println(csv.count())

This is where I saw the error originate from within Spark.

I produced a small example of the failure and tried to isolate the cause. I was using the Livy Scala programmatic API to submit jobs to Spark, and I found that it was failing due to the types I passed as parameters to Spark via Livy, which makes sense given that this is a Scala reflection error.

For example, this failed:

    case class FailingJob(someSeq: Seq[String], filePath: String) {
      ...
      def call(scalaJobContext: ScalaJobContext): Unit = {
        // It doesn't really matter what I do here; the main thing is
        // that the seq is used in some way.
        val mappedSeq = someSeq.map(s => s.toUpperCase())

        val ss: SparkSession = scalaJobContext.sparkSession

        val csv: DataFrame = ss
          .read
          .option("header", "true")
          .option("mode", "DROPMALFORMED")
          .csv(filePath)

        println(csv.count())
    ...

    for {
      _ <- livyClient.submit(FailingJob(someSeq, path).call)
    ...

Whereas this was successful:

    case class SuccessfulJob(someArray: Array[String], filePath: String) {
      ...
      def call(scalaJobContext: ScalaJobContext): Unit = {
        // It doesn't really matter what I do here; the main thing is
        // that the array is used in some way.
        val mappedSeq = someArray.map(s => s.toUpperCase())

        val ss: SparkSession = scalaJobContext.sparkSession

        val csv: DataFrame = ss
          .read
          .option("header", "true")
          .option("mode", "DROPMALFORMED")
          .csv(filePath)

        println(csv.count())
    ...

    for {
      _ <- livyClient.submit(SuccessfulJob(someArray, path).call)
    ...

So if I pass in a parameter of type `Seq`, the job fails, which makes me think this is an issue with serialization/deserialization within Kryo. The other thing to note is that the error is not thrown if I reference the value as a property on an object. I've tried upgrading to Spark 2.4 with no luck. I'm using Livy version 0.6.0-incubating. My workaround currently is to transform the objects to use `Array` types rather than `Seq`. I suspect other Scala-specific collection types will also fail, although I have not tried this yet.
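In practice the workaround is just a conversion at the call site before the job object is constructed; a sketch, using the names from the examples above:

    // Convert the Seq to an Array before handing it to Livy, so the payload
    // serialized for the remote Spark side contains Array[String] rather
    // than Seq[String].
    val someArray: Array[String] = someSeq.toArray

    val handle = livyClient.submit(SuccessfulJob(someArray, path).call)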

Here is my reproduction of the issue, which includes the workaround. I appreciate this doesn't answer the question, but hopefully it helps someone else struggling to debug the issue in other scenarios. I've also submitted an issue with Livy to see if they can give some more insight into what's going on.
