
I'm trying to extend the hamOrSpam example (https://github.com/h2oai/sparkling-water/blob/master/examples/scripts/hamOrSpam.script.scala) to make parallel predictions over a large dataset using Spark's distributed computation (during the inference stage, not the training phase).

Below is the code I wrote for this. It works fine in single-node local mode (with `export MASTER="local[*]"`), but fails when I run with `export MASTER="local-cluster[2,2,1024]"`, which spawns 2 worker nodes (to check that the predictions are parallelised).

val data_test = load("smsData.txt") // Should be a large (GBs) test dataset - reusing the training data here just to test the workflow
val message_test = data_test.map(r => r(1))
message_test.take(1000).map(x => isSpam(x, dlModel, hashingTF, idfModel, h2oContext))
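
For context, `isSpam` is the helper defined in the linked script. A condensed sketch of what it does (simplified from memory of the script, so it may not match the current version line-for-line) - each call vectorises one message and converts it to an H2OFrame, i.e. the same RDD/DataFrame-to-H2OFrame conversion that fails below:

import org.apache.spark.h2o._                     // H2OContext, H2OFrame
import org.apache.spark.mllib.feature.{HashingTF, IDFModel}
import hex.deeplearning.DeepLearningModel

// Sketch, not verbatim: tokenize, the SMS case class, sc, and the
// sqlContext implicits for toDF are all defined earlier in the script.
def isSpam(msg: String,
           dlModel: DeepLearningModel,
           hashingTF: HashingTF,
           idfModel: IDFModel,
           h2oContext: H2OContext,
           hamThreshold: Double = 0.5): Boolean = {
  // Vectorize the single message with the tf-idf models fitted at training time
  val msgRdd = sc.parallelize(Seq(msg))
  val msgVector = idfModel.transform(hashingTF.transform(tokenize(msgRdd)))
  // Convert to an H2OFrame - the same conversion as at script line 110
  val msgTable: H2OFrame = h2oContext.asH2OFrame(msgVector.map(v => SMS("?", v)).toDF)
  msgTable.remove(0) // drop the placeholder target column
  // Score with the trained deep learning model
  val prediction = dlModel.score(msgTable)
  prediction.vecs()(1).at(0) < hamThreshold
}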

The code fails when executing `val table: H2OFrame = resultRDD` (https://github.com/h2oai/sparkling-water/blob/master/examples/scripts/hamOrSpam.script.scala#L110).
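
For reference, that line relies on the implicit DataFrame-to-H2OFrame conversion; written out explicitly it is roughly this (a sketch, inferred from the stack trace below):

// The implicit conversion expands to an explicit asH2OFrame call
// (see H2OContextImplicits.asH2OFrameFromDataFrame in the trace).
// asH2OFrame runs a Spark job over the DataFrame's partitions
// (the RDD.reduce inside H2OSchemaUtils in the trace), so the
// REPL-generated classes in the closure must be loadable on the
// executors - which is exactly where the NoClassDefFoundError hits.
val table: H2OFrame = h2oContext.asH2OFrame(resultRDD)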

I have attached the error from the console below:

 17/06/26 20:25:49 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 43, 144.27.27.98, executor 1): java.lang.NoClassDefFoundError: Could not initialize class $line32.$read$
            at $line41.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:57)
            at $line41.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:57)
            at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
            at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
            at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
            at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
            at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
            at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
            at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
            at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
            at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1010)
            at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1009)
            at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
            at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
            at org.apache.spark.scheduler.Task.run(Task.scala:99)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:748)

17/06/26 20:25:49 ERROR TaskSetManager: Task 0 in stage 6.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 49, 144.27.27.98, executor 0): java.lang.NoClassDefFoundError: Could not initialize class 
        at $anonfun$1.apply(<console>:57)
        at $anonfun$1.apply(<console>:57)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1010)
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1009)
        at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
        at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1981)
  at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
  at org.apache.spark.h2o.utils.H2OSchemaUtils$.collectMaxArrays(H2OSchemaUtils.scala:229)
  at org.apache.spark.h2o.utils.H2OSchemaUtils$.expandedSchema(H2OSchemaUtils.scala:107)
  at org.apache.spark.h2o.converters.SparkDataFrameConverter$.toH2OFrame(SparkDataFrameConverter.scala:59)
  at org.apache.spark.h2o.H2OContext.asH2OFrame(H2OContext.scala:167)
  at org.apache.spark.h2o.H2OContextImplicits.asH2OFrameFromDataFrame(H2OContextImplicits.scala:54)
  ... 58 elided


Caused by: java.lang.NoClassDefFoundError: Could not initialize class 
  at $anonfun$1.apply(<console>:57)
  at $anonfun$1.apply(<console>:57)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1010)
  at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1009)
  at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
  at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)

Any ideas? Thanks in advance.

siv
  • I'm not sure what is there to "extend". That code works (parallelized) as it is! – eliasah Jul 01 '17 at 14:10
  • The error message suggests that there are some missing dependencies. – eliasah Jul 01 '17 at 14:11
  • What's the `spark-submit` you use to execute the app? I _suspect_ that is a CLASSPATH issue. – Jacek Laskowski Jul 01 '17 at 14:35
  • @JacekLaskowski I ran this in the shell (bin/sparkling-shell) - did not execute it as an app yet. And yes, I suspect the same, since it runs in local mode (single machine). What are the possible ways of fixing it, if that's the case? – siv Jul 01 '17 at 15:56
  • Unfortunately, I've got no idea how to fix it. Sorry. – Jacek Laskowski Jul 01 '17 at 16:11
