
I have the following code, which computes some metrics by cross-validation for a random forest classification.

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import scala.collection.parallel.mutable.ParArray

def run(data: RDD[LabeledPoint], metric: String = "PR") = {

    val cv_data:Array[(RDD[LabeledPoint], RDD[LabeledPoint])] = MLUtils.kFold(data, numFolds, 0)

    val result : Array[(Double, Double)] = cv_data.par.map{case (training, validation) =>
      training.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
      validation.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)

      val res: ParArray[(Double, Double)] = CV_params.par.map { p =>
        // Train a classifier for this parameter combination
        val model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo,
          p(0).asInstanceOf[Int], p(3).asInstanceOf[String], p(4).asInstanceOf[String],
          p(1).asInstanceOf[Int], p(2).asInstanceOf[Int])
        // Prediction on the validation fold: (prediction, label) pairs
        val predictionAndLabels: RDD[(Double, Double)] = validation.map(lp => (model.predict(lp.features), lp.label))
        // Metrics computation
        val bcm = new BinaryClassificationMetrics(predictionAndLabels)
        (bcm.areaUnderROC() / numFolds, bcm.areaUnderPR() / numFolds)
      }

      training.unpersist()
      validation.unpersist()
      res
    }.reduce((s1,s2) => s1.zip(s2).map(t => (t._1._1 + t._2._1, t._1._2 + t._2._2))).toArray

    val cv_roc = result.map(_._1)
    val cv_pr = result.map(_._2)

    // Extract best params
    val which_max = (metric match {
      case "ROC" => cv_roc
      case "PR" => cv_pr
      case _ =>
        logWarning("Metrics set to default one: PR")
        cv_pr
    }).zipWithIndex.maxBy(_._1)._2

    best_values_array = CV_params(which_max)
    CV_areaUnderROC = cv_roc
    CV_areaUnderPR = cv_pr
  }
}

val numTrees = Array(50)
val maxDepth = Array(30)
val maxBins = Array(100)
val featureSubsetStrategy = Array("sqrt")
val impurity = Array("gini")

val CV_params: Array[Array[Any]] = {
    for (a <- numTrees; b <- maxDepth; c <- maxBins; d <- featureSubsetStrategy;
         e <- impurity) yield Array(a, b, c, d, e)
}

run(data, "PR")

It runs on a YARN cluster with 50 containers (26 GB of memory in total). The data parameter is an RDD[LabeledPoint]. I use Kryo serialization and a default level of parallelism of 1000.
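For reference, the relevant Spark settings look roughly like this (just a sketch; only Kryo serialization and the parallelism level are described above, everything else such as executor count and memory is set at submit time):

import org.apache.spark.SparkConf

// Sketch of the settings mentioned above: Kryo serialization and a default parallelism of 1000
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.default.parallelism", "1000")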

For small datasets it works, but for my real data of 600 000 points I get the following error:

Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1841)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1533)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)

I can't figure out where the error comes from, because the total allocated memory (26 GB) is much higher than what is actually consumed during the job (I have checked in the Spark web UI).

Any help would be appreciated. Thank you!

  • Try to unpersist the models you train as well; you don't need them afterwards, otherwise they stay in memory. Once you get your best hyper-parameters by cross-validation, you can retrain with the params that fit best. – eliasah Jan 06 '16 at 09:17
  • How do you unpersist a model (i.e. a non-RDD object)? – Pop Jan 06 '16 at 09:23
  • My bad, actually you can't unpersist the model from an RF classifier, but just for a MatrixFactorizationModel... – eliasah Jan 06 '16 at 09:28
  • Have you tried using MEMORY_AND_DISK_SER as the StorageLevel? – jbrown Jan 06 '16 at 16:31
  • No. I will try. But I do not think the out-of-memory problem comes from the RDDs, as their total size is below 1 GB in my case. – Pop Jan 06 '16 at 16:41
  • Using MEMORY_AND_DISK_SER did not change anything, @jbrown. – Pop Jan 11 '16 at 08:37
  • what about your driver memory? – mgaido Jun 06 '16 at 12:36
  • It would be helpful to have more of the stack trace. I realize it probably repeats that java.io.ObjectOutputStream serialization stuff a bunch, but there may be more at the end of the stack. E.g., I'm wondering whether this happens during task serialization, or while processing task results, etc. – Imran Rashid Feb 07 '17 at 16:31

1 Answer


Just a guess, but one unusual thing you are doing is submitting many jobs in parallel with your call to .par. Note that Spark normally achieves parallelism a different way: you submit one job, and that job is broken into a number of tasks which can run in parallel.

There is nothing wrong, in principle, with what you are doing; it can be useful when the parallelism within one job is small, since in that case you would not be making effective use of the cluster by submitting one job at a time. On the other hand, just using .par may result in too many jobs being submitted in parallel. That convenience method keeps submitting jobs to try to keep the driver busy (to a first approximation anyway); but in fact, in Spark it is not unusual for the driver to be relatively idle while it waits for the cluster to do the heavy lifting. So while the driver may have plenty of CPU available, it is possible it is using a lot of memory just for the book-keeping required to prepare 1000 jobs simultaneously (I'm not sure how many jobs you are actually generating).

If you do want to submit jobs in parallel, it may help to limit it to a small number, e.g. only 2 or 4 jobs at a time; see the sketch below.
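One way to do that (a minimal sketch, assuming Scala 2.10/2.11-era parallel collections where ForkJoinTaskSupport takes a scala.concurrent.forkjoin.ForkJoinPool; on newer Scala versions pass a java.util.concurrent.ForkJoinPool instead) is to cap the thread pool backing the parallel collection before mapping over it:

import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// Allow at most 2 cross-validation folds to submit Spark jobs concurrently
val cvPar = cv_data.par
cvPar.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(2))
// ...then use cvPar.map { case (training, validation) => ... } in place of cv_data.par.map

The same trick can be applied to CV_params.par if you also want to bound the inner parallel loop.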
