
I'm trying to implement in Spark a Hadoop MapReduce job that worked fine before. The Spark application is defined as follows:

val data = spark.textFile(file, 2).cache()
val result = data
  .map(//some pre-processing)
  .map(docWeightPar => (docWeightPar(0), docWeightPar(1)))
  .flatMap(line => MyFunctions.combine(line))
  .reduceByKey( _ + _)

Where MyFunctions.combine is

def combine(tuples: Array[(String, String)]): IndexedSeq[(String,Double)] =
  for (i <- 0 to tuples.length - 2;
       j <- 1 to tuples.length - 1
  ) yield (toKey(tuples(i)._1,tuples(j)._1),tuples(i)._2.toDouble * tuples(j)._2.toDouble)

The combine function produces lots of map keys if the input list is big, and this is where the exception is thrown.

In the Hadoop MapReduce setting I didn't have problems, because the point where the combine function yields is the point where Hadoop wrote the map pairs to disk. Spark seems to keep everything in memory until it explodes with a java.lang.OutOfMemoryError: GC overhead limit exceeded.

I am probably doing something really basic wrong, but I couldn't find any pointers on how to move forward from this, so I would like to know how I can avoid it. Since I am a total noob at Scala and Spark, I am not sure if the problem comes from one or the other, or both. I am currently trying to run this program on my own laptop, and it works for inputs where the length of the tuples array is not very long.


5 Answers


Add the following JVM arg when you launch spark-shell or spark-submit:

-Dspark.executor.memory=6g

You may also consider explicitly setting the number of workers when you create an instance of SparkContext:

Distributed Cluster

Set the slave names in the conf/slaves file:

val sc = new SparkContext("master", "MyApp")
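
For reference, here is a minimal sketch of setting the executor memory programmatically through a SparkConf instead of a JVM argument; the master URL, app name and memory value are placeholders to adapt to your setup:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values: adjust the master URL and memory to your cluster or laptop.
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("local[4]")                 // e.g. "spark://master:7077" on a cluster
  .set("spark.executor.memory", "6g")    // equivalent of -Dspark.executor.memory=6g

val sc = new SparkContext(conf)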
  • The memory option does not help unfortunately. I still get the Out of Memory exception. – Augusto Dec 13 '14 at 19:20
  • After I added this option and changed the minimum partitions to 100 `spark.textFile(conceptsFile, 100).cache()`, it seems to run a lot longer but ends up dying with a `java.lang.OutOfMemoryError: Java heap space` – Augusto Dec 13 '14 at 20:00
  • If you change the spark.executor.memory=12g does it run even longer? How much memory do you have on the systems to allocate to the workers? You might want to add more workers - in the conf/slaves file - as well. – WestCoastProjects Dec 13 '14 at 20:17
  • hi @javadba it looks like it was actually the fact that I was trying to assemble the whole permutation array in memory that was killing it. Making the `combine` function return an iterator seemed to do the trick. Thanks for the time though! – Augusto Dec 13 '14 at 21:26
  • With the Spark shell 2.2, when I added `-Dspark.executor.memory=6g` I got "error: Invalid literal number" on "6g", running the script with `sshell -i script.scala`. – Peter Krauss Oct 21 '19 at 23:10

In the documentation (http://spark.apache.org/docs/latest/running-on-yarn.html) you can read how to configure the executors and the memory limit. For example:

--master yarn-cluster --num-executors 10 --executor-cores 3 --executor-memory 4g --driver-memory 5g  --conf spark.yarn.executor.memoryOverhead=409

The memoryOverhead should be about 10% of the executor memory.

Edit: Fixed 4096 to 409 (Comment below refers to this)

  • Shouldn't 10% of 4G be around 410(M)? – piggybox Sep 07 '16 at 01:22
  • Yes, it should be around 10%, sorry. So: spark.yarn.executor.memoryOverhead=409 – Carlos AG Dec 05 '16 at 11:16
  • The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead. – K.S. Jul 18 '19 at 23:31
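
For reference, on Spark 2.3+ the same submission would use the non-deprecated key mentioned in the last comment; a sketch reusing the sizes from the example above (my-app.jar is a placeholder):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 3 \
  --executor-memory 4g \
  --driver-memory 5g \
  --conf spark.executor.memoryOverhead=410m \
  my-app.jar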

Adjusting the memory is probably a good way to go, as has already been suggested, because this is an expensive operation that scales in an ugly way. But maybe some code changes will help.

You could take a different approach in your combine function that avoids the nested for comprehension by using the combinations function. I'd also convert the second element of each tuple to a Double before the combination operation:

tuples
  // Convert to doubles only once
  .map { x =>
    (x._1, x._2.toDouble)
  }
  // Take all pairwise combinations. Though this function
  // will not give self-pairs, which it looks like you might need
  .combinations(2)
  // Your operation
  .map { x =>
    (toKey(x(0)._1, x(1)._1), x(0)._2 * x(1)._2)
  }

This will give an iterator, which you can use downstream or, if you want, convert to list (or something) with toList.
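
Putting that together, a minimal sketch of a combine that returns an Iterator instead of building the whole IndexedSeq up front (toKey is assumed to be the questioner's existing key-building function):

// Sketch only: a lazy Iterator lets Spark consume the pairs without
// materialising the full permutation array in memory first.
def combine(tuples: Array[(String, String)]): Iterator[(String, Double)] =
  tuples
    .map(t => (t._1, t._2.toDouble))  // convert the weights to Double once
    .combinations(2)                  // lazy iterator over all pairs
    .map(pair => (toKey(pair(0)._1, pair(1)._1), pair(0)._2 * pair(1)._2))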

  • Hi @ohruunuruus, I think sliding does not provide the same behaviour as what I want to do. The `toKey` function simply combines the two strings. – Augusto Dec 13 '14 at 19:21
  • Hmm, alright. I'll have to take a look when I've got access to a spark-shell. Regardless, you might explore other ways of performing the combine operation. – mattsilver Dec 13 '14 at 19:24
  • Hi @ohruunuruus, that did the trick. I think returning the iterator is what actually solves it, because my for loop tries to assemble a huge array in memory and fails with out of memory. I knew my `for` loop didn't look very Scala-like, but I started learning yesterday, so that was what I could do. Thanks for the time! – Augusto Dec 13 '14 at 21:24

I had the same issue during a long regression fit. Caching the train and test sets solved my problem.

train_df, test_df = df3.randomSplit([0.8, 0.2], seed=142)
pipeline_model = pipeline_object.fit(train_df)

The pipeline_model line was giving java.lang.OutOfMemoryError: GC overhead limit exceeded. But when I used

train_df, test_df = df3.randomSplit([0.8, 0.2], seed=142)
train_df.cache()
test_df.cache()
pipeline_model = pipeline_object.fit(train_df)

It worked.


This JVM garbage collection error happened reproducibly in my case when I increased spark.memory.fraction to values greater than 0.6. So it is better to leave the value at its default to avoid JVM garbage collection errors. This is also recommended by https://forums.databricks.com/questions/2202/javalangoutofmemoryerror-gc-overhead-limit-exceede.html .

For more information on why 0.6 is the best value for spark.memory.fraction, see https://issues.apache.org/jira/browse/SPARK-15796 .
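
A minimal sketch of pinning the setting explicitly when building the session (0.6 is already the default from Spark 1.6 onwards, so this mainly guards against it being overridden elsewhere; the app name is a placeholder):

import org.apache.spark.sql.SparkSession

// Keep the unified memory fraction at its default; raising it above 0.6
// was what triggered the GC overhead errors described above.
val spark = SparkSession.builder()
  .appName("MyApp")
  .config("spark.memory.fraction", "0.6")
  .getOrCreate()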
