
I have an immutable map in my class. When I run my code in local mode, there is no problem and I can reach every key in the map. However, when I run it in cluster mode, the nodes throw an error about not finding a key in the map.

What I've tried so far:

- Broadcasting the immutable map over the cluster:

broadcast = sc.broadcast(my_immutable_map)

- Parallelizing the map as a pair RDD:

my_map_rdd = sc.parallelize(my_immutable_map.toSeq)
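
In both cases the lookup itself happens inside a map over another RDD, roughly like this (a simplified sketch; the full code is in the update below):

    val broadcast = sc.broadcast(my_immutable_map)
    val results = my_another_rdd.map { line =>
        val key = line.split(",")(0)
        broadcast.value(key)   // this is the lookup that fails on the cluster
    }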

When I examine the logs, I see a key-not-found exception. The error stacktrace is as follows:

Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 15.0 failed 4 times, most recent failure: Lost task 1.3 in stage 15.0 (TID 25, datanode1.big.com): java.util.NoSuchElementException: key not found: 905053199731
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:58)
    at havelsan.CDRGenerator$.generate_random_target(CDRGenerator.scala:95)
    at havelsan.CDRGenerator$$anonfun$main$2$$anonfun$6.apply(CDRGenerator.scala:167)
    at havelsan.CDRGenerator$$anonfun$main$2$$anonfun$6.apply(CDRGenerator.scala:165)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1197)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1197)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1197)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1205)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Can you explain how Spark distributes maps, and how it is possible that some nodes can't find certain keys in this map? By the way, my Spark version is 1.6.0.

What am I missing?

UPDATE

This part initializes the map on the driver:

...
    var pd = sc.textFile("hdfs://...")
    my_immutable_map = pd.map(line => line.split(":")).map { line => (line(0), line(1).split(",")) }.collectAsMap
...

    broadcast = sc.broadcast(my_immutable_map)
    my_map_rdd = sc.parallelize(my_immutable_map.toSeq)

And this is the part where I get the error:

def my_func(key: String): String = {
...
    my_value = broadcast.value(key)
...
}

my_func is called inside a map, as follows:

my_another_rdd.map { line =>
    val key = line.split(",")(0)
    my_func(key)
}
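
For what it's worth, broadcast.value(key) uses Map.apply, which throws NoSuchElementException for a missing key. A defensive variant using get would at least surface the bad key instead of crashing, though it only masks why the key is missing (my_func_safe and the returned strings are hypothetical):

    // Hypothetical defensive lookup: get returns an Option instead of throwing
    def my_func_safe(key: String): String = {
        broadcast.value.get(key) match {
            case Some(values) => values.mkString(",")   // illustrative use of the Array[String] value
            case None         => s"missing key: $key"   // reports the bad key instead of failing the task
        }
    }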
1 Answer


The solution I found is to pass the broadcast value to the function as a parameter. I still haven't found a solution for the parallelize approach.

https://stackoverflow.com/a/34912887/4668959
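
A minimal sketch of what I mean, using the names from the question (the Broadcast type parameter matches the map built there; the function body is abbreviated and illustrative):

    import org.apache.spark.broadcast.Broadcast

    // Pass the broadcast variable in explicitly instead of capturing it
    // from the enclosing class, so the closure serializes only what it needs.
    def my_func(key: String, bc: Broadcast[scala.collection.Map[String, Array[String]]]): String = {
        val my_value = bc.value(key)
        my_value.mkString(",")   // illustrative; the real body is in the question
    }

    my_another_rdd.map { line =>
        val key = line.split(",")(0)
        my_func(key, broadcast)
    }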
