
I have an immutable map in my class. When I run my code in local mode, there is no problem and I can reach every key in the map. However, when I run it in cluster mode, the nodes throw an error about not finding a key in the map.

What I've tried so far:

- Broadcasting the immutable map over the cluster:

broadcast = sc.broadcast(my_immutable_map)

- Parallelizing the map as a pair RDD:

my_map_rdd = sc.parallelize(my_immutable_map.toSeq)
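
In both cases the lookup itself happens inside a map over another RDD, roughly like this (a simplified sketch; the full code is in the update below):

    val broadcast = sc.broadcast(my_immutable_map)
    val results = my_another_rdd.map { line =>
        val key = line.split(",")(0)
        broadcast.value(key)   // this is the lookup that fails on the cluster
    }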

When I examine the logs, I see a key-not-found exception. The error stacktrace is as follows:

Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 15.0 failed 4 times, most recent failure: Lost task 1.3 in stage 15.0 (TID 25, datanode1.big.com): java.util.NoSuchElementException: key not found: 905053199731
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:58)
    at havelsan.CDRGenerator$.generate_random_target(CDRGenerator.scala:95)
    at havelsan.CDRGenerator$$anonfun$main$2$$anonfun$6.apply(CDRGenerator.scala:167)
    at havelsan.CDRGenerator$$anonfun$main$2$$anonfun$6.apply(CDRGenerator.scala:165)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1197)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1197)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1197)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1205)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Can you explain how Spark distributes maps, and how it is possible that some nodes can't find certain keys in this map? By the way, my Spark version is 1.6.0.

What am I missing?

UPDATE

This part initializes the map on the driver:

...
    var pd = sc.textFile("hdfs://...")
    my_immutable_map = pd.map(line => line.split(":")).map { line => (line(0), line(1).split(",")) }.collectAsMap
...

    broadcast = sc.broadcast(my_immutable_map)
    my_map_rdd = sc.parallelize(my_immutable_map.toSeq)

And this is the part where I get the error:

def my_func(key: String): String = {
...
    my_value = broadcast.value(key)
...
}

my_func is called inside a map, as follows:

my_another_rdd.map { line =>
    val key = line.split(",")(0)
    my_func(key)
}
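
For what it's worth, broadcast.value(key) uses Map.apply, which throws NoSuchElementException for a missing key. A defensive variant using get would at least surface the bad key instead of crashing, though it only masks why the key is missing (my_func_safe and the returned strings are hypothetical):

    // Hypothetical defensive lookup: get returns an Option instead of throwing
    def my_func_safe(key: String): String = {
        broadcast.value.get(key) match {
            case Some(values) => values.mkString(",")   // illustrative use of the Array[String] value
            case None         => s"missing key: $key"   // reports the bad key instead of failing the task
        }
    }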
1 Answer


The solution I found is to pass the broadcast value to the function as a parameter. I still haven't found a solution for the parallelize approach.

https://stackoverflow.com/a/34912887/4668959
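
A minimal sketch of what I mean, using the names from the question (the Broadcast type parameter matches the map built there; the function body is abbreviated and illustrative):

    import org.apache.spark.broadcast.Broadcast

    // Pass the broadcast variable in explicitly instead of capturing it
    // from the enclosing class, so the closure serializes only what it needs.
    def my_func(key: String, bc: Broadcast[scala.collection.Map[String, Array[String]]]): String = {
        val my_value = bc.value(key)
        my_value.mkString(",")   // illustrative; the real body is in the question
    }

    my_another_rdd.map { line =>
        val key = line.split(",")(0)
        my_func(key, broadcast)
    }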
