
I am running a Spark job on YARN, and my code is written in Java. I want to execute a function on every worker to collect some resources when that worker's work is finished.

I tried the mapPartitions() function, but many partitions run on the same worker, so the function gets executed several times there.

Is this possible, and if so, how?

Updated code:

    JavaRDD<String> sourceRDD = context.textFile(inputPath);
    sourceRDD.map(doSomething()); // every worker has its own environment; I want to execute a function on every worker when map() ends.
    doResourceCollect(); // this runs on the driver, so I can't get at each worker's environment.
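
For illustration, here is a minimal sketch of one possible workaround, assuming the Spark 2.x Java API: register a JVM shutdown hook once per executor from inside mapPartitions(), guarded by a static flag, so the cleanup runs once per executor JVM when it shuts down. The class name PerExecutorCleanup, the command-line arguments, and the bodies of doSomething() / doResourceCollect() are placeholders, not code from the question.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicBoolean;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class PerExecutorCleanup {

        // Loaded once per executor JVM, so the flag is shared by all tasks
        // (and therefore all partitions) running on that worker.
        private static final AtomicBoolean hookRegistered = new AtomicBoolean(false);

        public static void main(String[] args) {
            JavaSparkContext context =
                    new JavaSparkContext(new SparkConf().setAppName("per-executor-cleanup"));
            JavaRDD<String> sourceRDD = context.textFile(args[0]);

            JavaRDD<String> result = sourceRDD.mapPartitions(iterator -> {
                // Register the cleanup exactly once per executor JVM, no matter
                // how many partitions this worker ends up processing.
                if (hookRegistered.compareAndSet(false, true)) {
                    Runtime.getRuntime()
                           .addShutdownHook(new Thread(PerExecutorCleanup::doResourceCollect));
                }
                List<String> out = new ArrayList<>();
                while (iterator.hasNext()) {
                    out.add(doSomething(iterator.next()));
                }
                return out.iterator();
            });

            result.saveAsTextFile(args[1]);
            context.stop();
        }

        private static String doSomething(String line) {
            return line; // placeholder for the real per-record work
        }

        private static void doResourceCollect() {
            // placeholder: runs once on each executor when its JVM shuts down
        }
    }

Note the caveat: a shutdown hook fires only when the executor JVM exits (normally at the end of the application), not immediately after the last map task on that worker, and with dynamic allocation executors may come and go. Treat this as a sketch of the once-per-JVM guarding idea rather than a drop-in solution.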
yiming xie
  • Share some of your code... I think you could broadcast a variable to mapPartitions and use that value to, say, touch a file on the local Linux filesystem, check the exit code, and if it is OK do something, else skip this post-processing. – thebluephantom Dec 05 '18 at 14:58
  • @thebluephantom I know I can set a flag in the VM or OS to check whether I have already executed this function, but I don't know how to trigger it at the end of the task. You know, every job is split into several stages and tasks. – yiming xie Dec 06 '18 at 02:18
  • Try this: https://stackoverflow.com/questions/39947677/how-to-override-setup-and-cleanup-methods-in-spark-map-function. I think this is the closest approach as well, with the extra logic, since many partitions on the same worker are of course possible. – thebluephantom Dec 06 '18 at 10:48
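For illustration, a minimal sketch of the marker-file idea from the first comment, assuming a path on the worker's local filesystem that the executor can write to; the path /tmp/my-job-post-processing.done, the class name MarkerFileGuard, and the body of doResourceCollect() are made-up placeholders.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;

    public class MarkerFileGuard {

        // Call from the driver with the RDD to process.
        public static JavaRDD<String> process(JavaRDD<String> sourceRDD) {
            return sourceRDD.mapPartitions(iterator -> {
                List<String> out = new ArrayList<>();
                while (iterator.hasNext()) {
                    out.add(iterator.next()); // stand-in for the real doSomething()
                }
                // Marker file on this worker's local filesystem: createNewFile()
                // returns true only for the first partition that reaches this
                // point on the machine, so the post-processing below runs at
                // most once per worker host.
                File marker = new File("/tmp/my-job-post-processing.done"); // hypothetical path
                if (marker.createNewFile()) {
                    doResourceCollect();
                }
                return out.iterator();
            });
        }

        private static void doResourceCollect() {
            // placeholder for the per-worker resource collection
        }
    }

Trade-offs to keep in mind: this fires after the first finished partition on a host rather than the last, and the marker file has to be cleaned up between runs; the setup/cleanup link in the last comment addresses per-partition timing instead.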

0 Answers