
What is the best way to display the execution time of a multi-step map-reduce job?

I tried to set a self variable in the mapper init of step 1 of the job:

    def mapper_init_timer(self):
        self.start = time.process_time()

But when I try to read this in the reducer_final of step 2:

    def reducer_final_timer(self):
        # self.start is None here
        MRJob.set_status(self, "total time")

I can't figure out why the self variable is lost between steps. And if that is by design, then how can we calculate the execution time of an MRJob script in a way that also gives the correct result when run with -r hadoop?

Prabhash
  • You need the execution time of the entire job or each task? And display while it is running or once it is finished? – franklinsijo Feb 28 '17 at 19:01
  • I need to display the execution time of entire job (after job is finished). – Prabhash Feb 28 '17 at 19:05
  • Why not use Resourcemanager RestAPI? AFAIK, the job execution time is not exposed anywhere else. – franklinsijo Feb 28 '17 at 19:07
  • I am on a single node. Trying to find out if it is possible with Python/MrJob first before diving into more advanced/external methods. Surely there is more than one way to achieve this. – Prabhash Feb 28 '17 at 19:10
  • You cannot propagate a variable from mapper to reducer. One simple method would be to get the time value before invoking `run` and find the difference with the new time value once run is completed. But to find elapsed time of a job, REST API is the easiest method. – franklinsijo Feb 28 '17 at 19:14

1 Answer


The simplest way would be to get the time before and after invoking run() and find the difference:

    from datetime import datetime
    import sys

    if __name__ == '__main__':
        start_time = datetime.now()
        MRJobClass.run()
        end_time = datetime.now()
        elapsed_time = end_time - start_time
        # elapsed_time is a timedelta; convert to str before writing
        sys.stderr.write('%s\n' % elapsed_time)
franklinsijo
  • Already tried that. It runs for simple jobs but gives an error on multi-step jobs or jobs run with -r hadoop (error is like ... JSONDecodeError("Extra data", s, end) json.decoder.JSONDecodeError: Extra data ...). – Prabhash Feb 28 '17 at 20:09
  • And if this code block is removed the job runs otherwise? – franklinsijo Feb 28 '17 at 20:11
  • Yup. However, I am doing a print(elapsed_time) as the last line (after your code block). – Prabhash Feb 28 '17 at 20:14
  • Checked again. without the last line as print(elapsed_time), I am not getting an error. Then in this case how to display the value of elapsed_time? – Prabhash Feb 28 '17 at 20:17
  • I suspect `mrjob` is confused when it receives a stdout outside of mapper, reducer methods. Try `sys.stderr.write(elapsed_time)` – franklinsijo Feb 28 '17 at 20:18
  • 2
    Thank you, that worked correctly on both local and hadoop runners. You are right, MrJob was getting confused by unexpected text on stdout. As soon as the message was sent to stderr, it all worked fine. Thanks again. – Prabhash Mar 01 '17 at 10:17