I am using MRJob to run an iterative Hadoop program on Amazon's EMR.

Everything works fine (but slowly) when I'm not using the "--pool-emr-job-flows" option. When I use this option, I get the following error:

Traceback (most recent call last):
  File "ic_bfs_eval.py", line 297, in <module>
    res = main()
  File "ic_bfs_eval.py", line 262, in main
    frac, mr_rounds = bfs(db_name, T, samples, total_steps_cap)
  File "ic_bfs_eval.py", line 183, in bfs
    runner.run()
  File "/Library/Python/2.7/site-packages/mrjob-0.4.3_dev-py2.7.egg/mrjob/runner.py", line 620, in __exit__
    self.cleanup()
  File "/Library/Python/2.7/site-packages/mrjob-0.4.3_dev-py2.7.egg/mrjob/emr.py", line 987, in cleanup
    super(EMRJobRunner, self).cleanup(mode=mode)
  File "/Library/Python/2.7/site-packages/mrjob-0.4.3_dev-py2.7.egg/mrjob/runner.py", line 566, in cleanup
    self._cleanup_job()
  File "/Library/Python/2.7/site-packages/mrjob-0.4.3_dev-py2.7.egg/mrjob/emr.py", line 1061, in _cleanup_job
    self._opts['ec2_key_pair_file'])
  File "/Library/Python/2.7/site-packages/mrjob-0.4.3_dev-py2.7.egg/mrjob/ssh.py", line 209, in ssh_terminate_single_job
    num_jobs_match = HADOOP_JOB_LIST_NUM_RE.match(job_list_lines[0])
IndexError: list index out of range

I am initializing an MRJob like so:

mrJob2 = MRBFSSampleIter(args=["-c", "~/mrjob.conf",
                               "-r", "emr",
                               "--no-output",
                               "--output-dir", tmp_dir_out,
                               "--pool-emr-job-flows", tmp_dir_in])

Any ideas on why this is happening?

JoelO

1 Answer

This went away for me once I set up an SSH key pair. I think it's still a bug, since SSH is supposed to be optional, but the easiest workaround is to configure the key pair as described at http://mrjob.readthedocs.org/en/latest/guides/emr-quickstart.html#configuring-ssh-credentials
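As a rough sketch of that workaround, the SSH credentials can also be passed alongside the other arguments from the question. The `ec2_key_pair_file` option name appears in the traceback above; the `--ec2-key-pair` and `--ec2-key-pair-file` switches are what I remember from mrjob 0.4.x, so check them against your version. The helper `build_emr_args`, the key pair name `EMR`, the path `~/.ssh/EMR.pem`, and the S3 paths are all made up for illustration.

```python
import os


def build_emr_args(conf_path, input_path, output_dir, key_pair, key_pair_file):
    """Assemble mrjob arguments for a pooled EMR run with SSH credentials.

    Hypothetical helper: it only builds the argument list and does not
    import or call mrjob itself.
    """
    return [
        "-c", os.path.expanduser(conf_path),
        "-r", "emr",
        "--no-output",
        "--output-dir", output_dir,
        "--pool-emr-job-flows",
        # Name of the EC2 key pair as registered with AWS (assumed value).
        "--ec2-key-pair", key_pair,
        # Local path to the matching private key file (assumed value).
        "--ec2-key-pair-file", os.path.expanduser(key_pair_file),
        # Input path goes last, as a positional argument.
        input_path,
    ]


args = build_emr_args(
    conf_path="~/mrjob.conf",
    input_path="s3://example-bucket/in/",
    output_dir="s3://example-bucket/out/",
    key_pair="EMR",
    key_pair_file="~/.ssh/EMR.pem",
)
```

The resulting list would then be handed to the job as in the question, e.g. `MRBFSSampleIter(args=args)`. Setting `ec2_key_pair` and `ec2_key_pair_file` under the `emr` runner in `mrjob.conf`, as the linked guide describes, should have the same effect without touching the code.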

Dan O'Huiginn