
MRJob waits until each job completes before returning control to the user. I broke a large EMR step into smaller ones and would like to submit them all in one shot.

The docs talk about programmatically submitting tasks, but the sample code also waits for job completion, since it calls runner.run(), which blocks until the job finishes.

EMR also has a limit of 256 active jobs, so how do we go about filling those 256 slots, rather than looping and collecting the output on the attached console?

Pykler

1 Answer


After days of trying, the following is the best I could come up with.

My initial attempt, once I realised that a submitted job doesn't get culled when the terminal is detached, was to submit and then kill jobs from a bash script. That didn't work very well, though, because AWS throttles calls to EMR, so some of the jobs were killed before they had actually been submitted.

Current Best Solution

from jobs import MyMRJob
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)-15s %(levelname)-8s %(message)s',
)
log = logging.getLogger('submitjobs')

def main():
    cluster_id="x-MXMXMX"
    log.info('Cluster: %s', cluster_id)
    for i in range(10):
        n = '%04d' % i
        log.info('Adding job: %s', n)
        mr_job = MyMRJob(args=[
            '-r', 'emr',
            '--conf-path', 'mrjob.conf',
            '--no-output',
            '--output-dir', 's3://mybucket/mrjob/%s' % n,
            '--cluster-id', cluster_id,
            'input/file.%s' % n,
        ])
        # these lines belong inside the loop, so each job gets its own runner
        runner = mr_job.make_runner()
        # the following is the secret sauce: it submits the job and returns
        # immediately. It is a private method, though, so it may change
        # without notice.
        runner._launch()

if __name__ == '__main__':
    main()
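Since AWS throttles EMR API calls (which is what broke the bash-script approach above), rapid-fire submissions from this loop can still fail intermittently. A minimal sketch of a retry-with-exponential-backoff wrapper that could go around each `runner._launch()` call; the `with_backoff` helper and its parameters are my own illustration, not part of mrjob, and in practice you would narrow the `except` clause to the specific throttling exception your boto version raises:

```python
import random
import time

def with_backoff(fn, retries=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter.

    Intended for throttled AWS calls such as runner._launch().
    Catching bare Exception here is only for illustration.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; let the caller see the error
            # back off 1x, 2x, 4x, ... of base_delay, with random jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# In main() above, each submission would then become:
#   with_backoff(runner._launch)
```

Jitter matters here: without it, ten jobs that were throttled together all retry at the same instant and get throttled together again.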
Pykler