After days of trying, the following is the best I could come up with.
My initial attempt, once I realised that a submitted job doesn't get culled when the terminal is detached, was to submit and then kill jobs from a bash script. That didn't work well, though: AWS throttles calls to EMR, so some of the jobs were killed before they had even been submitted.
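Since EMR throttles API calls, wrapping the submission call in a retry with exponential backoff can soften that problem. A minimal sketch (the helper name and parameters are my own, not part of mrjob; in practice you would catch the specific throttling exception rather than a bare `Exception`):

```python
import time

def submit_with_retry(submit, attempts=5, base_delay=1.0):
    """Call submit(), retrying with exponential backoff on failure.

    submit     -- a zero-argument callable that performs the API call
    attempts   -- total number of tries before giving up
    base_delay -- initial sleep in seconds; doubles after each failure
    """
    for attempt in range(attempts):
        try:
            return submit()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries, re-raise the last error
            time.sleep(base_delay * (2 ** attempt))
```

In the script below you could then call `submit_with_retry(runner._launch)` instead of `runner._launch()`.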
Current Best Solution
from jobs import MyMRJob
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)-15s %(levelname)-8s %(message)s',
)
log = logging.getLogger('submitjobs')

def main():
    cluster_id = 'x-MXMXMX'
    log.info('Cluster: %s', cluster_id)
    for i in range(10):
        n = '%04d' % i
        log.info('Adding job: %s', n)
        mr_job = MyMRJob(args=[
            '-r', 'emr',
            '--conf-path', 'mrjob.conf',
            '--no-output',
            '--output-dir', 's3://mybucket/mrjob/%s' % n,
            '--cluster-id', cluster_id,
            'input/file.%s' % n,
        ])
        runner = mr_job.make_runner()
        # The following is the secret sauce: it submits the job and returns.
        # It is a private method, though, so it may change without notice.
        runner._launch()

if __name__ == '__main__':
    main()