I am using mrjob to run a Hadoop job on Elastic MapReduce, and it keeps crashing at seemingly random points.
The data looks like this (tab separated; the fields are origin ID, destination ID, minutes, and distance):
279391888 261151291 107.303163 35.468534
279391888 261115099 108.511726 35.503008
279391888 261151290 104.881560 35.278487
279391888 261151292 109.732004 35.659141
279391888 261266862 108.507754 35.434581
279391888 1687590146 59.118796 19.931201
279391888 269450882 58.909985 19.914108
And the underlying MapReduce job is very simple:
from mrjob.job import MRJob

import numpy as np


class CitypathsSummarize(MRJob):

    def mapper(self, _, line):
        orig, dest, minutes, dist = line.split()
        minutes = float(minutes)
        dist = float(dist)
        if minutes < .001:
            # flag degenerate rows with near-zero travel time
            yield "crap", 1
        else:
            yield orig, dist / minutes

    def reducer(self, orig, speeds):
        # average speed per origin
        speeds = list(speeds)
        mean = np.mean(speeds)
        yield orig, mean


if __name__ == "__main__":
    CitypathsSummarize.run()
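For local sanity checks I run it with mrjob's local runner on a small sample, along these lines (the sample filename is just a placeholder):

$ python summarize.py -r local chicago-sample.txt > sample-output.txt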
To run it for real, I use the following command with the default mrjob.conf (my AWS keys are set in the environment):
$ python summarize.py -r emr --ec2-instance-type c1.xlarge --num-ec2-instances 4 s3://citypaths/chicago-v4/ > chicago-v4-output.txt
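By "set in the environment" I mean the standard variables that mrjob's EMR runner reads (placeholder values here, obviously):

$ export AWS_ACCESS_KEY_ID=<my access key>
$ export AWS_SECRET_ACCESS_KEY=<my secret key>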
Runs like that on small data sets finish fine. When I run it on the whole corpus (about 10 GiB), I get errors like this (but not at the same point each time!):
Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-KCPTKZR5OX6D/task-attempts/attempt_201301211911_0001_m_000151_3/syslog):
java.io.FileNotFoundException: /mnt2/var/lib/hadoop/mapred/taskTracker/jobcache/job_201301211911_0001/attempt_201301211911_0001_m_000018_4/output/spill0.out
(while reading from s3://citypaths/chicago-v4/1613640660)
Terminating job flow: j-KCPTKZR5OX6D
Traceback (most recent call last):
  File "summarize.py", line 32, in <module>
    CitypathsSummarize.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 545, in run
    mr_job.execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 561, in execute
    self.run_job()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 631, in run_job
    runner.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 490, in run
    self._run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1048, in _run
    self._wait_for_job_to_complete()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1830, in _wait_for_job_to_complete
    raise Exception(msg)
Exception: Job on job flow j-KCPTKZR5OX6D failed with status SHUTTING_DOWN: Shut down as step failed
Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-KCPTKZR5OX6D/task-attempts/attempt_201301211911_0001_m_000151_3/syslog):
java.io.FileNotFoundException: /mnt2/var/lib/hadoop/mapred/taskTracker/jobcache/job_201301211911_0001/attempt_201301211911_0001_m_000018_4/output/spill0.out
(while reading from s3://citypaths/chicago-v4/1613640660)
I've run this twice; the first time it died after 45 minutes, the second time after four hours, and on a different input file each time. I've checked both of the files it died on and neither has any problems.
Somehow Hadoop is failing to find spill files that it wrote itself, which is confusing me.
EDIT:
I ran the job again; it died once more after a few hours, this time with a different error message.
Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-3GGW2TSIKKW5R/task-attempts/attempt_201301310511_0001_m_001810_0/syslog):
Status Code: 403, AWS Request ID: 9E9E748A55BC6A58, AWS Error Code: RequestTimeTooSkewed, AWS Error Message: The difference between the request time and the current time is too large., S3 Extended Request ID: Ky+HVYZ8RsC3l5f9N3LTwyorY9bbqEnc4tRD/r/xfAHYP/eiQrjjcpmIDNY2eoDo
(while reading from s3://citypaths/chicago-v4/1439606131)
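If RequestTimeTooSkewed means an instance's clock has drifted too far from S3's, I suppose I could force an NTP sync on each node with a bootstrap action before the job starts. An untested sketch (the script name and S3 path are hypothetical):

#!/bin/bash
# sync-clock.sh -- hypothetical bootstrap script, uploaded to something like
# s3://citypaths/bootstrap/sync-clock.sh and passed via mrjob's --bootstrap-action.
# Forces an NTP sync so the node's clock agrees with S3's.
sudo ntpdate -u pool.ntp.org

But I don't understand why the clocks would drift in the first place, or whether this has anything to do with the earlier spill-file errors.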