
I tried to run a MapReduce job on a cluster in Google Cloud Platform using the Python package mrjob, as follows:

python mr_script.py -r dataproc --cluster-id [CLUSTER-ID] [gs://DATAFILE_FOLDER] 

I can successfully run the very same script against the same data, with correct results, on Hadoop locally (with the -r hadoop option). However, the same job failed on Google Cloud Platform after around 1 hour with the following error message:

Waiting for job completion - sleeping 10.0 second(s)
link_stats-vagrant-20170430-021027-125936---step-00001-of-00001 => RUNNING
Waiting for job completion - sleeping 10.0 second(s)
link_stats-vagrant-20170430-021027-125936---step-00001-of-00001 => RUNNING
Waiting for job completion - sleeping 10.0 second(s)
link_stats-vagrant-20170430-021027-125936---step-00001-of-00001 => ERROR
Step 1 of 1 failed

Upon inspecting the log file on a worker node in Google Cloud Platform, I found the following messages in /var/log/hadoop-yarn/yarn-yarn-nodemanager-mrjob-w-13.log:

2017-04-30 02:58:48,213 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3497 for container-id container_1493517964776_0001_01_004115: 447.5 MB of 10 GB physical memory used; 9.9 GB of 21 GB virtual memory used
2017-04-30 02:58:48,217 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3097 for container-id container_1493517964776_0001_01_001385: 351.7 MB of 10 GB physical memory used; 9.9 GB of 21 GB virtual memory used
2017-04-30 02:58:51,222 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3773 for container-id container_1493517964776_0001_01_006384: 349.3 MB of 10 GB physical memory used; 9.9 GB of 21 GB virtual memory used
2017-04-30 02:58:51,226 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3660 for container-id container_1493517964776_0001_01_005935: 344.8 MB of 10 GB physical memory used; 9.9 GB of 21 GB virtual memory used
2017-04-30 02:58:51,230 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3497 for container-id container_1493517964776_0001_01_004115: 447.5 MB of 10 GB physical memory used; 9.9 GB of 21 GB virtual memory used
2017-04-30 02:58:51,234 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3097 for container-id container_1493517964776_0001_01_001385: 351.7 MB of 10 GB physical memory used; 9.9 GB of 21 GB virtual memory used
2017-04-30 02:58:52,803 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1493517964776_0001_000001 (auth:SIMPLE)
2017-04-30 02:58:52,809 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1493517964776_0001_01_001385
2017-04-30 02:58:52,809 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=10.142.0.20  OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1493517964776_0001    CONTAINERID=container_1493517964776_0001_01_001385
2017-04-30 02:58:52,809 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1493517964776_0001_01_001385 transitioned from RUNNING to KILLING
2017-04-30 02:58:52,809 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1493517964776_0001_01_001385
2017-04-30 02:58:52,810 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1493517964776_0001_000001 (auth:SIMPLE)
2017-04-30 02:58:52,812 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1493517964776_0001_01_004115
2017-04-30 02:58:52,812 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=10.142.0.20  OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1493517964776_0001    CONTAINERID=container_1493517964776_0001_01_004115
2017-04-30 02:58:52,815 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1493517964776_0001_01_001385 is : 143
2017-04-30 02:58:52,821 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1493517964776_0001_000001 (auth:SIMPLE)
2017-04-30 02:58:52,823 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1493517964776_0001_01_006384
2017-04-30 02:58:52,823 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=10.142.0.20  OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1493517964776_0001    CONTAINERID=container_1493517964776_0001_01_006384
2017-04-30 02:58:52,826 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1493517964776_0001_01_004115 transitioned from RUNNING to KILLING
2017-04-30 02:58:52,826 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1493517964776_0001_01_001385 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2017-04-30 02:58:52,826 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1493517964776_0001_01_006384 transitioned from RUNNING to KILLING
2017-04-30 02:58:52,826 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1493517964776_0001_01_004115

It appears my job was killed by the container manager; however, it does not look like it was killed because physical or virtual memory limits were exceeded (correct me if I'm wrong). I do see an exit code of 143.

Could you please advise why my job failed, and how I can fix it or modify any settings in mrjob to make the job run successfully (if it's indeed a memory issue)? Or where else should I check for more clues to debug this issue? Thanks!

EDIT: Here is the failed job report (under Cloud Dataproc -> Jobs; Status: Failed, Elapsed time: 48 min 41 sec):

Configuration:

Cluster:            mrjob
Job type:           Hadoop
Main class or jar:  file:///usr/lib/hadoop-mapreduce/hadoop-streaming.jar
Arguments:  
-files
gs://mrjob-us-east1-bb4f10dbae4d77dc/tmp/link_stats.vagrant.20170504.185601.135018/files/link_stats.py#link_stats.py
-mapper
python link_stats.py --step-num=0 --mapper
-reducer
python link_stats.py --step-num=0 --reducer
-input
gs://vc1/data/wikipedia/english
-output
gs://mrjob-us-east1-bb4f10dbae4d77dc/tmp/link_stats.vagrant.20170504.185601.135018/output/

Output:

17/05/04 19:40:52 INFO mapreduce.Job:  map 74% reduce 6%
17/05/04 19:41:42 INFO mapreduce.Job:  map 75% reduce 6%
17/05/04 19:41:42 INFO mapreduce.Job: Task Id : attempt_1493924193762_0001_m_000481_0, Status : FAILED
AttemptID:attempt_1493924193762_0001_m_000481_0 Timed out after 600 secs
17/05/04 19:41:42 INFO mapreduce.Job: Task Id : attempt_1493924193762_0001_m_000337_2, Status : FAILED
AttemptID:attempt_1493924193762_0001_m_000337_2 Timed out after 600 secs
17/05/04 19:41:43 INFO mapreduce.Job:  map 74% reduce 6%
17/05/04 19:41:45 INFO mapreduce.Job:  map 75% reduce 6%
17/05/04 19:42:12 INFO mapreduce.Job: Task Id : attempt_1493924193762_0001_m_000173_2, Status : FAILED
AttemptID:attempt_1493924193762_0001_m_000173_2 Timed out after 600 secs
17/05/04 19:42:40 INFO mapreduce.Job:  map 76% reduce 6%
17/05/04 19:43:26 INFO mapreduce.Job:  map 77% reduce 6%
17/05/04 19:44:16 INFO mapreduce.Job:  map 78% reduce 6%
17/05/04 19:44:42 INFO mapreduce.Job:  map 100% reduce 100%
17/05/04 19:44:47 INFO mapreduce.Job: Job job_1493924193762_0001 failed with state FAILED due to: Task failed task_1493924193762_0001_m_000161
Job failed as tasks failed. failedMaps:1 failedReduces:0

17/05/04 19:44:47 INFO mapreduce.Job: Counters: 45
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=110101249
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        GS: Number of bytes read=8815472899
        GS: Number of bytes written=0
        GS: Number of read operations=0
        GS: Number of large read operations=0
        GS: Number of write operations=0
        HDFS: Number of bytes read=57120
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=560
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters 
        Failed map tasks=39
        Killed map tasks=192
        Killed reduce tasks=196
        Launched map tasks=685
        Launched reduce tasks=48
        Other local map tasks=38
        Rack-local map tasks=647
        Total time spent by all maps in occupied slots (ms)=1015831401
        Total time spent by all reduces in occupied slots (ms)=559653642
        Total time spent by all map tasks (ms)=338610467
        Total time spent by all reduce tasks (ms)=93275607
        Total vcore-milliseconds taken by all map tasks=338610467
        Total vcore-milliseconds taken by all reduce tasks=186551214
        Total megabyte-milliseconds taken by all map tasks=1040211354624
        Total megabyte-milliseconds taken by all reduce tasks=573085329408
    Map-Reduce Framework
        Map input records=594556238
        Map output records=560
        Map output bytes=35862346
        Map output materialized bytes=36523515
        Input split bytes=57120
        Combine input records=0
        Spilled Records=560
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=115614
        CPU time spent (ms)=272576210
        Physical memory (bytes) snapshot=268956098560
        Virtual memory (bytes) snapshot=2461942099968
        Total committed heap usage (bytes)=244810514432
    File Input Format Counters 
        Bytes Read=8815472899
17/05/04 19:44:47 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!
Job output is complete
vkc
  • Please post a failed task log or application master log (found in the ResourceManager UI). – Paul Back May 02 '17 at 16:25
  • @PaulBack I just added the failed job log (see edit); I hope that helps. If this is not what you meant, could you please be more specific about where I can find that master log? I also checked "Stackdriver -> Logging -> Logs" for VM-level logs, but unfortunately there is no error logged. – vkc May 04 '17 at 20:23
  • See how there's a line 'Job failed as tasks failed. failedMaps:1 failedReduces:0'? There is a task log for that failed mapper in your Resource Manager and/or Job History Server that will contain a stacktrace that will help diagnose the root cause of the error. Please paste said stacktrace. – Paul Back May 04 '17 at 20:25
  • @PaulBack Sorry, I'm new to Google Cloud Platform; could you show me exactly how I can navigate from the top menu to the stacktrace you are referring to? – vkc May 04 '17 at 20:35

2 Answers


I'm only answering because I don't have enough reputation to comment, but you should try monitoring the status of each map task as it runs in the ResourceManager web UI.

It could be that one mapper is failing (e.g. a corrupted line causing an unhandled exception) while mapreduce.map.failures.maxpercent is 0 (the default), which causes the whole job to terminate; a defensive mapper like the sketch below can help rule this out.
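
As a minimal sketch of that pattern (the parsing below is purely hypothetical, since link_stats.py is not shown in the question), you can wrap the per-line work in a try/except and count bad records with a counter instead of letting an unhandled exception kill the task:

from mrjob.job import MRJob


class MRLinkStatsSafe(MRJob):
    # Illustrative only -- substitute the real parsing from link_stats.py.

    def mapper(self, _, line):
        try:
            # Hypothetical per-line parsing; replace with your actual logic.
            page, links = line.split('\t', 1)
            yield page, len(links.split())
        except Exception:
            # Count and skip malformed records rather than failing the task.
            self.increment_counter('warnings', 'bad_input_line', 1)

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRLinkStatsSafe.run()

The counter shows up in the job counters, so after a run you can see how many lines were skipped.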

Bill
  • I just added the failed job log (see edit); I hope that helps. If this is not what you meant, could you please be more specific about where I can find that master log? I also checked "Stackdriver -> Logging -> Logs" for VM-level logs, but unfortunately there is no error logged. – vkc May 04 '17 at 20:23
  • I'm sorry, I'm not familiar with DataProc as I do all my work on AWS. But here's some documentation on how to access the ResourceManager interface. https://cloud.google.com/dataproc/docs/concepts/cluster-web-interfaces . Through this interface, you will be able to view the status of each task as it runs and can access the logs for each task, which will give you insight as to why it's failing. This is how I do most of my debugging for MapReduce jobs. – Bill May 08 '17 at 14:44
  • Additionally, the 'timeout' error could be a result of running out of memory on your master node, so make sure you are using a large enough instance. But I recommend digging into the logs to find the exact cause. – Bill May 08 '17 at 14:49

Exit code 143 usually means the container was killed, most often because it exceeded its memory limits (effectively an out-of-memory error). You should use these options to set the memory sizes for your mappers and reducers, to figure out how much memory your application actually uses:

-Dmapreduce.map.memory.mb=1024 -Dmapreduce.reduce.memory.mb=1024
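
One way to experiment with this from mrjob (assuming the Dataproc runner forwards Hadoop properties the way the other runners do) is its --jobconf switch appended to the original command; the 2048 values here are just illustrative starting points, not recommendations:

python mr_script.py -r dataproc --cluster-id [CLUSTER-ID] \
    --jobconf mapreduce.map.memory.mb=2048 \
    --jobconf mapreduce.reduce.memory.mb=2048 \
    [gs://DATAFILE_FOLDER]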

Another consideration is how your data is split. Sometimes the data is skewed and one mapper receives 3x as many records as the rest. You can check for this by looking at your input folder and making sure all the files are approximately equal in size; a quick way to do that is sketched below.
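
As a rough sketch of that check (assuming the google-cloud-storage client library is available, and using the bucket and prefix from the job's -input argument), you could list the input file sizes and look for outliers:

from google.cloud import storage

# Bucket and prefix taken from the -input argument shown in the question.
client = storage.Client()
blobs = client.bucket('vc1').list_blobs(prefix='data/wikipedia/english')

# Print files largest first; one file far bigger than the rest suggests skew.
for blob in sorted(blobs, key=lambda b: b.size or 0, reverse=True):
    print('{:>10.1f} MB  {}'.format((blob.size or 0) / 1024.0 / 1024.0, blob.name))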

tk421