For a while now I've been going through the log4j
logs trying to determine why my Hadoop job is crashing.
Essentially, the job issues a command on the underlying machine and collects that command's output; at the moment all of these steps take place in a map task (later I'll try to reduce over the sum of those individual outputs).
The behaviour I'm experiencing is that, for a certain number of outputs written to the BufferedReader (for the sake of conversation, 28 of them), everything works fine and the job finishes almost immediately. However, when I increase that number to 29, the map task hangs at 67% completion, retries three times (always stopping at 67%), and finally terminates itself for lack of progress.
From the NameNode where the job is issued we can see the following output:
17/10/09 15:19:29 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/10/09 15:19:29 INFO input.FileInputFormat: Total input files to process : 1
17/10/09 15:19:30 INFO mapreduce.JobSubmitter: number of splits:1
17/10/09 15:19:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1507562353923_0001
17/10/09 15:19:30 INFO impl.YarnClientImpl: Submitted application application_1507562353923_0001
17/10/09 15:19:30 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1507562353923_0001/
17/10/09 15:19:30 INFO mapreduce.Job: Running job: job_1507562353923_0001
17/10/09 15:19:36 INFO mapreduce.Job: Job job_1507562353923_0001 running in uber mode : false
17/10/09 15:19:36 INFO mapreduce.Job: map 0% reduce 0%
17/10/09 15:19:53 INFO mapreduce.Job: map 67% reduce 0%
17/10/09 15:30:05 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_0, Status : FAILED
AttemptID:attempt_1507562353923_0001_m_000000_0 Timed out after 600 secs
Sent signal OUTPUT_THREAD_DUMP (SIGQUIT) to pid 6230 as user ubuntu for container container_1507562353923_0001_01_000002, result=success
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
17/10/09 15:30:06 INFO mapreduce.Job: map 0% reduce 0%
17/10/09 15:30:26 INFO mapreduce.Job: map 67% reduce 0%
17/10/09 15:40:36 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_1, Status : FAILED
AttemptID:attempt_1507562353923_0001_m_000000_1 Timed out after 600 secs
17/10/09 15:40:37 INFO mapreduce.Job: map 0% reduce 0%
17/10/09 15:40:52 INFO mapreduce.Job: map 67% reduce 0%
17/10/09 15:51:05 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_2, Status : FAILED
AttemptID:attempt_1507562353923_0001_m_000000_2 Timed out after 600 secs
17/10/09 15:51:06 INFO mapreduce.Job: map 0% reduce 0%
17/10/09 15:51:24 INFO mapreduce.Job: map 67% reduce 0%
17/10/09 16:01:37 INFO mapreduce.Job: map 100% reduce 100%
17/10/09 16:01:37 INFO mapreduce.Job: Job job_1507562353923_0001 failed with state FAILED due to: Task failed task_1507562353923_0001_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
17/10/09 16:01:37 INFO mapreduce.Job: Counters: 13
Job Counters
Failed map tasks=4
Killed reduce tasks=1
Launched map tasks=4
Other local map tasks=3
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5025782
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=2512891
Total time spent by all reduce tasks (ms)=0
Total vcore-milliseconds taken by all map tasks=2512891
Total vcore-milliseconds taken by all reduce tasks=0
Total megabyte-milliseconds taken by all map tasks=10292801536
Total megabyte-milliseconds taken by all reduce tasks=0
To determine the root cause of this problem I've spent many hours meticulously poring over the log4j output logs, but they haven't turned up anything conclusive: they look almost identical for successful and unsuccessful jobs, with the caveat that unsuccessful jobs terminate abruptly. The most interesting part seems to be the following, from one of the earlier attempts (Hadoop always makes three tries at completing the task):
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 214
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 450
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 217
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opStop, Time: 165
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 1977; bufvoid = 104857600
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
(&container_1507562353923_0001_01_000005???stderr0stdout105912017-10-09 16:01:35
Full thread dump OpenJDK 64-Bit Server VM (25.131-b11 mixed mode):
The word "Full thread" lead me to believe that my MapOutputBuffer
was filling up, as described in this article, I tried the recommended to solution of increasing the size in my configuration files, i.e. increasing mapreduce.task.io.sort.mb
, to 1000
, and mapreduce.map.sort.spill.percent
to 0.99
, but it had no positive effect.
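For reference, the same overrides can also be applied per job from the driver instead of in the cluster config files. A minimal sketch (the class and job names here are arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SortBufferConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same overrides I made in the config files, expressed per-job:
        conf.setInt("mapreduce.task.io.sort.mb", 1000);           // sort buffer, in MB
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.99f); // spill threshold
        Job job = Job.getInstance(conf, "opcode count");
        // ... mapper/reducer/input/output setup as usual, then job.waitForCompletion(true)
    }
}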
The next thing I tried was to change the program so that, instead of emitting the debug/sanity-check output through the log4j Logger, it writes that output to a file, as described here, but that also didn't work.
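A sketch of that idea, for concreteness (this helper is meant to sit inside TokenizerMapper; the /debug base path and the dumpPairs name are my own, not from the linked description):

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Dumps the collected (opcode, time) pairs to a side file on HDFS named
// after the task attempt, so each retry writes its own file.
private void dumpPairs(Context context, List<Map.Entry<String, Integer>> pairList)
        throws IOException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    Path debugPath = new Path("/debug/" + context.getTaskAttemptID() + ".log");
    try (FSDataOutputStream out = fs.create(debugPath, true)) { // overwrite if present
        for (Map.Entry<String, Integer> entry : pairList) {
            out.writeBytes("Code: " + entry.getKey() + ", Time: " + entry.getValue() + "\n");
        }
    }
}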
At the moment my ideas of what could be wrong with this program are as follows:
1) MapOutputBuffer is filling up, causing the program to stall out.
Something I should mention: it's important for each file to execute as an atomic map task. If the command from the file gets broken up into smaller component pieces, the operation it issues to the underlying OS won't make sense. Therefore:
2) The input file is broken up inside HDFS and is not read in as one sequential unit, in the correct order.
3) The command is split between containers, with part of it issued in one container while another container handles the rest, thereby causing the issue I just described.
So, since the logs have been inconclusive, my next goal is to change my code in some way that lets me isolate, test for, and eliminate each of these possibilities. Any ideas on how to do that efficiently, or any and all comments, insights, or remarks about anything else I've written here, would be much appreciated. For possibilities 2) and 3), one concrete check I've thought of is sketched below.
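To rule out 2) and 3), I could make the input file unsplittable, so that a file is always handled start-to-finish by a single mapper in a single container. A minimal sketch, assuming I subclass the stock TextInputFormat (the class name WholeFileTextInputFormat is mine):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Forces each input file into a single split, so one mapper (and therefore
// one container) always sees the whole file, in order.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

The driver would then call job.setInputFormatClass(WholeFileTextInputFormat.class);. Note this only guarantees the file isn't divided between mappers; map() is still invoked once per record within that single task.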
Here's how the map job is written at the moment:
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // declare logger
    private final Logger LOG = org.apache.log4j.Logger.getLogger(this.getClass());

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // output data struct
        List<Map.Entry<String, Integer>> pairList = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            try {
                // command execution
                Runtime rt = Runtime.getRuntime();
                String evmDir = "/home/ubuntu/go/src/github.com/ethereum/go-ethereum/build/bin/evm";
                String command = evmDir + " --debug --code " + value.toString() + " run";
                Process proc = Runtime.getRuntime().exec(command);
                LOG.warn(command);
                BufferedReader stdInput = new BufferedReader(new InputStreamReader(proc.getInputStream()));
                BufferedReader stdError = new BufferedReader(new InputStreamReader(proc.getErrorStream()));
                // define and initialise representation to hold 'evm' command output
                ArrayList<String> consoleOutput = new ArrayList<String>();
                ArrayList<String> debugOutput = new ArrayList<String>();
                String s = null;
                while ((s = stdInput.readLine()) != null) {
                    consoleOutput.add(s);
                }
                while ((s = stdError.readLine()) != null) {
                    debugOutput.add(s);
                }
                for (String p : consoleOutput) {
                    Pattern pattern = Pattern.compile("([A-Za-z]+)([ \t]+)(\\d+)");
                    Matcher matcher = pattern.matcher(p);
                    while (matcher.find()) {
                        String opcodeName = matcher.group(1);
                        Integer executionStepTime = Integer.valueOf(matcher.group(3));
                        // add component pieces of line to output data structure
                        pairList.add(new AbstractMap.SimpleEntry<>(opcodeName, executionStepTime));
                    }
                }
            } catch (IOException e) {
                //LOG.warn(e);
                LOG.warn("Exception Encountered! " + e);
            }
            // log output for debugging
            for (Map.Entry<String, Integer> entry : pairList) {
                String opcodeRep = entry.getKey().toString();
                Integer stepTime = entry.getValue();
                LOG.warn("Code: " + opcodeRep + ", Time: " + stepTime);
            }
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
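One thing I now suspect after writing this up: map() reads stdInput to exhaustion before it ever touches stdError, so if the evm process writes more than a pipe buffer's worth of --debug output to stderr first, both processes could block forever waiting on each other, which might explain why things break only past a certain output size. A minimal sketch of how I could test this outside Hadoop by draining both streams concurrently and bounding the wait (the 60-second timeout and the example bytecode are arbitrary placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class EvmDrainTest {

    // Reads a stream to exhaustion on its own thread so neither of the
    // child process's pipes can fill up and block it.
    private static Thread drain(InputStream in, List<String> sink) {
        Thread t = new Thread(() -> {
            try (BufferedReader r = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = r.readLine()) != null) {
                    sink.add(line);
                }
            } catch (IOException ignored) {
                // the stream closes when the process exits
            }
        });
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        String evmDir = "/home/ubuntu/go/src/github.com/ethereum/go-ethereum/build/bin/evm";
        String code = args.length > 0 ? args[0] : "6001600201"; // hypothetical bytecode
        Process proc = Runtime.getRuntime().exec(evmDir + " --debug --code " + code + " run");
        List<String> consoleOutput = new ArrayList<>();
        List<String> debugOutput = new ArrayList<>();
        Thread tOut = drain(proc.getInputStream(), consoleOutput);
        Thread tErr = drain(proc.getErrorStream(), debugOutput);
        // Bound the wait so a hang surfaces here, not as a 600-sec task timeout.
        if (!proc.waitFor(60, TimeUnit.SECONDS)) {
            proc.destroyForcibly();
            System.err.println("evm did not exit within 60s");
        }
        tOut.join();
        tErr.join();
        System.out.println("stdout: " + consoleOutput.size() + " lines, stderr: "
                + debugOutput.size() + " lines");
    }
}

If this runs fine standalone for the 29-output case, the subprocess handling is probably not the culprit and I can focus on the HDFS/split possibilities above.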
Maybe this additional information from the logs is pertinent to correctly diagnosing this issue:
"VM Thread" os_prio=0 tid=0x00007f0a1007a000 nid=0x7328 runnable
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f0a10026000 nid=0x7326 runnable
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f0a10027800 nid=0x7327 runnable
"VM Periodic Task Thread" os_prio=0 tid=0x00007f0a100bc800 nid=0x732f waiting on condition
JNI global references: 277
Heap
PSYoungGen total 100352K, used 7502K [0x0000000780000000, 0x0000000788a00000, 0x00000007c0000000)
eden space 94720K, 2% used [0x0000000780000000,0x0000000780216d78,0x0000000785c80000)
from space 5632K, 95% used [0x0000000788480000,0x00000007889bcbc8,0x0000000788a00000)
to space 8192K, 0% used [0x0000000787a00000,0x0000000787a00000,0x0000000788200000)
ParOldGen total 143360K, used 113628K [0x0000000700000000, 0x0000000708c00000, 0x0000000780000000)
object space 143360K, 79% used [0x0000000700000000,0x0000000706ef71d0,0x0000000708c00000)
Metaspace used 25981K, capacity 26224K, committed 26496K, reserved 1073152K
class space used 3019K, capacity 3083K, committed 3200K, reserved 1048576K