
I am carrying out several Hadoop tests using the TestDFSIO and TeraSort benchmark tools. Essentially, I am testing with different numbers of datanodes in order to assess the linearity of processing capacity and datanode scalability.

During this process, I have obviously had to restart the whole Hadoop environment several times. Every time I restart Hadoop, all MapReduce jobs are removed and the job counter starts again from "job_2013*_0001". For comparison purposes, it is very important for me to keep all the MapReduce jobs that I have previously launched. So, my question is:

How can I prevent Hadoop from removing all MapReduce job history after it is restarted? Is there some property to control job removal when the Hadoop environment restarts?

Thanks!

VikBar

1 Answer


The MR job history logs are not deleted right away after you restart Hadoop. The job counter will restart from *_0001, though, and only jobs started after the restart will be displayed on the ResourceManager web portal. In fact, there are two log-related settings in the YARN defaults:

# this is where you can find the MR job history logs
yarn.nodemanager.log-dirs = ${yarn.log.dir}/userlogs 

# this is how long the history logs will be retained
yarn.nodemanager.log.retain-seconds = 10800
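If you want the logs kept longer than the default three hours, you could override the retention period in yarn-site.xml. A minimal sketch (the property name comes from yarn-default.xml; the value of 604800 seconds, i.e. seven days, is just an illustrative choice):

```
<!-- yarn-site.xml: keep node-manager logs for 7 days (example value) -->
<property>
  <name>yarn.nodemanager.log.retain-seconds</name>
  <value>604800</value>
</property>
```

This setting only applies when log aggregation is disabled; restart the NodeManagers after changing it so the new value takes effect.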

and the default ${yarn.log.dir} is defined in $HADOOP_HOME/etc/hadoop/yarn-env.sh:

YARN_LOG_DIR="$HADOOP_YARN_HOME/logs"

BTW, similar settings can also be found in mapred-env.sh if you are using Hadoop 1.X.

zhutoulala
  • Thank you Zhutoulala for your answer. I thought nobody would give me an answer. I will test your suggestion! – VikBar Jan 23 '14 at 17:47