I don't think _SUCCESS is a directory; it is a file, and the job history files reside inside the _logs directory.
First of all, TextOutputFormat.setOutputPath(job1, new Path(outputPath));
is important because when the job runs, Hadoop uses this path as a working directory in which to create temporary files for the various tasks (the _temporary directory). The _temporary directory and its files are deleted at the end of the job. The _SUCCESS file and the history directory are what remain in the working directory after the job has finished successfully. The _SUCCESS file is essentially a flag saying the job actually ran successfully. Please look at this link.
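As an aside, if your only goal is to stop _SUCCESS from being created, there is a configuration switch for it; a sketch, assuming your Hadoop version reads the mapreduce.fileoutputcommitter.marksuccessfuljobs key (set it in mapred-site.xml or on the job's Configuration):

```xml
<!-- Assumption: your version's FileOutputCommitter consults this key
     before writing the _SUCCESS marker file. -->
<property>
  <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
  <value>false</value>
</property>
```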
The creation of your _SUCCESS file is done by the TextOutputFormat
class you are actually using, which in turn uses the FileOutputCommitter
class. That FileOutputCommitter class defines the following --
public static final String SUCCEEDED_FILE_NAME = "_SUCCESS";

/**
 * Delete the temporary directory, including all of the work directories.
 * This is called for all jobs whose final run state is SUCCEEDED.
 * @param context the job's context.
 */
public void commitJob(JobContext context) throws IOException {
  // delete the _temporary folder
  cleanupJob(context);
  // check if the o/p dir should be marked
  if (shouldMarkOutputDir(context.getConfiguration())) {
    // create a _success file in the o/p folder
    markOutputDirSuccessful(context);
  }
}

// Mark the output dir of the job for which the context is passed.
private void markOutputDirSuccessful(JobContext context)
    throws IOException {
  if (outputPath != null) {
    FileSystem fileSys = outputPath.getFileSystem(context.getConfiguration());
    if (fileSys.exists(outputPath)) {
      // create a file in the folder to mark it
      Path filePath = new Path(outputPath, SUCCEEDED_FILE_NAME);
      fileSys.create(filePath).close();
    }
  }
}
Since markOutputDirSuccessful() is private, you have to override commitJob() instead to bypass the SUCCEEDED_FILE_NAME creation process and achieve what you want.
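A minimal sketch of that override, assuming the new-API org.apache.hadoop.mapreduce classes (the class name SuccessFreeCommitter is my own, hypothetical; you would plug it in via a small output-format subclass whose getOutputCommitter() returns it):

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

// Hypothetical committer: commits the job but never writes _SUCCESS.
public class SuccessFreeCommitter extends FileOutputCommitter {

  public SuccessFreeCommitter(Path outputPath, TaskAttemptContext context)
      throws IOException {
    super(outputPath, context);
  }

  @Override
  public void commitJob(JobContext context) throws IOException {
    // Only clean up the _temporary folder; deliberately skip the
    // markOutputDirSuccessful() step that would create _SUCCESS.
    cleanupJob(context);
  }
}
```

This requires the Hadoop MapReduce client jars on the classpath; treat it as a sketch under those assumptions, not a drop-in implementation.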
The next item, the _logs directory, is very important if you later want to use Hadoop's HistoryViewer to get a report of how the job ran.
I think, when you use the same output directory as input to another job, the _SUCCESS file and the _logs directory will get ignored due to the default hidden-path filter set in Hadoop, which skips any path whose name starts with an underscore or a dot.
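To see why, here is the gist of that default filter reproduced as plain Java -- a sketch of the rule FileInputFormat applies, not the Hadoop source itself:

```java
// Mimics the default hidden-path filter used when listing input files:
// any path whose final component starts with "_" or "." is skipped.
public class HiddenPathFilter {

  public static boolean accept(String pathName) {
    // Keep only the final path component before testing its first character.
    int slash = pathName.lastIndexOf('/');
    String name = slash >= 0 ? pathName.substring(slash + 1) : pathName;
    return !name.startsWith("_") && !name.startsWith(".");
  }

  public static void main(String[] args) {
    System.out.println(accept("/out/part-r-00000")); // true
    System.out.println(accept("/out/_SUCCESS"));     // false
    System.out.println(accept("/out/_logs"));        // false
  }
}
```

So regular part files pass through while _SUCCESS and _logs are silently dropped from the next job's input.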
Moreover, when you define a named output for MultipleOutputs, you can instead write to a subdirectory inside the output path you set in TextOutputFormat.setOutputPath(), and then use that subdirectory as input to the next job you'll be running.
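For instance, with the new-API MultipleOutputs, passing a baseOutputPath that contains a slash places the files under a subdirectory of the job's output path -- a sketch, where the named output "text" and the subdirectory name "filtered" are purely illustrative:

```java
// Inside the reducer, assuming mos is a MultipleOutputs<K, V> created in
// setup(), and "text" was registered on the driver with
// MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, ...).
mos.write("text", key, value, "filtered/part");
// Files then land under <outputPath>/filtered/, which you can pass
// directly as the input path of the next job.
```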
That said, I don't actually see how _SUCCESS and _logs would ever bother you.
Thanks