
I have code that writes multiple outputs using org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.

The Reducer writes the results to a pre-created location, so I don't need the default output directory (which contains the _history and _SUCCESS directories).

I have to delete them every time before running my job again.

So I removed the TextOutputFormat.setOutputPath(job1, new Path(outputPath)); line. But this gives me the (expected) error: org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.

Driver Class:

MultipleOutputs.addNamedOutput(job1, "path1", TextOutputFormat.class, Text.class, LongWritable.class);
MultipleOutputs.addNamedOutput(job1, "path2", TextOutputFormat.class, Text.class, LongWritable.class);
LazyOutputFormat.setOutputFormatClass(job1, TextOutputFormat.class);

Reducer Class:

if(condition1)
    mos.write("path1", key, new LongWritable(value), path_list[0]);
else
    mos.write("path2", key, new LongWritable(value), path_list[1]);

Is there a workaround to avoid specifying a default output directory?

Suvarna Pattayil

3 Answers


First, _SUCCESS is not a directory but a file, and the history directory resides inside the _logs directory.

The TextOutputFormat.setOutputPath(job1, new Path(outputPath)); line is important because when the job runs, Hadoop takes this path as a work directory and creates temporary files there for the different tasks (the _temporary dir). The _temporary directory and its files eventually get deleted at the end of the job. The _SUCCESS file and the history directory are what remain under the work directory after the job has finished successfully; the _SUCCESS file is a kind of flag saying the job actually ran successfully. Please look at this link.

The _SUCCESS file is created by the TextOutputFormat class you are using, which in turn uses the FileOutputCommitter class. That FileOutputCommitter class defines the following:

public static final String SUCCEEDED_FILE_NAME = "_SUCCESS";

/**
 * Delete the temporary directory, including all of the work directories.
 * This is called for all jobs whose final run state is SUCCEEDED.
 * @param context the job's context.
 */
public void commitJob(JobContext context) throws IOException {
  // delete the _temporary folder
  cleanupJob(context);
  // check if the o/p dir should be marked
  if (shouldMarkOutputDir(context.getConfiguration())) {
    // create a _SUCCESS file in the o/p folder
    markOutputDirSuccessful(context);
  }
}

// Mark the output dir of the job for which the context is passed.
private void markOutputDirSuccessful(JobContext context) throws IOException {
  if (outputPath != null) {
    FileSystem fileSys = outputPath.getFileSystem(context.getConfiguration());
    if (fileSys.exists(outputPath)) {
      // create a file in the folder to mark it
      Path filePath = new Path(outputPath, SUCCEEDED_FILE_NAME);
      fileSys.create(filePath).close();
    }
  }
}

Since markOutputDirSuccessful() is private, you have to override commitJob() instead to bypass the SUCCEEDED_FILE_NAME creation process and achieve what you want.
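
Here's a minimal sketch of that override (the class name is mine, and it assumes the new mapreduce API you are already using):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical name: a TextOutputFormat whose committer skips the _SUCCESS marker.
public class NoSuccessMarkerOutputFormat<K, V> extends TextOutputFormat<K, V> {

    @Override
    public synchronized OutputCommitter getOutputCommitter(TaskAttemptContext context)
            throws IOException {
        Path output = getOutputPath(context);
        return new FileOutputCommitter(output, context) {
            @Override
            public void commitJob(JobContext jobContext) throws IOException {
                // Delete the _temporary directory, as the default committer does...
                cleanupJob(jobContext);
                // ...but never call markOutputDirSuccessful(), so no _SUCCESS file.
            }
        };
    }
}

You would then register it in your driver in place of TextOutputFormat:

LazyOutputFormat.setOutputFormatClass(job1, NoSuccessMarkerOutputFormat.class);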

The next directory, _logs, is very important if you later want to use the Hadoop HistoryViewer to actually get a report of how the job ran.
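
For example, on a Hadoop 1.x installation the history kept under _logs can be rendered from the command line (the path argument is your job's output directory):

hadoop job -history <job-output-dir>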

I think that when you use the same output directory as input to another job, the _SUCCESS file and the _logs directory will be ignored thanks to the hidden-path filter in Hadoop (FileInputFormat skips paths whose names start with _ or .).

Moreover, when you define a named output for MultipleOutputs, you can write to a subdirectory inside the output path you set in TextOutputFormat.setOutputPath(), and then use that subdirectory as the input to the next job you run.
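
A sketch of that in your Reducer (the "/part" file-name base and job2 are illustrative):

// Write each named output under a subdirectory of the job's output path.
if (condition1)
    mos.write("path1", key, new LongWritable(value), "path1/part");
else
    mos.write("path2", key, new LongWritable(value), "path2/part");

// The next job can then read just one of those subdirectories:
FileInputFormat.addInputPath(job2, new Path(outputPath, "path1"));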

That said, I don't actually see how _SUCCESS and _logs would ever bother you.

Thanks

SSaikia_JtheRocker

The question is pretty old, but I'm still sharing an answer, since it suits the scenario in the question well.

Define your OutputFormat to say you are not expecting any output. You can do it this way:

import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
job.setOutputFormatClass(NullOutputFormat.class);

or

you could also probably use LazyOutputFormat:

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat; 
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

Credits @charlesmenguy

Arun A K

What version of Hadoop are you running?

For a quick workaround, you could set a throwaway output location programmatically and call FileSystem.delete to remove it when the job completes.
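
A sketch of that, assuming your named outputs go to absolute paths elsewhere (the scratch location here is illustrative):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Satisfy the "Output directory not set" check with a throwaway location.
Path scratch = new Path("/tmp/" + job1.getJobName() + "-out");
TextOutputFormat.setOutputPath(job1, scratch);

boolean success = job1.waitForCompletion(true);

// Remove the default output directory (_SUCCESS, _logs, ...) after the run.
FileSystem fs = scratch.getFileSystem(job1.getConfiguration());
fs.delete(scratch, true); // true = recursive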

joews