
The Problem

I'm using a custom StoreFunc, OutputFormat, and OutputCommitter with Pig. The problem is that Pig never calls the methods on my OutputFormat that return the RecordWriter and OutputCommitter, so the data gets written somewhere else (I'm honestly not sure where) rather than the intended destination. Pig doesn't throw any errors during the job.

A simple example Pig script:

data = LOAD '<url>' USING com.company.CompanyLoader();
STORE data INTO '<other url>' USING com.company.CompanyStorage();

Here's what I know:

  • Pig calls the getOutputFormat method on the StoreFunc. So that's good.
  • Pig calls checkOutputSpecs on the OutputFormat. Great.
  • Pig does not call getOutputCommitter or getRecordWriter on the OutputFormat. In checkOutputSpecs I log which OutputFormat and OutputCommitter classes the job is currently configured with, and the results are org.apache.hadoop.mapreduce.lib.output.TextOutputFormat and org.apache.hadoop.mapred.DirectFileOutputCommitter. I'm not sure this means Pig is actually using either of these, since Pig may substitute my own OutputFormat and OutputCommitter after calling the appropriate methods on my StoreFunc/OutputFormat.

I'm not explicitly setting an OutputFormat or OutputCommitter anywhere in PIG_OPTS. I suspect that Pig is picking up the OutputFormat and related classes from somewhere else, but I'm out of ideas for where to check.

Can anyone lend any insight or further debugging steps in this matter?
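One debugging step I can sketch (this is generic Java, not Pig-specific, and the key names in it are only examples): dump every configuration entry whose key mentions "output" to see where TextOutputFormat is being injected. On a real job you would iterate job.getConfiguration() inside setStoreLocation or checkOutputSpecs, since Hadoop's Configuration implements Iterable<Map.Entry<String, String>>; the sketch below models the configuration with a plain Map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConfDump {

    // Collect every entry whose key mentions "output". With Hadoop on the
    // classpath the same loop works directly over job.getConfiguration().
    static Map<String, String> outputRelated(Map<String, String> conf) {
        Map<String, String> hits = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().toLowerCase().contains("output")) {
                hits.put(e.getKey(), e.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Example keys only -- the exact names vary by Hadoop version.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("mapreduce.job.outputformat.class",
                "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat");
        conf.put("mapred.output.committer.class",
                "org.apache.hadoop.mapred.DirectFileOutputCommitter");
        conf.put("mapreduce.job.maps", "4");
        outputRelated(conf).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Logging this once on the front end and once from a task attempt would at least show whether the stray TextOutputFormat setting is present before my StoreFunc ever runs.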

Example Code

This isn't my actual code, but this demonstrates how I've set things up. I don't modify the configuration while running.

MyStoreFunc.java

public class MyStoreFunc extends StoreFunc {

    private final CustomOutputFormat outputFormat = new CustomOutputFormat();
    private RecordWriter out;

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        LOG.info("getOutputFormat called.");
        return outputFormat;
    }

    @Override
    public void prepareToWrite(final RecordWriter writer) throws IOException {
        out = writer;
        LOG.info("Using RecordWriter: " + writer.getClass());

        // other preparation
    }

    @Override
    public void setStoreLocation(final String location, final Job job) {
        try {
            LOG.info("Output format class is set to: " + job.getOutputFormatClass());
        } catch (ClassNotFoundException e) {
            LOG.info("Output format class is undefined.");
        }
        LOG.info("Output committer is " + job.getConfiguration().get("mapred.output.committer.class", "undefined"));
        // other preparation
    }

    // other stuff...
}
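
One thing I've tried to rule out (a guess on my part, not something the logs confirm): if a method that is meant to override a StoreFunc or OutputFormat method has a slightly different signature, Java treats it as an overload, and the framework silently dispatches to the base-class implementation instead. Annotating every intended override with @Override turns that mistake into a compile error. A minimal, Pig-free demo of the difference:

```java
class Base {
    String format() { return "base"; }
}

class Child extends Base {
    // Signature matches, so this genuinely overrides Base.format().
    @Override
    String format() { return "child"; }

    // Looks similar but takes an argument, so it is a separate overload;
    // putting @Override here would be a compile error.
    String format(String hint) { return "child+" + hint; }
}

public class OverrideDemo {
    // Callers holding the base type only ever reach true overrides.
    static String callThroughBase(Base b) {
        return b.format();
    }

    public static void main(String[] args) {
        System.out.println(callThroughBase(new Child())); // prints "child"
    }
}
```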

CustomOutputFormat.java

public class CustomOutputFormat<K, V> extends OutputFormat<K, V> {

    public CustomOutputFormat() {
        LOG.info("CustomOutputFormat created.");
    }

    @Override
    public void checkOutputSpecs(final JobContext context) throws IOException {
        LOG.info("checkOutputSpecs called.");
        try {
            LOG.info("output format = " + context.getOutputFormatClass());
        } catch (ClassNotFoundException e) {
            LOG.info("output format not found.");
        }
        // Check some stuff in configuration
    }

    @Override
    public OutputCommitter getOutputCommitter(final TaskAttemptContext ctx) {
        LOG.info("getOutputCommitter called.");
        return new CustomOutputCommitter();
    }

    @Override
    public RecordWriter<K, V> getRecordWriter(final TaskAttemptContext ctx) {
        LOG.info("getRecordWriter called.");
        return new CustomRecordWriter<K, V>();
    }

}
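
For context, my understanding of the Hadoop side (a sketch of framework behavior, not Pig internals, with hypothetical class names): tasks do not reuse the OutputFormat instance created on the front end. Each task attempt reflectively constructs a fresh instance of whatever class name the job configuration carries, roughly like ReflectionUtils.newInstance does. That would mean the class name in the configuration, not the object my StoreFunc returned, decides whose getRecordWriter and getOutputCommitter actually run, and it is why a public no-arg constructor is required. Simplified, reflection-only demo:

```java
public class ReflectiveLoadDemo {

    // Stand-in for a custom OutputFormat with the required no-arg constructor.
    public static class FakeFormat {
        public String who() { return "FakeFormat"; }
    }

    // Roughly what the task side does with the class name it finds in the
    // job configuration (simplified; no Hadoop Configuration involved here).
    static Object instantiate(String className) throws Exception {
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Object fmt = instantiate("ReflectiveLoadDemo$FakeFormat");
        System.out.println(((FakeFormat) fmt).who()); // prints FakeFormat
    }
}
```

If this model is right, it would also explain why "CustomOutputFormat created." shows up more than once in my logs: each phase builds its own instance.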

Example Logs

The log output coming from the custom classes looks like this:

2015-06-30 00:08:32,100 [main] INFO    CustomOutputFormat created.
2015-06-30 00:08:32,104 [main] INFO    Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,110 [main] INFO    Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,120 [main] INFO    getOutputFormat called.
2015-06-30 00:08:32,124 [main] INFO    checkOutputSpecs called.
2015-06-30 00:08:32,135 [main] INFO    output format = class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,140 [main] INFO    Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,152 [main] INFO    Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,154 [JobControl] INFO    CustomOutputFormat created.
2015-06-30 00:08:32,156 [JobControl] INFO    Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,159 [JobControl] INFO    Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,166 [JobControl] INFO    getOutputFormat called.
2015-06-30 00:08:32,169 [JobControl] INFO    checkOutputSpecs called.
2015-06-30 00:08:32,175 [JobControl] INFO    output format = class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

Things that I notice:

  • getOutputCommitter and getRecordWriter are never called on CustomOutputFormat. Why?
  • prepareToWrite is never called on MyStoreFunc. What?
  • checkOutputSpecs is called on CustomOutputFormat, so it's clear that Pig "knows about" this class and is getting it from MyStoreFunc.

Thank you in advance.
