The Problem
I'm using a custom StoreFunc, OutputFormat, and OutputCommitter for use with Pig. The problem I'm having is that Pig isn't calling some of the methods I've defined on the OutputFormat that return the appropriate RecordWriter and OutputCommitter. As a result, the data is written somewhere else (I'm honestly not sure where) rather than to the intended destination. Pig doesn't throw any errors during the job.
A simple example Pig script:
data = LOAD '<url>' USING com.company.CompanyLoader();
STORE data INTO '<other url>' USING com.company.CompanyStorage();
Here's what I know:
- Pig calls the getOutputFormat method on the StoreFunc. So that's good.
- Pig calls checkOutputSpecs on the OutputFormat. Great.
- Pig does not call getOutputCommitter nor getRecordWriter on the OutputFormat.

I'm logging which OutputFormat and OutputCommitter Pig is currently using in the checkOutputSpecs method, and the results are org.apache.hadoop.mapreduce.lib.output.TextOutputFormat and org.apache.hadoop.mapred.DirectFileOutputCommitter. I'm not sure this means that Pig is actually using either of these, since I figure Pig may substitute them with my own OutputFormat and OutputCommitter after calling the appropriate methods on my StoreFunc/OutputFormat.
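To make my expectation concrete: my mental model is that Pig wraps the StoreFunc's OutputFormat and delegates to it for the writer and the committer. Here's a toy sketch of that delegation pattern in plain Java (no Hadoop or Pig classes involved; every name here is hypothetical and is not Pig's actual internals):

```java
// Toy model of what I *expect* to happen: a wrapper output format should
// forward getRecordWriter()/getOutputCommitter() to the user-supplied
// format, not silently fall back to a default like TextOutputFormat.
interface SimpleOutputFormat {
    String getRecordWriter();
    String getOutputCommitter();
}

// Stands in for my CustomOutputFormat.
class UserOutputFormat implements SimpleOutputFormat {
    public String getRecordWriter()    { return "CustomRecordWriter"; }
    public String getOutputCommitter() { return "CustomOutputCommitter"; }
}

// Stands in for whatever wrapper Pig uses internally: it must delegate
// both calls to the user format rather than substituting its own.
class DelegatingOutputFormat implements SimpleOutputFormat {
    private final SimpleOutputFormat delegate;

    DelegatingOutputFormat(final SimpleOutputFormat delegate) {
        this.delegate = delegate;
    }

    public String getRecordWriter()    { return delegate.getRecordWriter(); }
    public String getOutputCommitter() { return delegate.getOutputCommitter(); }
}
```

In my job, the equivalent of `delegate.getRecordWriter()` and `delegate.getOutputCommitter()` never happens, which is the behavior I can't explain.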
I'm not explicitly setting an OutputFormat or OutputCommitter anywhere in PIG_OPTS. I suspect that Pig is picking up an OutputFormat or related classes from somewhere else, but I'm out of ideas for where to check.
Can anyone lend any insight or further debugging steps in this matter?
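One debugging step I've considered but not yet tried (the property name is the one that shows up in my logs; the committer class name below is hypothetical) is forcing the committer explicitly via PIG_OPTS and raising Pig's log level:

```shell
# Untested idea: pin the committer that my logs show defaulting to
# DirectFileOutputCommitter, then run Pig at DEBUG level to see where
# the defaults are being resolved. Class and script names are made up.
export PIG_OPTS="-Dmapred.output.committer.class=com.company.CustomOutputCommitter"
pig -d DEBUG myscript.pig
```

I'm not even sure Pig honors that property, so insight on that would help too.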
Example Code
This isn't my actual code, but this demonstrates how I've set things up. I don't modify the configuration while running.
MyStoreFunc.java
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.pig.StoreFunc;

public class MyStoreFunc extends StoreFunc {
    private static final Log LOG = LogFactory.getLog(MyStoreFunc.class);

    private final CustomOutputFormat outputFormat = new CustomOutputFormat();
    private RecordWriter out;

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        LOG.info("getOutputFormat called.");
        return outputFormat;
    }

    @Override
    public void prepareToWrite(final RecordWriter writer) throws IOException {
        out = writer;
        LOG.info("Using RecordWriter: " + writer.getClass());
        // other preparation
    }

    @Override
    public void setStoreLocation(final String location, final Job job) {
        try {
            LOG.info("Output format class is set to: " + job.getOutputFormatClass());
        } catch (ClassNotFoundException e) {
            LOG.info("Output format class is undefined.");
        }
        LOG.info("Output committer is "
            + job.getConfiguration().get("mapred.output.committer.class", "undefined"));
        // other preparation
    }

    // other stuff...
}
CustomOutputFormat.java
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class CustomOutputFormat<K, V> extends OutputFormat<K, V> {
    private static final Log LOG = LogFactory.getLog(CustomOutputFormat.class);

    public CustomOutputFormat() {
        LOG.info("CustomOutputFormat created.");
    }

    @Override
    public void checkOutputSpecs(final JobContext context) throws IOException {
        LOG.info("checkOutputSpecs called.");
        try {
            LOG.info("output format = " + context.getOutputFormatClass());
        } catch (ClassNotFoundException e) {
            LOG.info("output format not found.");
        }
        // Check some stuff in configuration
    }

    @Override
    public OutputCommitter getOutputCommitter(final TaskAttemptContext ctx) {
        LOG.info("getOutputCommitter called.");
        return new CustomOutputCommitter();
    }

    @Override
    public RecordWriter<K, V> getRecordWriter(final TaskAttemptContext ctx) {
        LOG.info("getRecordWriter called.");
        return new CustomRecordWriter<K, V>();
    }
}
Example Logs
The log output coming from the custom classes looks like this:
2015-06-30 00:08:32,100 [main] INFO CustomOutputFormat created.
2015-06-30 00:08:32,104 [main] INFO Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,110 [main] INFO Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,120 [main] INFO getOutputFormat called.
2015-06-30 00:08:32,124 [main] INFO checkOutputSpecs called.
2015-06-30 00:08:32,135 [main] INFO output format=class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,140 [main] INFO Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,152 [main] INFO Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,154 [JobControl] INFO CustomOutputFormat created.
2015-06-30 00:08:32,156 [JobControl] INFO Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,159 [JobControl] INFO Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,166 [JobControl] INFO getOutputFormat called.
2015-06-30 00:08:32,169 [JobControl] INFO checkOutputSpecs called.
2015-06-30 00:08:32,175 [JobControl] INFO output format=class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
Things that I notice:
- getOutputCommitter and getRecordWriter are never called on CustomOutputFormat. Why?
- prepareToWrite is never called on MyStoreFunc. What?
- checkOutputSpecs is called on CustomOutputFormat, so it's clear that Pig "knows about" this class and is getting it from MyStoreFunc.
Thank you in advance.