
Follow-up to Pig: Force UDF to occur in Reducer or set number of mappers. I have a UDF that runs as a map step in my Pig workflow. It takes X input files, one per reducer that saved them in a prior step. I want X mappers (one per input file) to run this UDF, because it is very time-consuming and Pig isn't running it as parallel as I'd like. Based on Hadoop streaming: single file or multi file per map. Don't Split, I figured the solution was to prevent splitting, so I made a Pig LoadFunc like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat;
import org.apache.pig.builtin.PigStorage;

// Loader that behaves like PigStorage but hands back a non-splittable input format
public class ForceMapperPerInputFile extends PigStorage {
    @Override
    public InputFormat getInputFormat() {
        return new MapperPerFileInputFormat();
    }
}

// Text input format that refuses to split files, so each file becomes a single split
class MapperPerFileInputFormat extends PigTextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
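
For context, the loader is meant to be used roughly like this (the jar name, path, schema, and UDF name below are placeholders, not my actual ones):

REGISTER myudfs.jar;
-- one input file per mapper (once splitting is prevented), each line fed to the slow UDF
raw = LOAD '/prior/step/output/part-*' USING ForceMapperPerInputFile() AS (line:chararray);
out = FOREACH raw GENERATE myudfs.SlowUDF(line);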

When I used this, it had the exact opposite effect of what I wanted: the number of mapper tasks decreased by nearly half.

How can I actually force exactly one mapper per input file?


1 Answer

SET pig.noSplitCombination true;

(or -Dpig.noSplitCombination=true as one of the command-line options when running the script)
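
For example, a minimal sketch of how this might be applied (the script name, path, and schema are illustrative, not from the question):

SET pig.noSplitCombination true;
-- with split combination disabled, each non-splittable input file gets its own map task
raw = LOAD '/prior/step/output/part-*' USING ForceMapperPerInputFile() AS (line:chararray);

or, on the command line (the -D option typically has to come before the other arguments to the pig command):

pig -Dpig.noSplitCombination=true myscript.pig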

  • Apparently my input files were sufficiently skewed that this made things worse, but thanks, this did force one mapper per input file. – Manny Apr 05 '13 at 20:39
  • Doesn't this result in one mapper per block though? For files larger than the block size, you'd still get more than 1 mapper right? How would I go about this if my files are for instance 5GB and I want to force them into one mapper? And I don't want to set the maxCombinedSplitSize to an arbitrary number... – Pieter-Jan Mar 25 '14 at 15:45