
I have a Pig script that runs a very time-consuming UDF. Pig appears to be running the UDF as a map job instead of a reduce job, and as a result a suboptimally small number of mappers is created to run it. I know I can set the default number of reducers in Pig with setDefaultParallel, and I can use the PARALLEL x clause in Pig Latin to set the number of reducers for a given statement. But what do I do to set the number of mappers? I've seen posts about increasing the mapper count by defining my own InputSplit size, but I want to set the number of mappers explicitly to number of hosts * number of cores; file size shouldn't have anything to do with it.

If I can't control the number of mappers, is there any way to force my UDF to run as a reducer, since I can control those?
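For reference, the reducer-side controls mentioned above look roughly like this in a script (a minimal sketch; the aliases and the value 40 are just placeholders):

SET default_parallel 40;           -- script-wide default number of reducers (Pig Latin counterpart of setDefaultParallel)

B = GROUP A BY key PARALLEL 40;    -- reducer count for this one statement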

Manny

3 Answers

  1. No, you cannot specify the number of mappers explicitly, simply because Hadoop doesn't work that way. The number of mappers created is roughly total input size / input split size, though that can get skewed if you have tons of small files (which is discouraged because of how HDFS works). So basically, Pig doesn't let you do it because Hadoop doesn't offer that option in the first place.
  2. No, not with Pig explicitly anyway, also because "it doesn't work that way". Pig compiles and optimizes things for you; the output is a stream of MR jobs. Any hack you use to force the UDF into a reducer can easily break when the next version of Pig comes out. If you feel you really need the UDF in a reducer, you can build a custom MR job jar, implement a pass-through mapper in it, and do your work in the reducer. You call that from Pig with the MAPREDUCE command (a sketch follows this list). However, the solution sounds wrong, and it's possible you're misunderstanding something. You can look at what forces a reduce in Pig to get the big picture: DISTINCT, LIMIT and ORDER always do, and GROUP usually does as well. A JOIN will usually get both a mapper and a reducer. As you can see, the ops that force a reduce are the ones that leverage some intrinsic characteristic of Hadoop (like ORDER running in the reduce phase because reducer input gets sorted). There is no easy way to sneak a UDF in there, since no type of UDF (eval, filter, load, store) fits naturally into a reducer.
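For completeness, a rough sketch of that MAPREDUCE call (the jar name, class, paths, and schema below are made-up placeholders, not part of the original answer):

A = LOAD 'input_data' AS (line:chararray);

-- hand A to a custom MR job: its mapper just passes records through,
-- and the expensive UDF logic lives in its reducer
B = MAPREDUCE 'myjob.jar'
        STORE A INTO 'mr_input_dir'
        LOAD 'mr_output_dir' AS (key:chararray, result:chararray)
        `com.example.ReduceSideJob mr_input_dir mr_output_dir`;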
TC1

You can get some control over spawning more mappers using "mapred.max.split.size". Splitting only works for certain input formats and compression formats; GZ input, for example, is not splittable. Pig can also combine smaller input files into a single split; a sketch of the relevant properties follows.
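A minimal sketch of those knobs inside a Pig script (the sizes are example values; mapred.max.split.size is the classic Hadoop property name, so check what your Hadoop version expects):

SET mapred.max.split.size 67108864;       -- cap each split at 64 MB so more mappers get spawned
SET pig.splitCombination true;            -- allow Pig to combine small input files
SET pig.maxCombinedSplitSize 134217728;   -- combined splits grow up to 128 MB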

satish

As of the current Pig version, this trick has always worked for me: the GENERATE in a nested FOREACH, after a DISTINCT, LIMIT or ORDER, always runs as a reducer. For example:

A = FOREACH (GROUP DATA BY some_unique_field) {    -- or group by all fields
    top = LIMIT DATA 1;                             -- nested LIMIT forces a reduce phase
    GENERATE udf.func(top.field);
};

This also removes all duplicate rows from the data.

pratiklodha