I have a Pig script that runs a very time-consuming UDF. Pig appears to be running the UDF in the map phase rather than the reduce phase, and as a result a suboptimally small number of mappers is created for the job. I know I can set the default number of reducers in Pig with setDefaultParallel, and I can use the PARALLEL x clause in Pig Latin to set the number of reducers for a given statement. But what do I do to set the number of mappers? I've seen posts about increasing the mapper count by defining my own InputSplit size, but I want to set the number of mappers explicitly to number of hosts * number of cores; file size shouldn't have anything to do with it.
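For reference, these are the reducer-side controls I mean (relation and key names here are illustrative):

```pig
-- script-wide default reducer count
SET default_parallel 16;

-- or per-statement, on a reduce-side operator
grouped = GROUP data BY key PARALLEL 16;
```

Neither of these has any effect on the map phase, which is the part I need to scale.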
If I can't control the number of mappers, is there any way to force my UDF to run in the reduce phase, since I can control the number of reducers?
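Something along these lines is what I have in mind, assuming a reduce-side operator like GROUP would pull the UDF invocation into the reduce phase (SlowUDF and the key are hypothetical placeholders):

```pig
-- group first so the following FOREACH runs on the reduce side,
-- where PARALLEL controls the task count
grouped = GROUP data BY key PARALLEL 16;
result  = FOREACH grouped GENERATE FLATTEN(myudfs.SlowUDF(data));
```

Is this a reasonable approach, or is there a cleaner way?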