I am trying to determine whether there are hooks available in the Hadoop API (Hadoop 2.0.0 MRv1) to handle data skew for a reducer.

Scenario: I have a custom composite key and partitioner in place to route data to reducers. To deal with the odd, but in my data quite likely, case of a million keys and large values ending up on the same reducer, I need some sort of heuristic so that this data can be further partitioned to spawn off new reducers.

I am thinking of a two-step process:
- set mapred.max.reduce.failures.percent to, say, 10%, so that reducers which hit the skewed keys can fail while the job as a whole still completes
- rerun the job on the failed data set, passing a configuration flag through the driver that causes my partitioner to randomly partition the skewed data. The partitioner will implement the Configurable interface (a rough sketch follows this list).
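For clarity, here is a minimal sketch of the kind of partitioner I have in mind. The class name and the `skew.partition.random` flag are placeholders of mine, not anything in the Hadoop API; my real composite key would replace Text:

```java
import java.util.Random;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Placeholder name; the framework calls setConf() on instantiation
// because the class implements Configurable.
public class SkewAwarePartitioner extends Partitioner<Text, Text>
        implements Configurable {

    private Configuration conf;
    private boolean randomize;
    private final Random random = new Random();

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        // The driver sets this flag only on the rerun over the failed data set.
        this.randomize = conf.getBoolean("skew.partition.random", false);
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        if (randomize) {
            // Second pass: scatter the skewed keys across all reducers.
            return random.nextInt(numPartitions);
        }
        // First pass: normal hash routing on the composite key.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

In the driver it would be something along these lines, with the two settings belonging to the two separate runs:

```java
Configuration conf = new Configuration();
// First run: let up to 10% of reduce tasks fail without failing the job.
conf.setInt("mapred.max.reduce.failures.percent", 10);
// Rerun over the failed data set: flip my placeholder flag so the
// partitioner scatters the skewed keys.
conf.setBoolean("skew.partition.random", true);
```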
Is there a better way, or another way of doing this?
A possible counter-solution might be to write the mappers' output to HDFS and spin off another map-only job that does the work of the reducer, but I do not want to put extra pressure on the NameNode with the intermediate files.
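For reference, the map-only follow-up would look roughly like the sketch below; the identity Mapper is a stand-in for whatever class would do the reducer's aggregation work, and the paths come from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SkewMapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "skew-map-only-pass");
        job.setJarByClass(SkewMapOnlyDriver.class);
        // Identity mapper as a placeholder for the reducer-style aggregation.
        job.setMapperClass(Mapper.class);
        // No reduce phase at all: map output goes straight to HDFS,
        // which is where the NameNode pressure comes from.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // first job's output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```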