
I am trying to determine whether there are hooks available in the Hadoop API (Hadoop 2.0.0 MRv1) to handle data skew for a reducer. Scenario: I have a custom composite key and partitioner in place to route data to reducers. To deal with the odd but very likely case of a million keys with large values ending up on the same reducer, I need some sort of heuristic so that this data can be further partitioned and spawned off to new reducers. I am thinking of a two-step process:

  1. set mapred.max.reduce.failures.percent to, say, 10% and let the job complete
  2. rerun the job on the failed data set, passing a configuration through the driver that causes my partitioner to randomly partition the skewed data. The partitioner will implement the Configurable interface (a sketch follows below).
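A minimal sketch of how the Configurable partitioner in step 2 could look, assuming plain Text keys for brevity (the real job would use the custom composite key) and a made-up property name skew.random.partition that the driver sets only for the re-run:

    import java.util.Random;

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class SkewAwarePartitioner extends Partitioner<Text, Text>
            implements Configurable {

        private Configuration conf;
        private boolean scatterSkewedData;
        private final Random random = new Random();

        @Override
        public void setConf(Configuration conf) {
            this.conf = conf;
            // Hypothetical flag: the driver sets it only for the re-run over the failed data set.
            this.scatterSkewedData = conf.getBoolean("skew.random.partition", false);
        }

        @Override
        public Configuration getConf() {
            return conf;
        }

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            if (scatterSkewedData) {
                // Scatter records across all reducers; this breaks the guarantee that
                // all values of a key reach one reducer, so the logical grouping has
                // to be restored downstream.
                return random.nextInt(numPartitions);
            }
            // Normal path: hash-partition on the (natural) key.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }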

Is there a better way / another way?

A possible counter-solution may be to write out the mappers' output and spin off another map-only job that does the work of the reducer, but I do not want to put pressure on the NameNode.
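For reference, the mechanism for such a map-only follow-up job is just setNumReduceTasks(0); a minimal driver sketch, using the identity Mapper as a placeholder for a mapper that would do the reducer-style work over the first job's output:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyFollowUp {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only follow-up");
            job.setJarByClass(MapOnlyFollowUp.class);
            // Identity Mapper as a placeholder; in practice this mapper would do
            // the work the reducer used to do, reading the first job's output.
            job.setMapperClass(Mapper.class);
            job.setNumReduceTasks(0);   // map-only: mapper output goes straight to HDFS
            FileInputFormat.addInputPath(job, new Path(args[0]));   // first job's output dir
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }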

sunny

2 Answers


This idea comes to mind; I am not sure how good it is.

Let's say you are currently running the job with 10 reducers and it is failing because of the data skew. The idea is: you set the number of reducers to 15 and also define the maximum number of (key, value) pairs that should go to one reducer from each mapper. You keep that count in a hash map in your custom partitioner class. Once a particular reducer reaches the limit, you start sending the next set of (key, value) pairs to another reducer from the extra 5 reducers we have kept for handling the skew.
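A rough sketch of that idea, assuming plain Text keys and values, 10 "primary" reducers plus 5 overflow reducers (so the job is submitted with 15 reduce tasks), and an arbitrary per-mapper cap; note the counts are per map task, since each mapper gets its own partitioner instance:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class CappedPartitioner extends Partitioner<Text, Text> {

        private static final int PRIMARY_REDUCERS = 10;     // reducers used by normal hashing
        private static final int MAX_PER_REDUCER  = 100000; // cap per reducer from this map task

        // How many records this map task has already routed to each primary partition.
        private final Map<Integer, Integer> sentCounts = new HashMap<Integer, Integer>();

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            int primary = (key.hashCode() & Integer.MAX_VALUE) % PRIMARY_REDUCERS;
            Integer seen = sentCounts.get(primary);
            int count = (seen == null) ? 0 : seen;

            if (count < MAX_PER_REDUCER || numPartitions <= PRIMARY_REDUCERS) {
                sentCounts.put(primary, count + 1);
                return primary;
            }
            // Cap reached: spill further records for this partition into one of the
            // extra reducers (partitions PRIMARY_REDUCERS .. numPartitions - 1).
            int overflowSlots = numPartitions - PRIMARY_REDUCERS;
            return PRIMARY_REDUCERS + (primary % overflowSlots);
        }
    }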

YoungHobbit
  • Yes, one problem with this is that if my data happens to be evenly split up, so the records end up in different mappers each under the max number, then I would still end up with this problem. The other problem may be the number of keys stored in the map, which may be solvable by using an MRU kind of eviction policy ... – sunny Sep 17 '15 at 19:03

If your process allows it, using a Combiner (a reduce-type function) could help you. If you pre-aggregate the data on the mapper side, then even if all your data ends up in the same reducer, the amount of data could be manageable.
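For illustration, a minimal combiner sketch, assuming the values are counts that can be partially summed (as in word count); it would be wired in with job.setCombinerClass(SumCombiner.class) in the driver:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Pre-aggregate on the map side so the reducer receives one partial sum
            // per key per mapper instead of every raw (key, 1) record.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }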

An alternative would be to reimplement the partitioner to avoid the skewed case.

RojoSam
  • I should probably have added a little more detail. The job I am running incorporates secondary sort, essentially to group data logically in the reducer; i.e. this job is not an aggregation job. – sunny Sep 17 '15 at 20:43
  • Secondary sort only sorts the values associated with one key. The combiner is like a local reducer in a mapper container: it aggregates all the values associated with the same key in ONLY one mapper and transfers the result to the reducer to allow the full aggregation of all mappers' data. – RojoSam Sep 17 '15 at 20:50
  • Then instead of emitting 1000 values in the form (key1,1), (key1,1), ... (key1,1) from one mapper to the reducer, with the combiner the map will only emit a single record (key1,1000), assuming your reducer is adding the values. Then even if your data continues to be skewed, instead of receiving 1 million values, a reducer could receive only 100 partial aggregations. With "addition" this is very easy, but not all aggregations allow partial aggregation. – RojoSam Sep 17 '15 at 20:56
  • In several cases the same reducer class is used as a combiner. – RojoSam Sep 17 '15 at 20:58
  • My reducer is an identity reducer. Its only function is to collect all logically grouped records. It is my outputformat which then dumps out the records. – sunny Sep 17 '15 at 21:29
  • You should include a data sample to let us understand your case. Using a combiner wouldn't work for you. If your values are grouped under several keys, then you need to improve your partitioner to scatter them to more than one reducer; you need to redefine your key and/or your partitioner. If all the values belong to the same key, then you need to redefine your data model or strategy. – RojoSam Sep 19 '15 at 21:20