I need to improve my MR job, which uses HBase as both source and sink.
Basically, I'm reading data from 3 HBase tables in the mapper and writing it out as one huge string for the reducer to do some computation on and dump into an HBase table.
Table1 ~ 19 million rows.
Table2 ~ 2 million rows.
Table3 ~ 900,000 rows.
The output of the mapper looks something like this:
HouseHoldId contentID name duration genre type channelId personId televisionID timestamp
That is for one row of Table1; there are ~19 million such mapper outputs in total.
I want the output sorted by the HouseHoldID value, so I'm using this technique. I'm not interested in the value part of the pair, so I'm essentially ignoring it. My mapper class is defined as follows:
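(One wrinkle worth noting: Text keys sort lexicographically in the shuffle, so a numeric HouseHoldId like "9" would sort after "10". Zero-padding the ID when building the key makes the lexicographic order match the numeric order. A minimal sketch; the pad width of 12 is an arbitrary assumption:)

```java
// Hypothetical helper: left-pad a numeric HouseHoldId so that the
// lexicographic ordering of Text keys matches numeric ordering.
// The width (12) is an assumption; pick one that covers your ID range.
public class HouseHoldKey {
    static String pad(long houseHoldId) {
        return String.format("%012d", houseHoldId);
    }

    public static void main(String[] args) {
        System.out.println(pad(9));
        System.out.println(pad(9).compareTo(pad(10)) < 0); // numeric order preserved
    }
}
```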
public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }
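The body of map() is roughly like this (simplified sketch only; the column family/qualifier names here are placeholders, not my real ones):

```java
// Simplified sketch of the mapper body. "cf" and "HouseHoldId" are
// placeholder names. The IntWritable value is a dummy since the
// reducer ignores it; key and value objects are reused to cut GC churn.
public static class AnalyzeMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable DUMMY = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        String houseHoldId = Bytes.toString(
                value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("HouseHoldId")));
        // ... look up the other fields from Table2/Table3, build the
        // composite record string prefixed with houseHoldId ...
        outKey.set(houseHoldId /* + "\t" + contentId + "\t" + name + ... */);
        context.write(outKey, DUMMY);
    }
}
```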
The MR job currently takes 22 hours to complete, which is not acceptable; I need to optimize it to run a lot faster. This is how I configure the scan and the job:
scan.setCaching(750);
scan.setCacheBlocks(false);
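(One thing I've considered is narrowing the scan to just the columns the mapper actually reads, so HBase ships less data per RPC. A sketch, where "cf" and "name" are placeholder names:)

```java
// Restrict the scan to the data the mapper needs. Family/qualifier
// names below are placeholders for illustration.
scan.addFamily(Bytes.toBytes("cf"));                        // only this family
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name")); // or individual columns
scan.setCaching(750);        // rows fetched per RPC
scan.setCacheBlocks(false);  // full scans shouldn't pollute the block cache
```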
TableMapReduceUtil.initTableMapperJob(
        Table1,               // input HBase table name
        scan,
        AnalyzeMapper.class,  // mapper
        Text.class,           // mapper output key
        IntWritable.class,    // mapper output value
        job);

TableMapReduceUtil.initTableReducerJob(
        OutputTable,               // output table
        AnalyzeReducerTable.class, // reducer class
        job);

job.setNumReduceTasks(RegionCount);
My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running an 8-node Cloudera cluster.
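(One idea I've had for getting more map-side parallelism is splitting the larger regions, since TableInputFormat creates one split per region. A hedged sketch using the Admin API; whether this is the right lever here is exactly what I'm unsure about:)

```java
// Splitting a region yields more input splits (and thus more mappers)
// on subsequent job runs. Table name is a placeholder.
Configuration conf = HBaseConfiguration.create();
try (Connection conn = ConnectionFactory.createConnection(conf);
     Admin admin = conn.getAdmin()) {
    admin.split(TableName.valueOf("Table1")); // split each region at its midpoint
}
```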
Am I doing something wrong here?
Should I use a custom SortComparator or a GroupComparator, or anything like that, to make it more efficient?