I need to improve my MR job, which uses HBase as both source and sink.
Basically, I'm reading data from 3 HBase tables in the mapper and writing it out as one huge string for the reducer to do some computation on and dump into an HBase table.
Table1 ~ 19 million rows.
Table2 ~ 2 million rows.
Table3 ~ 900,000 rows.
The output of the mapper looks something like this:
HouseHoldId contentID name duration genre type channelId personId televisionID timestamp
That is for one row of Table1; there are ~19 million such mapper outputs in total.
I want the output sorted by the HouseHoldID value, so I'm using this technique. I'm not interested in the value part of the pair, so I'm essentially ignoring it. My mapper class is defined as follows:
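(One wrinkle worth noting: Text keys sort lexicographically in the shuffle, so a numeric HouseHoldId like "9" would sort after "10". Zero-padding the ID when building the key makes the lexicographic order match the numeric order. A minimal sketch; the pad width of 12 is an arbitrary assumption:)

```java
// Hypothetical helper: left-pad a numeric HouseHoldId so that the
// lexicographic ordering of Text keys matches numeric ordering.
// The width (12) is an assumption; pick one that covers your ID range.
public class HouseHoldKey {
    static String pad(long houseHoldId) {
        return String.format("%012d", houseHoldId);
    }

    public static void main(String[] args) {
        System.out.println(pad(9));
        System.out.println(pad(9).compareTo(pad(10)) < 0); // numeric order preserved
    }
}
```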
public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }
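The body of map() is roughly like this (simplified sketch only; the column family/qualifier names here are placeholders, not my real ones):

```java
// Simplified sketch of the mapper body. "cf" and "HouseHoldId" are
// placeholder names. The IntWritable value is a dummy since the
// reducer ignores it; key and value objects are reused to cut GC churn.
public static class AnalyzeMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable DUMMY = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        String houseHoldId = Bytes.toString(
                value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("HouseHoldId")));
        // ... look up the other fields from Table2/Table3, build the
        // composite record string prefixed with houseHoldId ...
        outKey.set(houseHoldId /* + "\t" + contentId + "\t" + name + ... */);
        context.write(outKey, DUMMY);
    }
}
```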
The MR job currently takes 22 hours to complete, which is not acceptable; I need to optimize it to run a lot faster. This is how I configure the scan and the job:
scan.setCaching(750);
scan.setCacheBlocks(false);
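(One thing I've considered is narrowing the scan to just the columns the mapper actually reads, so HBase ships less data per RPC. A sketch, where "cf" and "name" are placeholder names:)

```java
// Restrict the scan to the data the mapper needs. Family/qualifier
// names below are placeholders for illustration.
scan.addFamily(Bytes.toBytes("cf"));                        // only this family
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name")); // or individual columns
scan.setCaching(750);        // rows fetched per RPC
scan.setCacheBlocks(false);  // full scans shouldn't pollute the block cache
```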
TableMapReduceUtil.initTableMapperJob(
        Table1,               // input HBase table name
        scan,
        AnalyzeMapper.class,  // mapper
        Text.class,           // mapper output key
        IntWritable.class,    // mapper output value
        job);

TableMapReduceUtil.initTableReducerJob(
        OutputTable,               // output table
        AnalyzeReducerTable.class, // reducer class
        job);

job.setNumReduceTasks(RegionCount);
My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running an 8-node Cloudera cluster.
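(One idea I've had for getting more map-side parallelism is splitting the larger regions, since TableInputFormat creates one split per region. A hedged sketch using the Admin API; whether this is the right lever here is exactly what I'm unsure about:)

```java
// Splitting a region yields more input splits (and thus more mappers)
// on subsequent job runs. Table name is a placeholder.
Configuration conf = HBaseConfiguration.create();
try (Connection conn = ConnectionFactory.createConnection(conf);
     Admin admin = conn.getAdmin()) {
    admin.split(TableName.valueOf("Table1")); // split each region at its midpoint
}
```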
Am I doing something wrong here?
Should I use a custom SortComparator or a GroupComparator, or anything like that, to make it more efficient?