I want to query HBase to fetch records for a provided list of row keys. Since mappers in MapReduce run in parallel, I want to use them for this.

The input list will contain roughly 100,000 row keys, and I have created a custom InputFormat that gives each mapper a list of 1,000 row keys to query against the HBase table. Some of the requested records may not be present in the table; I want to return only those that are.
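For illustration only, the 1,000-keys-per-mapper split described above could be sketched in plain Java as follows. `RowKeyBatches` and `partition` are hypothetical names, not part of the asker's InputFormat or of any Hadoop/HBase API; a real InputFormat would wrap each batch in an InputSplit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a row-key list into fixed-size batches, one batch per mapper. */
final class RowKeyBatches {
    private RowKeyBatches() {}

    static List<List<String>> partition(List<String> rowKeys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < rowKeys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rowKeys.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(rowKeys.subList(start, end)));
        }
        return batches;
    }
}
```

Calling `RowKeyBatches.partition(keys, 1000)` on 100,000 keys would yield 100 batches; the last batch is shorter when the key count is not a multiple of the batch size.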

In the examples I have seen, an HBase table scan is used to fetch a range of row keys, with the range specified by startingRowKey and endingRowKey. I want to query only for the provided list of row keys.

How can I do this with MapReduce? Any help is welcome!

hp36

2 Answers

Since you pass a list of row keys to your mapper, you should make get requests to HBase. Each get returns the data for the requested key, or nothing if the key doesn't exist.

First, create a Table instance in the setup() method of your mapper:

private Connection connection;
private Table table;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration hbaseConfig = HBaseConfiguration.create();
    // Keep the Connection in a field so it can be closed when the mapper finishes.
    this.connection = ConnectionFactory.createConnection(hbaseConfig);
    this.table = connection.getTable(TableName.valueOf("hbaseTable"));
}

Then, in the map() method, make a get request to the HBase table for each key, using Get and Result instances:

String key = "keyString";
Get getValue = new Get(Bytes.toBytes(key));

// optionally restrict the get to a column family and column qualifier
getValue.addColumn(Bytes.toBytes("columnFamily"), Bytes.toBytes("columnQual"));

// map() already declares IOException, so no try block is needed here
Result result = table.get(getValue);
if (result.isEmpty()) {
    // requested key doesn't exist (no second exists() call is needed)
    return;
}

// do what you want with the result instance

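When the mapper has finished its work, close the table in the cleanup() method. The Connection opened in setup() should be closed here too if you kept a reference to it:

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    table.close();
    // If the Connection from setup() is stored in a field, close it here as well.
}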

You can also pass the results of the get requests to reducers, or combine them in the cleanup() method. It depends on what you need.

maxteneff
  • Thanks @maxteneff for the help. As you mentioned, I should make a `get()` request to the HBase table per key in the `map()` method. Since the mapper already has 1000 row keys, wouldn't it be better to make one batch request for all row keys at once instead of making 1000 requests one by one in the mapper? – hp36 May 30 '16 at 10:09
  • Yeah, of course you can use the `Result[] get(List<Get> gets)` method of the `Table` interface to make a batch get request. – maxteneff May 30 '16 at 10:48

You can use a method like this in your mapper; it worked well for me. It returns an array of Result.

/**
 * Method getDetailRecords.
 *
 * @param listOfRowKeys List<String>
 * @return Result[]
 * @throws IOException
 */
private Result[] getDetailRecords(final List<String> listOfRowKeys) throws IOException {
    final HTableInterface table = HBaseConnection.getHTable(TBL_DETAIL);
    final List<Get> listOFGets = new ArrayList<Get>();
    Result[] results = null;
    try {
        // prepare a batch of gets, one per row key
        for (final String rowkey : listOfRowKeys) {
            // System.err.println("get 'yourtablename', '" + saltIndexPrefix + rowkey + "'");
            final Get get = new Get(Bytes.toBytes(saltedRowKey(rowkey)));
            get.addColumn(COLUMN_FAMILY, Bytes.toBytes(yourcolumnname));
            listOFGets.add(get);
        }
        results = table.get(listOFGets);
    } finally {
        table.close();
    }
    return results;
}
Ram Ghadiyaram
  • If even a single key is not present in the table, will `Result[]` be empty? This issue is raised [here](http://stackoverflow.com/questions/13310434/hbase-api-get-data-rows-information-by-list-of-row-ids) by @Gevorg – hp36 May 30 '16 at 11:01
  • I think so. If a row is not in HBase, it might return null; I don't remember exactly. This is something you can try in practice and find out. I don't have an environment here to test. But this is the best approach for your requirement. – Ram Ghadiyaram May 30 '16 at 11:09
  • I would like to add one more thing: it's better to use the `void batch(final List<? extends Row> actions, final Object[] results)` [method](http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hbase/hbase-client/1.0.0-cdh5.4.3/org/apache/hadoop/hbase/client/Table.java#Table.batch%28java.util.List%2Cjava.lang.Object%5B%5D%29) because if any exception is thrown by one of the actions, partially executed results can be retrieved in `Object[] results`. – hp36 Jun 06 '16 at 07:01
  • When we call the HTable get(List) method, it calls the processBatchCallback method in HConnectionImplementation. That method finds the region servers that host the data and then establishes calls to those servers. [link](https://bigdataexplorer.wordpress.com/2013/07/29/hbase-row-key-design/) – Ram Ghadiyaram Jun 26 '16 at 17:59
  • Since creating an HBase connection is a costly operation, can we create the connection in the driver program and share it with each mapper instead of creating it in the mapper's `setup()` method? – hp36 Jun 29 '16 at 08:39
  • No, you don't know how many mappers will be launched until it happens, so you don't know how many connections need to be supplied. It's better to open the connection in setup and close it in cleanup. – Ram Ghadiyaram Jun 29 '16 at 09:35
  • I meant sharing the HBase [Connection](https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Connection.html) object obtained via `ConnectionFactory.createConnection(hconf)` with each mapper. There is only one cluster connection to be shared, so there is no need to know the number of mappers. Since connection creation is a heavyweight operation, it should not be repeated in each mapper. Do it once in the driver and share it with each mapper. – hp36 Jun 30 '16 at 09:35
  • I think if I serialize the HBase connection object and set it in the `Configuration` class inside the driver program, then deserialize it in the mapper's `setup()` method to get the original connection object, it can be shared with each mapper. A nice example is shown [here](http://stackoverflow.com/questions/26983741/passing-objects-to-mapreduce-from-a-driver/26984356). – hp36 Jun 30 '16 at 09:39
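On the missing-key question raised in the comments, here is a minimal sketch of keeping only the rows that were found after a batch get. It assumes, and this should be verified against your HBase version, that `Table.get(List<Get>)` returns one `Result` per `Get` and that a missing row comes back as an empty `Result` (`isEmpty()` is true) rather than shrinking the array. `PresentRows` and `keepPresent` are hypothetical names; the helper is written generically so it compiles without HBase on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper: drops batch-get entries that represent missing rows. */
final class PresentRows {
    private PresentRows() {}

    static <R> List<R> keepPresent(R[] results, Predicate<R> isMissing) {
        List<R> present = new ArrayList<>();
        if (results == null) {
            return present;
        }
        for (R result : results) {
            // Keep only the entries the caller does not classify as missing.
            if (!isMissing.test(result)) {
                present.add(result);
            }
        }
        return present;
    }
}

// Assumed usage with HBase on the classpath (not compiled as part of this sketch):
//   Result[] results = table.get(listOfGets);
//   List<Result> found = PresentRows.keepPresent(results, r -> r == null || r.isEmpty());
```

The null check in the usage line is defensive: it costs nothing and guards against null entries in the array.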