I have a Java function that runs on a single HBase row: it takes a `Result` as input and outputs a `byte[]`. I would like to run this function on 10K-100K HBase rows and collect the results. I have a `List<byte[]>` of the row keys I'd like to run the function on; they are distributed evenly across all regions of the table. I would like to do this under the following constraints (a sketch of the setup follows the list):
- Not ship all the rows from the server to the client
- No long job initialization; the entire operation is expected to run in under a second
- Utilize the processing power of the Hadoop cluster rather than the processing power of the client
- Obviously, not depend on the size of the HBase table, which can be billions of rows
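
For concreteness, here is a minimal sketch of the setup, with a hypothetical `RowFunction` interface and table name. This is the naive client-side approach I'm trying to avoid: a multi-get ships every `Result` over the network and runs the function on the client, violating the first and third constraints.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class ClientSideBaseline {

    // The per-row function described above (the name is illustrative).
    interface RowFunction {
        byte[] apply(Result row);
    }

    // Naive baseline: every Result crosses the network and the function
    // runs on the client -- exactly what I want to avoid.
    static List<byte[]> runClientSide(Connection conn, List<byte[]> rowKeys,
                                      RowFunction fn) throws IOException {
        try (Table table = conn.getTable(TableName.valueOf("my_table"))) {
            List<Get> gets = new ArrayList<>(rowKeys.size());
            for (byte[] key : rowKeys) {
                gets.add(new Get(key));
            }
            Result[] results = table.get(gets); // full rows shipped to the client
            List<byte[]> out = new ArrayList<>(results.length);
            for (Result r : results) {
                out.add(fn.apply(r));           // computation happens client-side
            }
            return out;
        }
    }
}
```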
What's the best way to achieve this? I've thought of these options:
- Spark - I'm not sure this is a good option when the job touches only a tiny percentage of the table's rows
- Coprocessor - is there a way to invoke a coprocessor in bulk on a `List<byte[]>` of row keys and collect the results? Will the work be processed in parallel by the cluster? (A client-side invocation sketch follows this list.)
- Custom HBase filter - implement a custom filter and then do a bulk `Get` on the `List<byte[]>` with that filter. The `Get`s will be processed by all region servers in parallel and can run custom logic, but this seems like a hack, and I'm not sure a custom filter can return data that wasn't present in one of the row's columns. (A filter sketch also follows below.)
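
For the coprocessor option, the client-side fan-out might look roughly like the sketch below. `Table.coprocessorService` and `Batch.Call` are the actual HBase client APIs (the callback class shown is the HBase 2.x one), but `MyFunctionProtos` - the protobuf-generated service, request, and response - is entirely hypothetical: I'd have to define that endpoint and deploy it on the region servers.

```java
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.ipc.CoprocessorRpcUtils;

public class CoprocessorClientSketch {

    // MyFunctionProtos.* is a hypothetical protobuf-generated endpoint.
    static Map<byte[], MyFunctionProtos.RunResponse> runOnRegions(
            Table table, MyFunctionProtos.RunRequest request) throws Throwable {
        // Null start/end keys address every region; each region server runs
        // the endpoint locally and only the small responses travel back.
        return table.coprocessorService(
                MyFunctionProtos.MyFunctionService.class, null, null,
                new Batch.Call<MyFunctionProtos.MyFunctionService,
                               MyFunctionProtos.RunResponse>() {
                    @Override
                    public MyFunctionProtos.RunResponse call(
                            MyFunctionProtos.MyFunctionService service)
                            throws IOException {
                        CoprocessorRpcUtils.BlockingRpcCallback<MyFunctionProtos.RunResponse>
                                callback = new CoprocessorRpcUtils.BlockingRpcCallback<>();
                        service.runFunction(null, request, callback);
                        return callback.get();
                    }
                });
    }
}
```

The open question is how to restrict the work to my 10K-100K keys: presumably the request would carry the row keys (or the per-region subset of them) so each region only processes its own keys.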
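
On the filter doubt: as far as I can tell a filter can rewrite cell values on the way out via `Filter.transformCell`, so it could return computed data rather than stored data, though it still feels like a hack. A rough, untested sketch against the HBase 1.x API (`CellUtil.createCell` is deprecated in 2.x, and a real filter would also need `toByteArray()`/`parseFrom()` for serialization to the region servers):

```java
import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.filter.FilterBase;

// Hypothetical filter that replaces each cell's value with the output of
// the per-row computation, so the computed byte[] rides back in the Result.
public class FunctionFilter extends FilterBase {

    @Override
    public Cell transformCell(Cell cell) throws IOException {
        byte[] computed = computeFromCell(cell); // stand-in for the real function
        // Rebuild the cell with the computed value in place of the stored one.
        return CellUtil.createCell(
                CellUtil.cloneRow(cell),
                CellUtil.cloneFamily(cell),
                CellUtil.cloneQualifier(cell),
                cell.getTimestamp(),
                cell.getTypeByte(),
                computed);
    }

    private byte[] computeFromCell(Cell cell) {
        // Placeholder: the real per-row logic would go here.
        return CellUtil.cloneValue(cell);
    }

    // toByteArray()/parseFrom() omitted; they are required before the filter
    // can actually be shipped to the region servers.
}
```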