I have a Java function that runs on a single HBase row: it takes a `Result` as input and outputs a `byte[]`. I would like to run this function on 10K-100K HBase rows and collect the results. I have a `List<byte[]>` of the row keys I'd like to run the function on; they are distributed evenly across all regions of the table. I would like to do this under the following constraints (a sketch of the setup follows the list):
- Not ship all the rows from the server to the client
- No long job initialization; the entire operation is expected to run in under a second
- Utilize the processing power of the Hadoop cluster rather than the processing power of the client
- Obviously, not depend on the size of the HBase table, which can be billions of rows
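
For concreteness, here is a minimal sketch of the setup, with a hypothetical `RowFunction` interface and table name. This is the naive client-side approach I'm trying to avoid: a multi-get ships every `Result` over the network and runs the function on the client, violating the first and third constraints.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class ClientSideBaseline {

    // The per-row function described above (the name is illustrative).
    interface RowFunction {
        byte[] apply(Result row);
    }

    // Naive baseline: every Result crosses the network and the function
    // runs on the client -- exactly what I want to avoid.
    static List<byte[]> runClientSide(Connection conn, List<byte[]> rowKeys,
                                      RowFunction fn) throws IOException {
        try (Table table = conn.getTable(TableName.valueOf("my_table"))) {
            List<Get> gets = new ArrayList<>(rowKeys.size());
            for (byte[] key : rowKeys) {
                gets.add(new Get(key));
            }
            Result[] results = table.get(gets); // full rows shipped to the client
            List<byte[]> out = new ArrayList<>(results.length);
            for (Result r : results) {
                out.add(fn.apply(r));           // computation happens client-side
            }
            return out;
        }
    }
}
```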
What's the best way to achieve this? I've thought of these options:
- Spark - I'm not sure this is a good option when the job touches only a tiny percentage of the table's rows
- Coprocessor - is there a way to invoke a coprocessor in bulk on a `List<byte[]>` of row keys and collect the results? Will the work be processed in parallel by the cluster? (A client-side invocation sketch follows this list.)
- Custom HBase filter - implement a custom filter and then do a bulk `Get` on the `List<byte[]>` with that filter. The `Get`s will be processed by all region servers in parallel and can run custom logic, but this seems like a hack, and I'm not sure a custom filter can return data that wasn't present in one of the row's columns. (A filter sketch also follows below.)
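
For the coprocessor option, the client-side fan-out might look roughly like the sketch below. `Table.coprocessorService` and `Batch.Call` are the actual HBase client APIs (the callback class shown is the HBase 2.x one), but `MyFunctionProtos` - the protobuf-generated service, request, and response - is entirely hypothetical: I'd have to define that endpoint and deploy it on the region servers.

```java
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.ipc.CoprocessorRpcUtils;

public class CoprocessorClientSketch {

    // MyFunctionProtos.* is a hypothetical protobuf-generated endpoint.
    static Map<byte[], MyFunctionProtos.RunResponse> runOnRegions(
            Table table, MyFunctionProtos.RunRequest request) throws Throwable {
        // Null start/end keys address every region; each region server runs
        // the endpoint locally and only the small responses travel back.
        return table.coprocessorService(
                MyFunctionProtos.MyFunctionService.class, null, null,
                new Batch.Call<MyFunctionProtos.MyFunctionService,
                               MyFunctionProtos.RunResponse>() {
                    @Override
                    public MyFunctionProtos.RunResponse call(
                            MyFunctionProtos.MyFunctionService service)
                            throws IOException {
                        CoprocessorRpcUtils.BlockingRpcCallback<MyFunctionProtos.RunResponse>
                                callback = new CoprocessorRpcUtils.BlockingRpcCallback<>();
                        service.runFunction(null, request, callback);
                        return callback.get();
                    }
                });
    }
}
```

The open question is how to restrict the work to my 10K-100K keys: presumably the request would carry the row keys (or the per-region subset of them) so each region only processes its own keys.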
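
On the filter doubt: as far as I can tell a filter can rewrite cell values on the way out via `Filter.transformCell`, so it could return computed data rather than stored data, though it still feels like a hack. A rough, untested sketch against the HBase 1.x API (`CellUtil.createCell` is deprecated in 2.x, and a real filter would also need `toByteArray()`/`parseFrom()` for serialization to the region servers):

```java
import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.filter.FilterBase;

// Hypothetical filter that replaces each cell's value with the output of
// the per-row computation, so the computed byte[] rides back in the Result.
public class FunctionFilter extends FilterBase {

    @Override
    public Cell transformCell(Cell cell) throws IOException {
        byte[] computed = computeFromCell(cell); // stand-in for the real function
        // Rebuild the cell with the computed value in place of the stored one.
        return CellUtil.createCell(
                CellUtil.cloneRow(cell),
                CellUtil.cloneFamily(cell),
                CellUtil.cloneQualifier(cell),
                cell.getTimestamp(),
                cell.getTypeByte(),
                computed);
    }

    private byte[] computeFromCell(Cell cell) {
        // Placeholder: the real per-row logic would go here.
        return CellUtil.cloneValue(cell);
    }

    // toByteArray()/parseFrom() omitted; they are required before the filter
    // can actually be shipped to the region servers.
}
```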