
I have a Java function that runs on a single HBase row (a Result): it takes a Result as input and returns a byte[]. I would like to run this function on 10K-100K HBase rows and collect the results. I have a List<byte[]> of the row keys I'd like to run it on; they are distributed evenly across all regions of the table. I would like to do this under these constraints:

  • Not ship all the rows from the server to the client
  • No long job initialization; the entire operation is expected to run in under a second
  • Utilize the processing power of the Hadoop cluster, not the processing power of the client
  • Obviously, no dependence on the size of the HBase table, which can be billions of rows
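
For concreteness, here is the shape of the function and of the naive client-side approach the constraints rule out. This is a self-contained sketch: the Result class below is a minimal stand-in for org.apache.hadoop.hbase.client.Result (so it compiles without hbase-client), and process() is a placeholder body, not my real logic.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.hadoop.hbase.client.Result, so the sketch
// is self-contained; only the row key is modeled here.
class Result {
    private final byte[] row;
    Result(byte[] row) { this.row = row; }
    byte[] getRow() { return row; }
}

public class RowFunctionSketch {
    // The function in question: one HBase row in, one byte[] out.
    // Placeholder body: echoes the row key.
    static byte[] process(Result r) {
        return r.getRow();
    }

    public static void main(String[] args) {
        List<byte[]> rowKeys = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            rowKeys.add(("row-" + i).getBytes(StandardCharsets.UTF_8));
        }

        // The naive approach the constraints forbid: Get every row to the
        // client and run process() locally. This ships all rows over the
        // network and burns client CPU instead of the cluster's.
        List<byte[]> results = new ArrayList<>();
        for (byte[] key : rowKeys) {
            Result r = new Result(key);   // real code: table.get(new Get(key))
            results.add(process(r));
        }
        System.out.println(results.size()); // 5
    }
}
```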

What's the best way to achieve this? I've thought of these options:

  • Spark - I'm not sure this is a good option when my job touches only a tiny percentage of the rows in the table
  • Coprocessor - is there a way to run coprocessors in bulk on a List<byte[]> of row keys and collect the results? Would the work be processed in parallel by the cluster?
  • Implementing a custom HBase filter and then doing a bulk Get on the List<byte[]> with the custom filter - the Get would be processed by all region servers in parallel and could run custom logic. But this seems like a hack, and I'm not sure a custom filter can return data that wasn't present in one of the columns of the row.
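
To make the coprocessor option concrete, my understanding is that the client would group the row keys by region and issue one endpoint call per region (roughly what Table.coprocessorService does with a protobuf Service), and the region servers would execute those calls in parallel. Below is a pure-Java simulation of that fan-out, not real HBase API: the region start keys are made up, endpointCall() stands in for the server-side endpoint, and keys are compared as strings where real HBase would use Bytes.compareTo.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CoprocessorFanoutSketch {
    // Hypothetical region start keys; in real code these would come from
    // RegionLocator.getStartKeys().
    static final String[] REGION_STARTS = { "", "g", "n", "t" };

    // Group row keys by the region that owns them: floor lookup on the
    // region start key (string comparison simulates byte[] ordering).
    static Map<Integer, List<byte[]>> groupByRegion(List<byte[]> keys) {
        TreeMap<String, Integer> starts = new TreeMap<>();
        for (int i = 0; i < REGION_STARTS.length; i++) {
            starts.put(REGION_STARTS[i], i);
        }
        Map<Integer, List<byte[]>> groups = new HashMap<>();
        for (byte[] key : keys) {
            String k = new String(key, StandardCharsets.UTF_8);
            int region = starts.floorEntry(k).getValue();
            groups.computeIfAbsent(region, r -> new ArrayList<>()).add(key);
        }
        return groups;
    }

    // Stand-in for the server-side endpoint: runs the per-row function on
    // one region's batch of keys, entirely "on the server".
    static List<byte[]> endpointCall(List<byte[]> regionKeys) {
        List<byte[]> out = new ArrayList<>();
        for (byte[] k : regionKeys) out.add(k); // placeholder per-row logic
        return out;
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> rowKeys = new ArrayList<>();
        for (String s : new String[]{"apple", "grape", "kiwi", "nashi", "tomato"}) {
            rowKeys.add(s.getBytes(StandardCharsets.UTF_8));
        }
        Map<Integer, List<byte[]>> groups = groupByRegion(rowKeys);

        // One call per region, issued in parallel -- this is the part that
        // would let the cluster, not the client, do the per-row work.
        ExecutorService pool = Executors.newFixedThreadPool(groups.size());
        List<Future<List<byte[]>>> futures = new ArrayList<>();
        for (List<byte[]> g : groups.values()) {
            futures.add(pool.submit(() -> endpointCall(g)));
        }
        int total = 0;
        for (Future<List<byte[]>> f : futures) total += f.get().size();
        pool.shutdown();
        System.out.println(groups.size() + " regions, " + total + " results");
    }
}
```

If this mapping is right, the open question is still the one above: whether a single batched endpoint invocation over a List<byte[]> is supported, or whether each region call has to be issued separately.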
ytoledano
