I am writing a job to scan an HBase table and delete certain rows. I've read that I should batch up my deletes and flush them periodically, rather than sending each delete individually or sending the entire batch at once. My code is currently equivalent to:

private final List<Delete> deleteBatch = new ArrayList<Delete>();

void addDeleteToBatch(Delete delete) {
  deleteBatch.add(delete);
  // Flush whenever the batch reaches the (arbitrarily chosen) limit.
  if (deleteBatch.size() >= 1000) {
    flushDeletes();
  }
}

void flushDeletes() {
  if (!deleteBatch.isEmpty()) {
    hbase.batchDelete("table_name", deleteBatch);
  }
  deleteBatch.clear();
  log("batch flushed");
}
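
For context, batchDelete is just a thin wrapper of my own; with the plain HBase client I'd expect it to boil down to roughly the following (simplified — in the real job I reuse the HTable rather than reopening it on every flush):

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;

void batchDelete(String tableName, List<Delete> deletes) throws IOException {
  Configuration conf = HBaseConfiguration.create();
  HTable table = new HTable(conf, tableName);
  try {
    // Sends the whole list in one call; the client groups the
    // deletes by region server under the hood.
    table.delete(deletes);
  } finally {
    table.close();
  }
}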

However, I have no real reason for choosing 1000 as the maximum batch size, and I can't find any resources that hint at how many operations should be batched at a time. Are there any guidelines for this? Intuitively, it seems that it would be very inefficient not to batch at all, or to use very small batches. Very large batch sizes also seem like they would be bad. Is there an efficiency sweet spot?

Nathan Norman

1 Answer

If you are doing thousands of deletes, you should use the BulkDelete coprocessor: https://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/coprocessor/example/BulkDeleteProtocol.html
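
Invoking it looks roughly like the example in the BulkDeleteEndpoint javadoc. This is a sketch against the 0.94-era coprocessor API; the BulkDeleteEndpoint class must first be deployed on the region servers, and the Scan here is a placeholder for whatever row-selection criteria you have:

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.coprocessor.example.BulkDeleteProtocol;
import org.apache.hadoop.hbase.coprocessor.example.BulkDeleteResponse;

long bulkDelete(HTable table, final Scan scan) throws Throwable {
  // The scan (row range, filters, time range) selects rows to delete on the
  // server side, so no Delete objects ever cross the network.
  Batch.Call<BulkDeleteProtocol, BulkDeleteResponse> callable =
      new Batch.Call<BulkDeleteProtocol, BulkDeleteResponse>() {
        public BulkDeleteResponse call(BulkDeleteProtocol instance) throws IOException {
          // Delete whole rows, latest versions, 500 rows per internal batch.
          return instance.delete(scan, BulkDeleteProtocol.DeleteType.ROW, null, 500);
        }
      };
  Map<byte[], BulkDeleteResponse> results = table.coprocessorExec(
      BulkDeleteProtocol.class, scan.getStartRow(), scan.getStopRow(), callable);
  long deleted = 0L;
  for (BulkDeleteResponse response : results.values()) {
    deleted += response.getRowsDeleted();
  }
  return deleted;
}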

If you don't want to use the coprocessor above, then you will need to find the sweet spot for batching yourself. It can be 100, it can be 1000.
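
A rough way to find it empirically is to time the same delete workload at a few candidate sizes and compare. This is only a sketch: it assumes an open HTable named table, reuses the log helper from your question, and buildDeletes() is a hypothetical stand-in for whatever produces your Delete list:

// Hypothetical harness: time the same workload at several batch sizes.
int[] candidateSizes = {100, 500, 1000, 5000};
for (int size : candidateSizes) {
  List<Delete> pending = new ArrayList<Delete>();
  long start = System.nanoTime();
  for (Delete d : buildDeletes()) {  // stand-in for your scan-and-collect step
    pending.add(d);
    if (pending.size() >= size) {
      table.delete(pending);  // one flush per full batch
      pending.clear();
    }
  }
  if (!pending.isEmpty()) {
    table.delete(pending);  // flush the remainder
  }
  long millis = (System.nanoTime() - start) / 1000000L;
  log("batch size " + size + " took " + millis + " ms");
}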

Anil Gupta