I am writing a job that scans an HBase table and deletes certain rows. I've read that I should batch up my deletes and flush them periodically, rather than issuing each delete individually or collecting everything into one huge batch at the end. My code right now is equivalent to:
void addDeleteToBatch(Delete delete) {
    deleteBatch.add(delete);
    // Flush once the batch reaches the (arbitrarily chosen) limit.
    if (deleteBatch.size() >= 1000) {
        flushDeletes();
    }
}

void flushDeletes() {
    if (!deleteBatch.isEmpty()) {
        hbase.batchDelete("table_name", deleteBatch);
    }
    deleteBatch.clear();
    log("batch flushed");
}
However, I have no real reason for choosing 1000 as the maximum batch size, and I can't find any resources that hint at how many operations should be batched at a time. Are there any guidelines for this? Intuitively, not batching at all, or using very small batches, seems very inefficient, but very large batches also seem like a bad idea (if nothing else, every buffered delete sits in client memory until the flush). Is there an efficiency sweet spot?
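For comparison, I also came across BufferedMutator, which does this buffering internally but caps the buffer in bytes (hbase.client.write.buffer, which I believe defaults to around 2 MB) rather than by operation count. A minimal sketch of that approach, assuming a byte-based cap is an acceptable substitute for a fixed batch size (the 4 MB value is just an illustration, not a recommendation):

import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;

public class BufferedDeleter implements AutoCloseable {
    private final BufferedMutator mutator;

    BufferedDeleter(Connection connection) throws IOException {
        // The client accumulates mutations and flushes them once the buffer
        // exceeds writeBufferSize bytes.
        BufferedMutatorParams params =
                new BufferedMutatorParams(TableName.valueOf("table_name"))
                        .writeBufferSize(4L * 1024 * 1024); // illustrative 4 MB cap
        this.mutator = connection.getBufferedMutator(params);
    }

    void delete(byte[] rowKey) throws IOException {
        mutator.mutate(new Delete(rowKey)); // flushed automatically when the buffer fills
    }

    @Override
    public void close() throws IOException {
        mutator.close(); // flushes anything still buffered
    }
}

But that only moves the question from "how many operations" to "how many bytes", so I'd still like to understand where the sweet spot lies.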