
I have an HBase table with hundreds of thousands of rows and we're experiencing issues with hotspotting.

I'd like to recreate this table with salted row keys.

I've attempted to use "org.apache.hadoop.hbase.mapreduce.Import" / CopyTable to copy into a new salted table, but it doesn't prefix the row keys with the salt.

The only solution I've found that actually migrates rows with the salt prefix is a Phoenix query: UPSERT INTO TABLE_SALTED SELECT * FROM TABLE
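For reference, this is roughly what that looks like (the schema below is just a placeholder for my real columns, and the SALT_BUCKETS count is only an example):

CREATE TABLE TABLE_SALTED (
    ROWKEY VARCHAR PRIMARY KEY,
    COL1 VARCHAR,
    COL2 VARCHAR
) SALT_BUCKETS = 16;

-- Phoenix prepends the salt byte to the row key on write
UPSERT INTO TABLE_SALTED SELECT * FROM TABLE;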

However, this is VERY inefficient and takes way too long.

How do I salt an existing HBase / Phoenix table with minimal downtime?

jn5047
  • If read load is your only problem, maybe you can consider region replication? https://stackoverflow.com/questions/35108526/does-hbase-have-region-replications – mazaneicha Dec 31 '21 at 15:15

2 Answers


Generally HBase uses splitting to handle "hotspots".

That said you can manually split a table:

split '[table_to_split]', '[split point]'

This is more efficient, as you are using the tools that come with HBase, and it doesn't require an entire re-write. It will only move the needle a little, but sometimes that's enough to limp along.

There are a lot of settings you can play with to help. Look into RegionSplitPolicy and see if you can find some help there.
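For example (a sketch only; the exact shell syntax varies a bit between HBase versions, and ConstantSizeRegionSplitPolicy here is just one of the stock policies you could pick), a split policy can be set per table at creation time:

create 'my_table', {METADATA => {'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy'}}, {NAME => 'cf1'}

There is also a cluster-wide default, hbase.regionserver.region.split.policy in hbase-site.xml, if you want to change the behaviour everywhere.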

If you want to look at a really good article on splitting, read this Cloudera post.

I'm not sure how much thought you put into picking your splits, but you really can't get better optimization than picking solid pre-split points that work around your data. (If you are salting, it's likely you've already discovered skew in your data, and even reasonable intentions in picking splits don't handle skew, unless of course you already knew about the skew when you picked them.)
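As a sketch, solid pre-split points are supplied at table creation time in the shell (the table name, column family and split points below are made up; you would derive the split points from your actual key distribution):

create 'my_presplit_table', 'cf', SPLITS => ['row_025000', 'row_050000', 'row_075000']

If you have a lot of split points, SPLITS_FILE => 'splits.txt' with one split point per line is easier to manage.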

Matt Andruff
  • You know, if I were you I might write my own RegionSplitPolicy specifically to handle the skewed data. It would allow for better logic to handle skew. (You could use an existing class for all non-skewed data and just do special things for skewed data.) – Matt Andruff Dec 30 '21 at 14:08
  • Thanks for the reply. We've been splitting tables and moving regions around (based on read requests) to better balance the load. The hotspot issue we're experiencing isn't necessarily caused by the size of our regions, but by the number of read requests to a region. We're seeing an imbalance of read requests among regions, so some small regions are experiencing heavy loads. – jn5047 Dec 30 '21 at 21:19
  • This is a band-aid fix, but it won't completely solve our problem the way salting the table and throwing more CPU at it would. And yes, unfortunately, we only discovered the skew long after the design of our table. Our table now has ~60GB of data. – jn5047 Dec 30 '21 at 21:27

If this hotspotting issue is caused by repeated reads, why not try increasing hfile.block.cache.size and hbase_regionserver_heapsize?

hfile.block.cache.size - portion of the region server heap used for the block cache.
hbase_regionserver_heapsize - size of the region server's heap.

You can just increase hfile.block.cache.size, but you may then end up putting more pressure on the heap.
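As a rough sketch, the cache fraction lives in hbase-site.xml (the value here is purely illustrative; the region server heap itself is usually set through HBASE_HEAPSIZE in hbase-env.sh, or the hbase_regionserver_heapsize setting if you're on Ambari):

<property>
  <name>hfile.block.cache.size</name>
  <!-- fraction of the region server heap given to the block cache; raising it leaves less for memstore and everything else -->
  <value>0.5</value>
</property>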

The next obvious question is: by how much? The answer is the same as for all performance optimizations: get an expert to try to calculate it, or just keep adding a little until you run out of space or stop seeing improvement.

Matt Andruff