HBase salting and effective data retrieval on range scans

Question

To avoid hotspotting region servers in HBase, it is advised to avoid sequential row keys. One approach is to salt the first byte of the row key. I want to employ this technique in my client code.

Let's say I have n region servers, and each region server may hold up to m regions. The total number of regions would then be n*m.

x, the value of the first byte, will be in the range 1 <= x <= n*m.

On the write path, when inserting data, I'd randomly generate a value of x and prepend it to my row key. That should help distribute the keys evenly.
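A minimal sketch of that write path, assuming a one-byte salt and buckets numbered from 0 rather than 1. The class and method names (`SaltedKeys`, `prepend`, `randomSalt`, `hashedSalt`) are hypothetical and not part of the HBase API. `hashedSalt` is not what the question describes; it is a commonly used deterministic alternative, included only for comparison.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

final class SaltedKeys {
    private SaltedKeys() {}

    /** Returns a new key consisting of the salt byte followed by the original row key. */
    static byte[] prepend(byte salt, byte[] rowKey) {
        byte[] salted = new byte[rowKey.length + 1];
        salted[0] = salt;
        System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
        return salted;
    }

    /** Random salt in [0, buckets), matching the strategy described in the question. */
    static byte randomSalt(int buckets) {
        checkBuckets(buckets);
        return (byte) ThreadLocalRandom.current().nextInt(buckets);
    }

    /** Deterministic salt in [0, buckets): the same row key always maps to the same bucket. */
    static byte hashedSalt(byte[] rowKey, int buckets) {
        checkBuckets(buckets);
        return (byte) Math.floorMod(Arrays.hashCode(rowKey), buckets);
    }

    /** A single byte can only distinguish 256 buckets. */
    private static void checkBuckets(int buckets) {
        if (buckets < 1 || buckets > 256) {
            throw new IllegalArgumentException("buckets must be in [1, 256]: " + buckets);
        }
    }
}
```

With a random salt, a reader cannot tell which bucket holds a given key and has to check all of them. With a deterministic salt, a point lookup can recompute the bucket from the key, while a range scan still has to visit every bucket.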

Q1: Should I actually be smarter about the salt generation strategy?

I need to perform a range scan (time-series data). Since my data is scattered across several regions, I plan to issue n*m scan requests in parallel, each executing in its own thread. Once the results are back, I'll perform the aggregation in client code.
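The fan-out could look roughly like the sketch below: one task per salt bucket, each scanning the same logical range with the salt prepended, followed by a client-side merge. This is only an illustration under stated assumptions: `scanFn` is a hypothetical stand-in for the real HBase scan call (start key inclusive, stop key exclusive, returning salted row keys), buckets are numbered from 0, and the class name `SaltedRangeScan` is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

final class SaltedRangeScan {
    private SaltedRangeScan() {}

    /** Prepends a one-byte salt to a key. */
    static byte[] salted(int salt, byte[] key) {
        byte[] out = new byte[key.length + 1];
        out[0] = (byte) salt;
        System.arraycopy(key, 0, out, 1, key.length);
        return out;
    }

    /** Lexicographic comparison treating bytes as unsigned, the way HBase orders row keys. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    /** Runs one scan per bucket in parallel, strips the salt, and returns rows in key order. */
    static List<byte[]> scanAllBuckets(int buckets, byte[] start, byte[] stop,
            BiFunction<byte[], byte[], List<byte[]>> scanFn) {
        ExecutorService pool = Executors.newFixedThreadPool(buckets);
        try {
            List<Callable<List<byte[]>>> tasks = new ArrayList<>();
            for (int salt = 0; salt < buckets; salt++) {
                byte[] from = salted(salt, start);
                byte[] to = salted(salt, stop);
                tasks.add(() -> scanFn.apply(from, to));
            }
            List<byte[]> merged = new ArrayList<>();
            for (Future<List<byte[]>> f : pool.invokeAll(tasks)) {
                for (byte[] row : f.get()) {
                    merged.add(Arrays.copyOfRange(row, 1, row.length)); // drop the salt byte
                }
            }
            merged.sort((a, b) -> compareUnsigned(a, b));
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scanning", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a bucket scan failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The final sort is needed because each bucket returns its rows in order, but the buckets interleave in the unsalted key space. For large result sets a streaming k-way merge would avoid holding everything in memory.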

Q2: Is there a way to group those requests so that, instead of issuing one scan per region, I could issue one request per region server?

I know that Apache Phoenix does something similar under the covers, but I think it achieves this with coprocessors.
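For Q2, the client-side grouping step on its own might be sketched as follows. `Location` is a hypothetical, simplified stand-in for the region location objects the HBase client returns, and `RegionGrouping` is a made-up name. The sketch only shows how regions could be bucketed by hosting server so that one worker handles each server; it does not show that this reduces the number of scan calls actually made.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RegionGrouping {
    private RegionGrouping() {}

    /** Hypothetical simplified region location: a region identifier and its hosting server. */
    static final class Location {
        final String regionId;
        final String serverName;

        Location(String regionId, String serverName) {
            this.regionId = regionId;
            this.serverName = serverName;
        }
    }

    /** Groups region identifiers by the server that hosts them, keeping input order per server. */
    static Map<String, List<String>> byServer(List<Location> locations) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Location loc : locations) {
            grouped.computeIfAbsent(loc.serverName, k -> new ArrayList<>()).add(loc.regionId);
        }
        return grouped;
    }
}
```

Each map entry could then be handed to a single thread that scans that server's regions one after another, giving n workers instead of n*m.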

See http://stackoverflow.com/a/41978281/2513573 for Q1. Q2: maybe it's better to use MapReduce for that? - AdamSkywalker, Feb 09 '17 at 11:39
Re: the answer to Q1: so if the number of regions changes, I'd have to copy my data over to a new table that has the new number of regions set? Why are you explicitly disabling region splits? - Ihor M., Feb 09 '17 at 14:45
Re: Q2: which MapReduce are you talking about? I've done some research on the HBase API, and it looks like the Connection object provides a way to retrieve a RegionLocator for a table. That, in turn, has a getAllRegionLocations() method that returns objects of type HRegionLocation. HRegionLocation has the methods getHostName() and getServerName(), so I would be able to determine the number of unique region servers that hold my regions. - Ihor M., Feb 09 '17 at 15:02
Q1: yes, if the number of regions changes you'll have to migrate the data. That's not how you use it. You should estimate how many regions you will need. Q2: You can write a Hadoop MapReduce job and use HBase as the input. HBase is installed on top of HDFS, and usually Hadoop is set up too. - AdamSkywalker, Feb 09 '17 at 15:14
