HBase on Hadoop, data locality deep diving

Question

I have read multiple articles about how HBase gains data locality, e.g. link or the book HBase: The Definitive Guide.

I understand that when an HFile is rewritten, Hadoop writes the blocks on the same machine, i.e. the one hosting the Region Server that ran the compaction and created the bigger file on Hadoop. Everything is clear so far.

Questions:

Assuming a Region Server has a region file (HFile) which is split on Hadoop into multiple blocks, e.g. A, B, C: does that mean all blocks (A, B, C) would be written to the same Region Server?
What would happen if the HFile after compaction has 10 blocks (a huge file), but the Region Server doesn't have storage for all of them? Does it mean we lose data locality, since those blocks would be written to other machines?

Thanks for the help.

score 1 · Accepted Answer · answered Oct 06 '16 at 17:33

HBase uses the HDFS API to write data to the distributed file system (HDFS). I know this will increase your doubt about data locality. When a client writes data to HDFS using the HDFS API, HDFS ensures that a copy of the data is written to the local datanode (if applicable, i.e. if the client runs on a datanode) and then goes on to replication. Now I will answer your questions:

Yes. The HFile blocks written by a specific RegionServer (RS) reside on the local datanode until the region is moved by the HMaster for load balancing or recovery (locality will be back after a major compaction). So the blocks A, B, C would all be on the same Region Server's machine.
Yes, this may happen. But we can control it by configuring the start and end keys of each region when creating the HBase table, which allows the data to be distributed evenly across the cluster.
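
To make the second answer concrete, here is a minimal sketch of the idea in Python. It is a simplified model and not HDFS code: it assumes the first replica of each block goes to the local datanode while that node still has free space, and to a remote node otherwise. The names `place_first_replicas` and `locality_index` are hypothetical, chosen only for this illustration.

```python
def place_first_replicas(num_blocks, local_free_slots):
    """Model where the first replica of each block lands.

    Assumption: a block goes to the local datanode while it still has
    room (counted here in whole blocks), otherwise to a remote node.
    """
    placements = []
    for _ in range(num_blocks):
        if local_free_slots > 0:
            placements.append("local")
            local_free_slots -= 1
        else:
            placements.append("remote")
    return placements


def locality_index(placements):
    """Fraction of blocks whose first replica is on the local datanode."""
    if not placements:
        return 1.0  # no blocks, so nothing is remote
    return placements.count("local") / len(placements)


# Blocks A, B, C with plenty of local space: fully local.
small_file = place_first_replicas(3, local_free_slots=10)

# A 10-block HFile when the local node only has room for 7 blocks.
huge_file = place_first_replicas(10, local_free_slots=7)
```

In this model the 3-block file stays fully local, while the 10-block file ends up with 3 of its blocks on other machines, so reads of those blocks by the Region Server go over the network.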

Hope this helps.


Devas

Did I answer your question? – Devas Oct 07 '16 at 12:09
Yes, thanks :) Can you elaborate on your answer to the second question, i.e. how can I control it? – David H Oct 09 '16 at 04:20
We can pre-split the HBase table's regions, so that a given set of keys is stored in a specific region within a specific region server. To explain further: a region server contains regions of sorted keys, which means that if you have the keys 0 to 20, you can specify the regions as 0 to 5, 6 to 10, 11 to 15 and 16 to 20. If you have 4 region servers, the load balancer will normally allocate the regions to different region servers, hence the data will be distributed evenly. For more details check [here](http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/). – Devas Oct 09 '16 at 18:05
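
The 0 to 20 example from the comment above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: real HBase row keys are byte arrays compared lexicographically, while integers are used here, and the round-robin assignment stands in for the load balancer. The names `region_for_key` and `assign_regions` are hypothetical.

```python
from bisect import bisect_right

# Split points are the start keys of every region except the first, so
# regions 0-5, 6-10, 11-15 and 16-20 correspond to split keys 6, 11, 16.
SPLIT_KEYS = [6, 11, 16]


def region_for_key(key, split_keys):
    """Index of the region whose [start, end) key range contains key."""
    return bisect_right(split_keys, key)


def assign_regions(num_regions, servers):
    """Spread regions over servers round-robin (stand-in for the balancer)."""
    return {region: servers[region % len(servers)] for region in range(num_regions)}


servers = ["rs1", "rs2", "rs3", "rs4"]  # hypothetical region server names
assignment = assign_regions(len(SPLIT_KEYS) + 1, servers)

# Which server ends up holding each key from 0 to 20.
key_to_server = {k: assignment[region_for_key(k, SPLIT_KEYS)] for k in range(21)}
```

With 4 regions and 4 servers, each server holds one region, so writes for the whole key range are spread over all of them. In the HBase shell, split points like these can be supplied at table creation through the `SPLITS` option of `create`.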
