HBase
HBase persists data in large files called HFiles, typically hundreds of megabytes to around a gigabyte in size.
When HBase reads, it first checks the memstore in case the data is still in memory from a recent update or insertion. If the data is not in memory, HBase finds the HFiles whose key range could contain the data you want (only one file if you have run a major compaction).
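The lookup order described above can be sketched roughly as follows. This is a toy model, not HBase's code: the `read` function, the dictionary-based `memstore`, and the `first_key` / `last_key` / `cells` fields are all hypothetical names chosen for illustration, and it ignores versions, deletes, and caching.

```python
def read(key, memstore, hfiles):
    """Return the value stored for `key`, or None if it is not found."""
    # 1. Recent writes may still be in memory.
    if key in memstore:
        return memstore[key]
    # 2. Otherwise look only at HFiles whose key range could hold the key.
    for hfile in hfiles:  # assumed to be ordered newest first
        if hfile["first_key"] <= key <= hfile["last_key"]:
            if key in hfile["cells"]:
                return hfile["cells"][key]
    return None


# Hypothetical sample data: one in-memory cell and two HFiles.
memstore = {"row7": "fresh"}
hfiles = [
    {"first_key": "row5", "last_key": "row9", "cells": {"row5": "e", "row9": "i"}},
    {"first_key": "row1", "last_key": "row4", "cells": {"row1": "a", "row4": "d"}},
]
```

With this data, `read("row7", memstore, hfiles)` is answered from memory, while `read("row4", memstore, hfiles)` skips the first file because the key falls outside its range.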
An HFile contains many data blocks (the HBase blocks, 64 KB by default). These blocks are small to allow for fast random access. At the end of the file, there is an index referencing all these blocks, with the range of keys in each block and the offset of the block in the file.
The first time an HFile is read, its index is loaded and kept in memory for future accesses. Then:
- HBase performs a binary search in the index (fast in memory) to locate the block that potentially contains the key you asked for
- Once the block is located, HBase can ask the filesystem to read this specific 64 KB block at this specific offset in the file, resulting in a single disk seek to load the data block you want to check
- The loaded 64 KB HBase block is searched for the key you asked for, and the key-value is returned if it exists
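The three steps above can be sketched as a small in-memory model. It is only an illustration under simplifying assumptions: an "HFile" here is a Python list of blocks, the index holds just the first key of each block, and `build_index` and `lookup` are hypothetical helpers, not HBase APIs. How HBase actually searches inside a loaded block is not modeled; a binary search is used for brevity.

```python
import bisect


def build_index(blocks):
    """Return the first key of every block, in file order."""
    return [block[0][0] for block in blocks]


def lookup(index, blocks, key):
    """Locate the candidate block through the index, then search only that block."""
    # Step 1: binary search for the last block whose first key is <= key.
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return None  # the key sorts before the first block
    # Step 2: stands in for one seek plus one read of a single block.
    block = blocks[i]
    # Step 3: search inside the loaded block.
    keys = [k for k, _ in block]
    j = bisect.bisect_left(keys, key)
    if j < len(keys) and keys[j] == key:
        return block[j][1]
    return None


# Hypothetical HFile content: three blocks of sorted (key, value) pairs.
blocks = [
    [("apple", 1), ("banana", 2)],
    [("cherry", 3), ("grape", 4)],
    [("melon", 5), ("peach", 6)],
]
index = build_index(blocks)
```

Only one block is ever touched per lookup, which is the point of keeping the index in memory: `lookup(index, blocks, "grape")` reads the second block and nothing else.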
Smaller HBase blocks make random accesses more efficient, because less data is read from disk for each lookup, but they also mean more blocks per file, which increases the index size and the memory needed to hold it.
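To get a feel for this trade-off, here is a back-of-the-envelope calculation of how the number of index entries grows as blocks shrink. The 1 GB file size and the 50-byte average index entry are assumptions picked for illustration, not measured values, and `index_entries` is a hypothetical helper.

```python
KB = 1024
GB = 1024 ** 3
ENTRY_BYTES = 50  # assumed average size of one index entry, illustrative only


def index_entries(hfile_bytes, block_bytes):
    """One index entry per data block, rounding up for a partial last block."""
    return -(-hfile_bytes // block_bytes)


for block_kb in (8, 64, 256):
    n = index_entries(1 * GB, block_kb * KB)
    approx_kb = n * ENTRY_BYTES / KB
    print(f"{block_kb:>3} KB blocks: {n:>6} entries, about {approx_kb:.0f} KB of index")
```

Dividing the block size by eight multiplies the number of index entries by eight, so the memory cost of the index scales inversely with the block size.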
HDFS
All filesystem accesses are executed by HDFS, which has its own blocks (64 MB by default in older Hadoop releases, 128 MB since Hadoop 2). In HDFS, blocks are used for distribution and data locality: with a 64 MB block size, a 1 GB file is split into 64 MB chunks that are distributed and replicated. These blocks are big to ensure that batch processing time is not spent mostly in disk seeks, since the data within a chunk is contiguous.
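As a quick worked example of that split, the sketch below computes the chunks for a file of a given size. The `hdfs_blocks` helper is hypothetical, and the replication factor of 3 is an assumption made for the example; check your cluster's configuration for the actual value.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def hdfs_blocks(file_bytes, block_bytes=64 * MB):
    """Sizes of the HDFS blocks a file is split into; the last one may be shorter."""
    full, rest = divmod(file_bytes, block_bytes)
    return [block_bytes] * full + ([rest] if rest else [])


REPLICATION = 3  # assumed replication factor for this example

chunks = hdfs_blocks(1 * GB)
stored_replicas = len(chunks) * REPLICATION
print(len(chunks), "blocks,", stored_replicas, "block replicas across the cluster")
```

A 1 GB file therefore becomes 16 blocks of 64 MB, and a file that is not a multiple of the block size simply ends with a shorter block.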
Conclusion
HBase blocks and HDFS blocks are different things:
- HBase blocks are the unit of indexing (as well as caching and compression) in HBase and allow for fast random access
- HDFS blocks are the unit of distribution and data locality in the filesystem
Tuning the HDFS block size relative to your HBase parameters and your needs will affect performance, but that is a more subtle matter.
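One way to see how the two sizes relate is to count how many HBase blocks fit in one HDFS block, and to check whether a given read crosses an HDFS block boundary. This is only a sketch using the 64 KB and 64 MB defaults mentioned above; both helper functions are hypothetical, and real HBase blocks are not guaranteed to be exactly 64 KB.

```python
KB = 1024
MB = 1024 ** 2


def hbase_blocks_per_hdfs_block(hdfs_block=64 * MB, hbase_block=64 * KB):
    """How many HBase blocks of the given size fit in one HDFS block."""
    return hdfs_block // hbase_block


def straddles_boundary(offset, length, hdfs_block=64 * MB):
    """True if a read of `length` bytes at `offset` spans two HDFS blocks."""
    return offset // hdfs_block != (offset + length - 1) // hdfs_block


print(hbase_blocks_per_hdfs_block(), "HBase blocks per HDFS block")
```

A read that starts just before an HDFS block boundary, for example `straddles_boundary(64 * MB - 1024, 64 * KB)`, could involve two HDFS blocks instead of one.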