Nothing actually stops you from preparing HFiles 'by hand', but by doing so you start depending on HFile compatibility issues. According to the reference guide (https://hbase.apache.org/book/arch.bulk.load.html), you just need to put your files into HDFS ('closer' to HBase) and call completebulkload.
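For reference, the same completebulkload step can also be triggered programmatically. This is only a minimal sketch, assuming an HBase 1.x-era client API (LoadIncrementalHFiles); the table name and HDFS path are placeholders, and exact signatures vary between HBase versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName name = TableName.valueOf("my_table"); // placeholder table name
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(name);
             RegionLocator locator = conn.getRegionLocator(name)) {
            // HDFS directory with one subdirectory per column family,
            // each holding the prepared HFiles (placeholder path).
            Path hfileDir = new Path("/tmp/bulkload/my_table");
            new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, admin, table, locator);
        }
    }
}
```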
Proposed strategy:
- Check HFileOutputFormat2.java in the HBase sources. It is a standard MapReduce OutputFormat. All it really needs as input is a sequence of KeyValue elements (or Cell, if we speak in terms of interfaces).
- You need to free HFileOutputFormat2 from MapReduce. Look at its writer logic; that is the only part you need.
- You also need an efficient way to turn a stream of Put objects into the sorted KeyValue stream an HFile requires. The first places to look are TotalOrderPartitioner and PutSortReducer.
If you complete all of these steps, you will have a solution that takes a sequence of Put objects (it is no problem to generate them from any data) and produces a local HFile; a rough sketch follows below. Getting something reasonably working should take up to a week.
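As an illustration only, here is a minimal sketch of that standalone path. It bypasses HFileOutputFormat2 entirely and writes through HFile.Writer, which is roughly what the OutputFormat wraps internally. It assumes an HBase 1.x-era API; the path and column family are placeholders, and a real implementation has to handle sorting across families and region boundaries itself (the work TotalOrderPartitioner and PutSortReducer do in the MapReduce path), plus the bulk-load metadata that HFileOutputFormat2 adds to the file info.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileContext;
import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class LocalHFileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        FileSystem fs = FileSystem.getLocal(conf);
        // Placeholder output path: table dir / column family dir / file
        Path hfilePath = new Path("/tmp/bulkload/my_table/cf/hfile-00000");

        HFileContext context = new HFileContextBuilder()
            .withBlockSize(64 * 1024)
            .build();

        HFile.Writer writer = HFile.getWriterFactory(conf, new CacheConfig(conf))
            .withPath(fs, hfilePath)
            .withFileContext(context)
            .create();

        // KeyValues MUST be appended in HBase sort order (row, family, qualifier, ts);
        // this toy example simply generates them already in order.
        List<KeyValue> kvs = new ArrayList<KeyValue>();
        for (int i = 0; i < 1000; i++) {
            kvs.add(new KeyValue(
                Bytes.toBytes(String.format("row-%08d", i)), // row key
                Bytes.toBytes("cf"),                          // column family (placeholder)
                Bytes.toBytes("q"),                           // qualifier
                System.currentTimeMillis(),
                Bytes.toBytes("value-" + i)));
        }
        for (KeyValue kv : kvs) {
            writer.append(kv);
        }
        writer.close();
    }
}
```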
I did not go this way, because with a good InputFormat and a data-transforming mapper (which I have had for a long time) I can use the standard TotalOrderPartitioner and HFileOutputFormat2 INSIDE the MapReduce framework and have everything working with the full power of the cluster. Confused by a 10G SQL dump being loaded in 5 minutes? Not me. You can't beat that speed with a single server.
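For completeness, the driver for that MapReduce route looks roughly like the sketch below. It assumes an HBase 1.x-era API; the table name, paths and the SqlDumpMapper (including its tab-separated input) are hypothetical stand-ins for your own InputFormat and transforming mapper. The key call is HFileOutputFormat2.configureIncrementalLoad(), which wires in TotalOrderPartitioner (with region boundaries as split points) and PutSortReducer for you.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SqlDumpToHFilesDriver {

    // Hypothetical mapper: expects tab-separated "rowkey<TAB>value" lines
    // already exported from the SQL database, and turns them into Puts.
    public static class SqlDumpMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            if (parts.length < 2) {
                return; // skip malformed lines
            }
            byte[] row = Bytes.toBytes(parts[0]);
            Put put = new Put(row);
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
            ctx.write(new ImmutableBytesWritable(row), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName name = TableName.valueOf("my_table"); // placeholder

        Job job = Job.getInstance(conf, "sql-dump-to-hfiles");
        job.setJarByClass(SqlDumpToHFilesDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(SqlDumpMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // SQL dump in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile output dir

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(name);
             RegionLocator locator = conn.getRegionLocator(name)) {
            // Sets the reducer to PutSortReducer, the partitioner to
            // TotalOrderPartitioner (split points = region boundaries),
            // and the output format to HFileOutputFormat2.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
}
```

After the job finishes, the output directory is fed to completebulkload (or LoadIncrementalHFiles, as shown earlier) to move the HFiles into the table's regions.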
Granted, this solution required careful design of the SQL queries used to extract the data from the source database for the ETL process. But now it is an everyday procedure.