
Bulk load usually uses MapReduce to create a file on HDFS, and this file is then associated with a region.

If that's the case, can my client create this file locally and then put it on HDFS? Since we already know what the keys and values are, we can build it locally without loading the server. Can someone point me to an example of how an HFile can be created (any language is fine)?

regards

user3529980

1 Answer


Nothing actually stops anyone from preparing an HFile 'by hand', but by doing so you start to depend on HFile compatibility issues. According to the reference guide (https://hbase.apache.org/book/arch.bulk.load.html), you just need to put your files on HDFS ('closer' to HBase) and call completebulkload.
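For reference, that load step is a single tool invocation once the HFiles are on HDFS; something along these lines, where the output directory and table name are placeholders:

    hadoop jar hbase-server-<VERSION>.jar completebulkload /user/me/hfile-output mytable

or, equivalently,

    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/me/hfile-output mytable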

Proposed strategy:

- Check HFileOutputFormat2.java from the HBase sources. It is a standard MapReduce OutputFormat. What you really need as a base is just a sequence of KeyValue elements (or Cell, if we speak in terms of interfaces).
- You need to free HFileOutputFormat2 from MapReduce. Check its writer logic for this; you need only that part (see the sketch below).
- You also need to build an efficient way to turn a stream of Puts into the KeyValue stream that goes into the HFile. The first places to look are TotalOrderPartitioner and PutSortReducer.
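To illustrate the writer part, here is a minimal sketch, assuming the 0.98-era internal API (HFile.getWriterFactory and HFileContextBuilder; these are not a stable public interface, so details may differ in your version). The output path, column family and qualifier are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.hfile.CacheConfig;
    import org.apache.hadoop.hbase.io.hfile.HFile;
    import org.apache.hadoop.hbase.io.hfile.HFileContext;
    import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LocalHFileWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            FileSystem fs = FileSystem.get(conf);
            // completebulkload expects HFiles in a subdirectory named after the column family
            Path hfilePath = new Path("/user/me/hfile-output/f1/hfile-00000");

            HFileContext context = new HFileContextBuilder()
                    .withBlockSize(64 * 1024)
                    .build();

            HFile.Writer writer = HFile.getWriterFactory(conf, new CacheConfig(conf))
                    .withPath(fs, hfilePath)
                    .withFileContext(context)
                    .create();
            try {
                // KeyValues MUST be appended in strict sort order:
                // row, then family, then qualifier, then descending timestamp.
                for (int i = 0; i < 1000; i++) {
                    byte[] row = Bytes.toBytes(String.format("row-%05d", i));
                    writer.append(new KeyValue(row,
                            Bytes.toBytes("f1"),
                            Bytes.toBytes("q1"),
                            System.currentTimeMillis(),
                            Bytes.toBytes("value-" + i)));
                }
            } finally {
                writer.close();
            }
        }
    }

Keep in mind that HFiles produced by the real write path carry extra metadata (bloom filters, time ranges and so on) that a bare HFile.Writer may not add.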

If you do all these steps, you have a solution that can take a sequence of Puts (it's no problem to generate them from any data) and produce a local HFile as a result. It looks like this should take up to a week to get something working reasonably well.
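For the Put-to-KeyValue part, a minimal sketch of what PutSortReducer essentially does, assuming the 0.98-era client API (Put.getFamilyCellMap, KeyValueUtil):

    import java.util.List;
    import java.util.TreeSet;

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.KeyValueUtil;
    import org.apache.hadoop.hbase.client.Put;

    public class PutFlattener {
        // Flatten a client-side Put into KeyValues in the sort order
        // the HFile writer expects (row, family, qualifier, timestamp).
        public static TreeSet<KeyValue> toSortedKeyValues(Put put) {
            TreeSet<KeyValue> sorted = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
            for (List<Cell> cells : put.getFamilyCellMap().values()) {
                for (Cell cell : cells) {
                    sorted.add(KeyValueUtil.ensureKeyValue(cell));
                }
            }
            return sorted;
        }
    }

The resulting set can be fed straight into the writer from the previous sketch; just remember that across different Puts you still have to emit rows in globally sorted order, which is exactly what TotalOrderPartitioner plus the shuffle give you for free in the MapReduce variant.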

I didn't go this way myself: with a good InputFormat and a data-transforming mapper (which I wrote long ago), I can use the standard TotalOrderPartitioner and HFileOutputFormat2 INSIDE the MapReduce framework and get everything working with the full power of the cluster. Surprised by a 10 GB SQL dump loaded in 5 minutes? Not me. You can't beat that speed with a single server.
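For comparison, the in-cluster variant needs very little driver code, because HFileOutputFormat2.configureIncrementalLoad wires up TotalOrderPartitioner and PutSortReducer from the table's region boundaries. A rough sketch, assuming the 0.98-era API; the table name and the TSV-parsing mapper are hypothetical:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadDriver {

        // hypothetical mapper: parses "rowkey<TAB>value" lines into Puts
        public static class TsvToPutMapper
                extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
            @Override
            protected void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = line.toString().split("\t", 2);
                byte[] row = Bytes.toBytes(parts[0]);
                Put put = new Put(row);
                put.add(Bytes.toBytes("f1"), Bytes.toBytes("q1"), Bytes.toBytes(parts[1]));
                ctx.write(new ImmutableBytesWritable(row), put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "bulk-load");
            job.setJarByClass(BulkLoadDriver.class);

            job.setMapperClass(TsvToPutMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // reads the table's region boundaries and configures
            // TotalOrderPartitioner, PutSortReducer and HFileOutputFormat2
            HTable table = new HTable(conf, "mytable");
            HFileOutputFormat2.configureIncrementalLoad(job, table);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The output directory then becomes the input to completebulkload as shown above.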

Admittedly, this solution required careful design of the SQL queries used to pull data out of the SQL database for the ETL process, but by now it's an everyday procedure.

Roman Nikitchenko