Here's what I'm trying to do:
Load data from Hive into HBase, serialized as protocol buffers.
I've tried multiple ways:
1. Creating connections directly to HBase and doing `Put`s row by row. This works, but apparently it's not very efficient (a rough sketch of this approach follows the list).
2. Exporting the JSON table from Hive in S3 as tab-separated text files, then using the `importTsv` utility to generate HFiles and bulk-load them into HBase. This also works.
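For reference, the first approach looks roughly like this (the table name, column family, and qualifier are placeholders; the value is the `toByteArray()` output of my protobuf message):

```java
import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

void putOne(Connection conn, String rowKey, byte[] protoBytes) throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("my_table"))) {
        Put put = new Put(Bytes.toBytes(rowKey));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("proto"), protoBytes);
        // One RPC round trip per row; this is what makes it slow for a full table load.
        table.put(put);
    }
}
```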
But now I want to achieve this in an even more efficient way: read my data from the Hive table in S3, serialize it into protocol buffer objects, then generate HFiles and mount them directly onto HBase.
I'm using a Spark job to read from Hive, which gives me a `JavaRDD`; from there I can build my protocol buffer objects, but I'm at a loss as to how to proceed.
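Concretely, here is roughly where I am. This assumes Spark 2.x (`SparkSession` with Hive support; on Spark 1.x it would be `HiveContext` instead), and `MyProto` is a stand-in for my generated protobuf class, with placeholder table and column names:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

SparkSession spark = SparkSession.builder()
    .appName("hive-to-hbase")
    .enableHiveSupport()   // lets spark.sql() read Hive tables
    .getOrCreate();

// Build (rowKey, serialized protobuf) pairs from the Hive table.
JavaPairRDD<String, byte[]> protoRows = spark
    .sql("SELECT id, payload FROM my_db.my_table")
    .javaRDD()
    .mapToPair(row -> {
        MyProto msg = MyProto.newBuilder()
            .setId(row.getString(0))
            .setPayload(row.getString(1))
            .build();
        return new Tuple2<>(row.getString(0), msg.toByteArray());
    });
```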
So my question: how can I generate HFiles from protocol buffer objects? We don't want to write them out as text files on local disk or HDFS first; how can I generate the HFiles directly from the RDD?
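My best guess is something along the lines of `HFileOutputFormat2` plus `saveAsNewAPIHadoopFile`, followed by `LoadIncrementalHFiles` to mount the result, as in the untested sketch below, but I don't know whether this is the right direction. Here `protoRows` is the `JavaPairRDD<String, byte[]>` from above; the table name, column family, qualifier, and output path are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

void bulkLoad(JavaPairRDD<String, byte[]> protoRows) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
        TableName name = TableName.valueOf("my_table");
        Table table = conn.getTable(name);
        RegionLocator locator = conn.getRegionLocator(name);

        // Borrow the MapReduce setup so compression, block encoding, and
        // region split points end up in the job configuration.
        Job job = Job.getInstance(conf);
        HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

        // HFiles must be written in sorted row-key order.
        // (ImmutableBytesWritable/KeyValue may need Kryo registration to be shuffled.)
        JavaPairRDD<ImmutableBytesWritable, KeyValue> cells = protoRows
            .mapToPair(t -> {
                byte[] row = Bytes.toBytes(t._1);
                KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"),
                                           Bytes.toBytes("proto"), t._2);
                return new Tuple2<>(new ImmutableBytesWritable(row), kv);
            })
            .sortByKey();

        cells.saveAsNewAPIHadoopFile("hdfs:///tmp/hfiles",
            ImmutableBytesWritable.class, KeyValue.class,
            HFileOutputFormat2.class, job.getConfiguration());

        // Mount the generated HFiles directly into the table's regions.
        new LoadIncrementalHFiles(conf)
            .doBulkLoad(new Path("hdfs:///tmp/hfiles"), conn.getAdmin(), table, locator);
    }
}
```

In particular, I'm unsure whether the sorting step and the configuration handoff above are correct.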
Thanks a lot!