
I'm trying to read data from HBase, process it, and then write it to Hive. I'm new to both Scalding and Scala.

I have looked into SpyGlass for reading from HBase. It works well: I can read the data and then write it to a file.

val data = new HBaseSource(
    tableName,
    hbaseHost,
    SCHEMA.head,
    SCHEMA.tail.map((x: Symbol) => "data"),
    SCHEMA.tail.map((x: Symbol) => new Fields(x.name)),
    sourceMode = SourceMode.SCAN_ALL)
  .read
  .fromBytesWritable(SCHEMA)
  .debug
  .write(Tsv(output.format("get_list")))

The question now is how I can write this to Hive. If someone has managed to do this, I would be grateful for a simple example or some help in accomplishing it.

1 Answer


You don't actually need to do anything special to write to Hive; your current code is absolutely fine. Hive simply applies metadata on top of data stored in HDFS, so all you need to do is create a Hive table over the data you're writing. You have two main options. If you want to move your data into the Hive warehouse, you'll need to load it with a command like:

load data inpath '/your/file/or/folder/on/the/hdfs' into table your_table;

If you don't want to move the data, you can create an external Hive table which doesn't move the data. The advantages of an external table are that

  • you don't have to load data into it,
  • dropping the table doesn't delete the data.
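As a minimal sketch of the external-table approach, assuming the TSV output written by the job above and hypothetical table/column names (adjust both to match your actual schema):

```sql
-- your_table and the column names are placeholders; match them to the
-- fields your Scalding job writes out.
CREATE EXTERNAL TABLE your_table (
  row_key STRING,
  col1    STRING,
  col2    STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'   -- matches the Tsv sink in the job
STORED AS TEXTFILE
LOCATION '/your/file/or/folder/on/the/hdfs';
```

Because the table is external, dropping it later removes only the metadata; the files on HDFS are left untouched.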
Ben Watson
  • I haven't thought about that, I guess that the external way is the best way for me. Thank you, I'm gonna try that. – user2299491 Feb 23 '15 at 08:19
  • My best approach right now is to create an external table that is partitioned and uses the ORC format. From Scalding I read data from HBase and write it to an OrcFile (more info here: https://github.com/branky/cascading.hive/blob/master/src/main/scala/com/twitter/scalding/ColumnarSerDeSource.scala), placing it in a new partition of my external table. But I still need to execute a Hive command after that, outside my Scalding application: "alter table table_name add partition(dt='2015-03-17') location 'hdfs://apps/hive/warehouse.../dt=2015-03-21'". I would prefer to do the load entirely through Scalding. – user2299491 Mar 17 '15 at 12:03