I've got a Hadoop SequenceFile where the key is an IntWritable and the value is an arbitrary Java class that implements Writable and has an interesting toString() method. I would love to make a two-column Hive table where the first column is the key as an int and the second column is the value as a string or varchar.

I would love to do this in the simplest, most tasteful way possible: I shouldn't have to write 200 lines of code to say "just decode this and then call toString()".

My current solution is to run an extra MapReduce job that puts the data into the format I want before loading it into Hive, but I find this offensive for obvious reasons.

Thanks!

Joseph Victor

2 Answers


You can read SequenceFiles directly from Hive. For your case, you need to implement org.apache.hadoop.hive.serde2.Deserializer.

In the deserializer you can call the value's toString() method. It should not take more than about 30 lines of code.

Venkat

The following example uses the ThriftDeserializer class as the SerDe for the table. You can create your own SerDe (by implementing Hive's Serializer/Deserializer interfaces) and specify it when creating your table.

CREATE EXTERNAL TABLE IF NOT EXISTS test
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer'
    WITH SERDEPROPERTIES (
        "serialization.format" = "org.apache.thrift.protocol.TCompactProtocol",
        "serialization.class" = "some.package.ClassName")
    STORED AS SEQUENCEFILE;
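
For the custom-SerDe route, the table definition might look roughly like the sketch below. It assumes you have already written your own deserializer and packaged it as a jar. The jar path, the class name com.example.hive.ToStringSerDe, the table and column names, and the location are all hypothetical placeholders, not anything taken from the question. One caveat: as far as I know, Hive's stock SequenceFile reader hands only the record value to the SerDe, so exposing the IntWritable key as a column would probably need extra work (for example, a custom InputFormat).

```sql
-- Hypothetical jar that contains the custom SerDe; adjust the path.
ADD JAR /path/to/tostring-serde.jar;

-- Table, column, and class names below are placeholders for illustration.
-- The SerDe itself decides how each record is turned into these columns.
CREATE EXTERNAL TABLE IF NOT EXISTS test_tostring (
    record_key   INT,
    record_value STRING
)
ROW FORMAT SERDE 'com.example.hive.ToStringSerDe'
STORED AS SEQUENCEFILE
LOCATION '/path/to/sequencefiles';
```

Once the table exists, a quick SELECT record_key, record_value FROM test_tostring LIMIT 10; is a reasonable way to check that the values decode as expected.
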
Param