
I have a large .txt file of records that I need to convert into a (hadoop) sequence format for efficiency. I have found some answers to this online (such as How to convert .txt file to Hadoop's sequence file format), but I'm new to hadoop and don't really understand them. If you could explain these a little more, or if you have another solution, that'd be great. If it helps, the records are separated by line.

Thanks in advance.

  • How do you want the line to be tokenized into a Key and Value? (Typically the Key is a line number and the Value is the line text) – Chris White Jun 22 '12 at 01:55
  • Like you said. Key:line number, Value:line text. – Jonathan Jun 22 '12 at 02:02
  • In the answer you have linked to, which specific part did you not understand or want more clarification on? – Hari Menon Jun 22 '12 at 03:02
  • 1
    Normally the key is the byte offset and the value is the line of text, just to clarify. – Thomas Jungblut Jun 22 '12 at 08:09
  • Well, the top answer looked pretty complete, so that was what I tried first. At the top of the code I added import statements, but when I compile I get errors. I think some of the code is deprecated. The second answer seemed simpler, but incomplete. Again, I'm brand new to hadoop. – Jonathan Jun 22 '12 at 14:41

1 Answer


Since you said you are new to hadoop: do you know the basic idea of a Mapper and a Reducer? Both are parameterized with KEY_IN_CLASS, VALUE_IN_CLASS, KEY_OUT_CLASS, and VALUE_OUT_CLASS types, so in your case you can simply use a mapper to do the conversion.

For KEY_IN_CLASS, you can use the default LongWritable.

For VALUE_IN_CLASS, you need to use Text, since the Text class handles lines of text input.

For KEY_OUT_CLASS, you can use NullWritable if you don't have a specific key; otherwise keep LongWritable to preserve the byte offset.

For VALUE_OUT_CLASS, use Text as well. SequenceFileOutputFormat is not a value class; it is the job's output format. To write a sequence file, set it with job.setOutputFormatClass(SequenceFileOutputFormat.class), and tell the job which output key and value classes you use via job.setOutputKeyClass(...) and job.setOutputValueClass(...).
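Putting the pieces above together, here is a minimal sketch of a map-only job that does the conversion. It assumes the newer org.apache.hadoop.mapreduce API and a Hadoop installation on the classpath; the class name TextToSequenceFile and the mapper name ConvertMapper are my own choices, not anything from your setup. The mapper just passes each (byte offset, line) pair straight through, and the job keeps LongWritable as the output key rather than NullWritable:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TextToSequenceFile {

    // Identity-style mapper: emits the input (byte offset, line) pair unchanged.
    public static class ConvertMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "text to sequence file");
        job.setJarByClass(TextToSequenceFile.class);
        job.setMapperClass(ConvertMapper.class);
        job.setNumReduceTasks(0);                       // map-only job, no reducer needed
        job.setOutputKeyClass(LongWritable.class);      // KEY_OUT_CLASS
        job.setOutputValueClass(Text.class);            // VALUE_OUT_CLASS
        job.setOutputFormatClass(SequenceFileOutputFormat.class); // write a SequenceFile
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would run it with something like `hadoop jar convert.jar TextToSequenceFile /input/records.txt /output/seq`, where the input and output paths are placeholders for your own.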

Chun