
I am working with Hadoop MapReduce and have a question. My mapper's input key/value types are LongWritable, LongWritable, and its output key/value types are also LongWritable, LongWritable. The input format is SequenceFileInputFormat. Basically, I want to convert a text file into a sequence file so that I can feed it to my mapper.

The input file is tab-separated, one key/value pair per line, something like this:

1\t2 (key = 1, value = 2)

2\t3 (key = 2, value = 3)

and so on.

I looked at the thread "How to convert .txt file to Hadoop's sequence file format", but realized that TextInputFormat only supports key = LongWritable and value = Text.

Is there any way to read a text file and write a sequence file with KV = LongWritable, LongWritable?

user1566629

1 Answer


Sure, basically the same way I described in the other thread you linked, but you have to implement your own Mapper.

Here is a quick sketch:

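import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class LongLongMapper extends
    Mapper<LongWritable, Text, LongWritable, LongWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    // the incoming key is the byte offset of the line and is ignored;
    // assuming that your line contains key and value separated by \t
    String[] split = value.toString().split("\t");

    context.write(new LongWritable(Long.parseLong(split[0])),
        new LongWritable(Long.parseLong(split[1])));
  }

  public static void main(String[] args) throws IOException,
      InterruptedException, ClassNotFoundException {

    Configuration conf = new Configuration();
    // on Hadoop 2+, Job.getInstance(conf) is the non-deprecated equivalent
    Job job = new Job(conf);
    job.setJobName("Convert Text");
    job.setJarByClass(LongLongMapper.class);

    // use the mapper defined above, not the identity Mapper base class
    job.setMapperClass(LongLongMapper.class);

    // map-only job; increase if you need sorting or a special number of files
    // (the default identity Reducer is then used)
    job.setNumReduceTasks(0);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(LongWritable.class);

    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setInputFormatClass(TextInputFormat.class);

    FileInputFormat.addInputPath(job, new Path("/input"));
    FileOutputFormat.setOutputPath(job, new Path("/output"));

    // submit, wait for completion, and report failure through the exit code
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}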

Each call to your map function receives one line of your input as its value, so we just split it on your delimiter (tab) and parse each part into a long.
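If your input might contain blank or malformed lines, the split-and-parse step can be pulled into a small helper that you can check without a cluster. This is only a minimal sketch in plain Java with no Hadoop dependency. The class name TabSeparatedLongs and the choice to return null for bad lines are my own assumptions for illustration, not part of any Hadoop API.

```java
// Hypothetical helper (name and null-on-error policy are assumptions):
// parses one "key<TAB>value" line into two longs.
final class TabSeparatedLongs {

  private TabSeparatedLongs() {
    // utility class, not meant to be instantiated
  }

  /**
   * Returns {key, value} for a line such as "1\t2", or null if the line
   * does not consist of exactly two tab-separated long values.
   */
  static long[] parse(String line) {
    // limit -1 keeps trailing empty fields, so "1\t" yields two parts
    String[] parts = line.split("\t", -1);
    if (parts.length != 2) {
      return null;
    }
    try {
      long key = Long.parseLong(parts[0].trim());
      long value = Long.parseLong(parts[1].trim());
      return new long[] { key, value };
    } catch (NumberFormatException e) {
      // empty or non-numeric field
      return null;
    }
  }
}
```

Inside map you could then call TabSeparatedLongs.parse(value.toString()) and only call context.write when the result is not null, so a single bad line does not fail the whole job. Whether to skip, count, or fail on bad lines depends on your data.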

That's it.

Thomas Jungblut