
I am trying to modify this code to produce a full inverted index. By that I mean producing, for each word, the positions at which it occurs in each file. For example, if we have two files containing the words

  abc.txt =    I am coming to the park to play, yes i am.

  def.txt = Please come on over, i will be waiting for you

I should get output like this:

i /home/abc.txt: 1 10 /home/def.txt: 5

This means the word "i" is the 1st and 10th word in file abc.txt and the 5th word in file def.txt.

I have modified the code to provide "word location and word frequency" as shown below:

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class WordCountByFile extends Configured implements Tool {

    public static void main(String args[]) throws Exception {
        String[] argsLocal = {
            "input#2", "output#2"
        };
        int res = ToolRunner.run(new WordCountByFile(), argsLocal);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);

        Configuration conf = getConf();
        Job job = Job.getInstance(conf, this.getClass().toString());

        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setJobName("WordCountByFile");
        job.setJarByClass(WordCountByFile.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static class Map extends Mapper < LongWritable, Text, Text, IntWritable > {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // The input file is the same for every token in this split,
            // so look it up once instead of on every loop iteration.
            String filePathString = ((FileSplit) context.getInputSplit()).getPath().toString();

            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken() + " " + filePathString + " : ");
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer < Text, IntWritable, Text, IntWritable > {

        @Override
        public void reduce(Text key, Iterable < IntWritable > values, Context context) 
            throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value: values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

I know this needs some kind of indexing, as it would in plain Java, but I'm trying to figure out how to do that in Hadoop MapReduce. Any help?

1 Answer


Just a few thoughts about your problem.

Input format:

TextInputFormat treats every line of the input files as a separate input record, so a mapper never sees the whole file at once and cannot number words from the start of the file. You should instead use an input format that delivers the whole file as a single input record. You can use a WholeFileRecordReader, for example.
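A minimal sketch of such an input format, assuming a `WholeFileRecordReader` class (as in the example above, not shown here) that reads an entire file into a single `BytesWritable` value:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split a file: each file becomes exactly one record,
        // so one map() call sees the file's complete contents.
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        // Delegate to a record reader that loads the whole file
        // into a single BytesWritable value.
        return new WholeFileRecordReader();
    }
}
```

You would then register it with `job.setInputFormatClass(WholeFileInputFormat.class);` in place of `TextInputFormat`.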

Mapper:

The mapper should emit one record per word in the input file. The output key is the word, and the output value is any structure that carries the input file and the position of that word in the file. You can write your own Writable class for this, or pack the info into a string and emit a Text value, as you do now.
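The core of that map logic, stripped of the Hadoop wrapper so it is easy to follow, could look like this (method and class names are illustrative; in the real mapper you would call `context.write` instead of collecting strings):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class PositionDemo {

    // Tokenize a file's full contents and pair each word with its
    // 1-based position in that file. In the real mapper, the word is
    // the output key and "filePath:position" is the output value.
    static List<String> wordPositions(String fileContents, String filePath) {
        List<String> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(fileContents);
        int position = 0;
        while (tokenizer.hasMoreTokens()) {
            position++;  // running count from the start of the file
            pairs.add(tokenizer.nextToken() + " -> " + filePath + ":" + position);
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String p : wordPositions("I am coming to the park", "/home/abc.txt")) {
            System.out.println(p);  // e.g. "I -> /home/abc.txt:1"
        }
    }
}
```

Counting the position this way only works if the mapper sees the whole file as one record, which is exactly why the input format above matters.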

Reducer:

The reducer should merge the info for each word: loop through all values passed to it for a given key, group them by file, and build a result string in the format you described.
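Here is a sketch of that merge step as plain Java, assuming the mapper emitted values of the form `"filePath:position"` (names are illustrative; in the real reducer the values arrive as an `Iterable<Text>` and their order within a file depends on how they were emitted and sorted):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MergeDemo {

    // Merge all "filePath:position" values for one word into a line
    // like "i /home/abc.txt: 1 10 /home/def.txt: 5".
    static String merge(String word, List<String> values) {
        // Group positions by file; TreeMap gives a stable, sorted file order.
        Map<String, List<String>> byFile = new TreeMap<>();
        for (String v : values) {
            // lastIndexOf handles paths that themselves contain ':' (e.g. hdfs://...)
            int sep = v.lastIndexOf(':');
            String file = v.substring(0, sep);
            String pos = v.substring(sep + 1);
            byFile.computeIfAbsent(file, k -> new ArrayList<>()).add(pos);
        }
        StringBuilder sb = new StringBuilder(word);
        for (Map.Entry<String, List<String>> e : byFile.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':');
            for (String p : e.getValue()) {
                sb.append(' ').append(p);
            }
        }
        return sb.toString();
    }
}
```

With values `/home/abc.txt:1`, `/home/abc.txt:10`, `/home/def.txt:5` for the key `i`, this produces `i /home/abc.txt: 1 10 /home/def.txt: 5`, matching the format in the question.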

maxteneff