
I built my Naive Bayes classifier for text classification in Java. Now I am trying to port it to Hadoop. I have built the model using a mapper and a reducer, and its output looks like this:

label1,word1    count
label1,word2    count
label1,word3    count
.
.
.
label2,word1    count
label2,word2    count
label2,word3    count
.
label3,word1    count
.
label4,word1    count
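
For context: as far as I understand, these counts are the pieces needed for the usual (Laplace-smoothed) Naive Bayes decision rule, which is what the test step has to compute for every document:

score(label) = log P(label)
             + sum over each word w in the doc of log( (count(label, w) + 1) / (totalWordsIn(label) + vocabSize) )

predicted label = the label with the highest score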

In total there are 4 labels that I have to classify the test data into. After building the model I cannot figure out how to use it to classify the test data with MapReduce. Here is my current code:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TrainHadoop extends Configured implements Tool {
private static final String OUTPUT_PATH = "/user/nitin/interOutput";

private  static String[] classes = {"CCAT", "ECAT", "GCAT", "MCAT"};
// for training data
public static class TrainMap extends Mapper<LongWritable, Text, Text, IntWritable>{

    private static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public Vector<String> tokenizeDoc(String cur_doc) {
        String[] words = cur_doc.split("\\s+");
        Vector<String> tokens = new Vector<String>();
        for (int i = 0; i < words.length; i++) {
            words[i] = words[i].replaceAll("\\W|_", "");
            if (words[i].length() > 0) {
                tokens.add(words[i].toLowerCase());
            }
        }
        return tokens;  
    }

    public void map(LongWritable key, Text value, Context context )
        throws IOException, InterruptedException{
        String[] line  = value.toString().split("\t");
        String[] labelsArray = line[0].split(",");

        Vector<String> indivWord = tokenizeDoc(line[1]);

        List<String> finalLabelsArray = new ArrayList<>();
        for (int i = 0; i < classes.length; i++) {
            for (int j = 0; j < labelsArray.length; j++) {
                if(classes[i].equals(labelsArray[j])){
                    finalLabelsArray.add(classes[i]);
                }
            }
        }
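        // The writes below produce all the statistics the model needs, keyed so the
        // reducer can simply sum the values:
        //   "labelsInstances"   -> total number of (document, label) training instances
        //   "<label>"           -> number of training documents per label (for the prior)
        //   "<label>*"          -> total number of word tokens seen for that label
        //   "<label>^,<word>"   -> occurrences of <word> in documents of <label>
        //   "A=<word>"          -> one key per distinct word, to derive the vocabulary size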
        word.set("labelsInstances");
        context.write(word, new IntWritable(finalLabelsArray.size()));

        for(String label : finalLabelsArray){
            // number of training documents for each class
            context.write(new Text(label), one);

            // total no. of words for each class
            context.write(new Text(label + "*"), new IntWritable(indivWord.size()));

            // for each class, count the occurrences of each word
            for (int i  = 0; i < indivWord.size(); i++) {
                context.write(new Text(label + "^," + indivWord.get(i)), one);
            }
            // for vocab size
            for (int i  = 0; i < indivWord.size(); i++) {
                context.write(new Text("A=" + indivWord.get(i)), one);
            }
        }
    }
}

// mappers for classifying the test data set
public static class TestMap1 extends Mapper<LongWritable, Text, Text, IntWritable>{

}

public static class TestMap2 extends Mapper<LongWritable, Text, Text, IntWritable>{


}

public static class TrainReduce extends Reducer<Text,IntWritable,Text,IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) 
        throws IOException, InterruptedException{
        int sum = 0;
        for(IntWritable val : values){
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}   

public static class TestReduce extends Reducer<Text,IntWritable,Text,IntWritable> {
}

@Override
public int run(String[] args) throws Exception {
  /*
   * Job 1
   */

        Configuration trainConf = new Configuration();
        Job trainJob  = Job.getInstance(trainConf, "training");

        trainJob.setJarByClass(TrainHadoop.class);


        trainJob.setMapperClass(TrainMap.class);
        trainJob.setReducerClass(TrainReduce.class);
        //trainJob.setCombinerClass(Reduce.class);

        trainJob.setInputFormatClass(TextInputFormat.class);
        trainJob.setOutputFormatClass(TextOutputFormat.class);

        // output from reducer and mapper
        trainJob.setOutputKeyClass(Text.class);
        trainJob.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(trainJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(trainJob, new Path(OUTPUT_PATH));    

        if (!trainJob.waitForCompletion(true)) {
            return 1;
        }
    /*
     * Job 2
     */

        Configuration testConf = new Configuration();
        Job testJob = Job.getInstance(testConf, "testing");

        testJob.setJarByClass(TrainHadoop.class);                                                                                           

        // the mapper for each input path is assigned via MultipleInputs below
        testJob.setReducerClass(TestReduce.class);
        //testJob.setCombinerClass(Reduce.class);

        testJob.setInputFormatClass(TextInputFormat.class);
        testJob.setOutputFormatClass(TextOutputFormat.class);

        // output from reducer and mapper
        testJob.setOutputKeyClass(Text.class);
        testJob.setOutputValueClass(IntWritable.class);


        MultipleInputs.addInputPath(testJob, new Path(OUTPUT_PATH + "/part-r-[0-9]*"), TextInputFormat.class, TestMap1.class);
        MultipleInputs.addInputPath(testJob, new Path(args[1]), TextInputFormat.class, TestMap2.class);
        FileOutputFormat.setOutputPath(testJob, new Path(args[2])); 

        return testJob.waitForCompletion(true) ? 0 : 1;     
}

public static void main(String[] args) throws Exception{
    System.exit(ToolRunner.run(new Configuration(), new TrainHadoop(), args));
}
}
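
To show concretely where I am stuck, here is a rough sketch of the kind of test-side mapper I have in mind (not tested; ClassifyMapper and model.txt are just placeholder names). The idea is to load the first job's merged output into memory in setup() (assuming the model is small enough and is shipped to each task as a local file, e.g. with -files), then for each test document add up the smoothed log probabilities per label and write out the label with the highest score, the same way my non-Hadoop version works:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: assumes the merged model output of the training job is available
// to every task as a local file called "model.txt", and that each test line
// looks like "<docId>\t<document text>".
public class ClassifyMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final String[] CLASSES = {"CCAT", "ECAT", "GCAT", "MCAT"};

    private final Map<String, Long> wordCountPerLabel = new HashMap<>();  // "label^,word" -> count
    private final Map<String, Long> totalWordsPerLabel = new HashMap<>(); // "label"       -> token total
    private final Map<String, Long> docsPerLabel = new HashMap<>();       // "label"       -> doc count
    private long totalLabelInstances = 0;                                 // "labelsInstances" value
    private long vocabSize = 0;                                           // number of distinct "A=word" keys

    @Override
    protected void setup(Context context) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("model.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");
                String modelKey = parts[0];
                long count = Long.parseLong(parts[1]);
                if (modelKey.equals("labelsInstances")) {
                    totalLabelInstances = count;
                } else if (modelKey.startsWith("A=")) {
                    vocabSize++;
                } else if (modelKey.endsWith("*")) {
                    totalWordsPerLabel.put(modelKey.substring(0, modelKey.length() - 1), count);
                } else if (modelKey.contains("^,")) {
                    wordCountPerLabel.put(modelKey, count);
                } else {
                    docsPerLabel.put(modelKey, count);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] line = value.toString().split("\t");

        // tokenize the same way as tokenizeDoc() in the training mapper
        List<String> tokens = new ArrayList<>();
        for (String raw : line[1].split("\\s+")) {
            String word = raw.replaceAll("\\W|_", "").toLowerCase();
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }

        String bestLabel = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : CLASSES) {
            // log prior P(label)
            double score = Math.log((double) docsPerLabel.getOrDefault(label, 0L) / totalLabelInstances);
            long labelTokens = totalWordsPerLabel.getOrDefault(label, 0L);
            for (String word : tokens) {
                long count = wordCountPerLabel.getOrDefault(label + "^," + word, 0L);
                // Laplace-smoothed log P(word | label)
                score += Math.log((count + 1.0) / (labelTokens + vocabSize));
            }
            if (score > bestScore) {
                bestScore = score;
                bestLabel = label;
            }
        }
        context.write(new Text(line[0]), new Text(bestLabel));
    }
}

What I am unsure about is whether loading the model in setup() like this is the right MapReduce pattern, or whether the counts should instead be joined with the test data through the shuffle, which is what the MultipleInputs setup with TestMap1 and TestMap2 above was trying to do.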
  • You're missing a step, which is actually writing the probability of P(label | string) instead of the count (and also the prior probability of P(label)). Then testing is basically just emitting words for each possible label and multiplying probabilities. – Thomas Jungblut Sep 15 '15 at 18:22
  • I am thinking that I first build the model, getting the count of each word in the training data for a particular class; then in the test job (2nd job), for each class I find the probability of each word in the test doc, add the log probabilities of all the words, and emit the label with the maximum probability. That's how I did it without Hadoop, just piping the outputs between steps in Linux... I am confused about how to do it using MapReduce. – Nicky Sep 15 '15 at 18:28
