The goal is to measure how similar a file X is to each of the files y1, y2, ..., yn.
For each file, I extract information and store it in data structures; for example, I compute a word count for a file and store the result in a HashMap<String, Integer> wordCount (other structures store other information).
So I need to generate the wordCount of fileX, read the wordCount of fileY (pre-generated and written to HDFS files), and calculate how similar the two word counts are. A line-by-line diff is not an option; I need the similarity as a percentage.
fileX is fixed and must be compared against N fileY files.
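To make "similarity as a percentage" concrete, here is a minimal sketch that uses cosine similarity between two word-count maps, scaled to 0..100. The choice of cosine similarity is an assumption for illustration (the post does not say which measure is used), and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: the similarity measure is an assumption, not the one from the post.
final class WordCountSimilarity {

    /** Cosine similarity of two word-count maps, scaled to the range 0..100. */
    static double cosinePercent(Map<String, Integer> x, Map<String, Integer> y) {
        double dot = 0.0;
        double normX = 0.0;
        double normY = 0.0;
        for (Map.Entry<String, Integer> e : x.entrySet()) {
            int cx = e.getValue();
            normX += (double) cx * cx;
            Integer cy = y.get(e.getKey());
            if (cy != null) {
                dot += (double) cx * cy; // only words present in both files contribute
            }
        }
        for (int cy : y.values()) {
            normY += (double) cy * cy;
        }
        if (normX == 0.0 || normY == 0.0) {
            return 0.0; // an empty word count is treated as "nothing in common"
        }
        return 100.0 * dot / Math.sqrt(normX * normY);
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCountX = new HashMap<>();
        wordCountX.put("hadoop", 3);
        wordCountX.put("mapper", 1);

        Map<String, Integer> wordCountY = new HashMap<>();
        wordCountY.put("hadoop", 1);
        wordCountY.put("reducer", 2);

        System.out.printf("similarity: %.1f%%%n", cosinePercent(wordCountX, wordCountY));
    }
}
```

Because the function only needs the two maps, it could be called from a mapper or a reducer once both word counts are available in the same task.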
So my idea was:
Job1: compute the information for fileX and write it to HDFS.
Job2 (a ChainMapper of Map1 and Map2):
Map1: reads the HashMap<String, Integer> wordCount of fileX and passes the structures to Map2.
Map2: gets two inputs: the structures of fileX and the path to the directory of fileY files.
Map2 calculates the similarity of HashMap<String, Integer> wordCountX and HashMap<String, Integer> wordCountY; the reducer receives all the similarity values and orders them.
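The final ordering step of this plan can be sketched in plain Java, outside of Hadoop. This is only an illustration under the assumption that each fileY ends up with one similarity score; the class name and the file names in the example are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating the "order the similarities" step.
final class SimilarityRanking {

    /** Returns the (file name, similarity) pairs sorted from most to least similar. */
    static List<Map.Entry<String, Double>> rank(Map<String, Double> scores) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("y1", 42.0);
        scores.put("y2", 87.5);
        scores.put("y3", 10.0);

        for (Map.Entry<String, Double> entry : rank(scores)) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Collecting every score in memory like this only works when all values reach a single reducer and N is small enough to fit in memory.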
I have read about MultipleInputs in Hadoop: The Definitive Guide by Tom White and online, but it is not about feeding two inputs to one mapper; it selects different mappers for different inputs. So I want to ask how to pass two values to a single mapper. I have considered using the distributed cache, but it does not seem useful for this problem. Finally, how can I be sure that each mapper gets a different fileY?
I have tried to update a global HashMap<String, Integer> wordCount, but when a new job starts it cannot access that structure (or rather, the structure is empty).
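import java.util.HashMap;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Matching extends Configured implements Tool {
    // Lives only in the driver JVM; map and reduce tasks run in separate JVMs.
    private static HashMap<String, Integer> wordCountX;

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Matching(), args);
        System.exit(res);
    } // end main

    public int run(String[] args) throws Exception {
        ...
    }
}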
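The static field is empty inside the tasks because it is filled in the driver's JVM, while mappers and reducers run in their own JVMs and get a fresh copy of the class. One way to hand a small map to the tasks, sketched below as an assumption rather than as the approach used in this post, is to encode it as a string in the driver, store it with Configuration.set, and decode it in the mapper's setup method from context.getConfiguration().get. The codec class and the format (one "word<TAB>count" pair per line, words assumed to contain no newline) are hypothetical, and this is only reasonable for small maps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical codec: turns a word count into a String and back,
// so it could travel inside the job Configuration.
final class WordCountCodec {

    /** Encodes the map as lines of "word<TAB>count". Assumes words contain no newline. */
    static String encode(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(e.getKey()).append('\t').append(e.getValue());
        }
        return sb.toString();
    }

    /** Decodes a string produced by encode back into a map. */
    static Map<String, Integer> decode(String encoded) {
        Map<String, Integer> counts = new HashMap<>();
        if (encoded == null || encoded.isEmpty()) {
            return counts;
        }
        for (String line : encoded.split("\n")) {
            int tab = line.lastIndexOf('\t'); // the count follows the last tab
            counts.put(line.substring(0, tab), Integer.parseInt(line.substring(tab + 1)));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCountX = new HashMap<>();
        wordCountX.put("hadoop", 3);
        wordCountX.put("mapper", 1);

        // Driver side (sketch): conf.set("some.key", encode(wordCountX));
        // Task side (sketch):   decode(context.getConfiguration().get("some.key"));
        String encoded = encode(wordCountX);
        System.out.println(decode(encoded).equals(wordCountX)); // prints "true"
    }
}
```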
EDIT:
behold's answer is a nice solution.
I am adding the resulting code below.
Launching job:
//configuration and launch of job
Job search = Job.getInstance(getConf(), "2. Merging and searching");
search.setJarByClass(this.getClass());
MultipleInputs.addInputPath(search, creationPath, TextInputFormat.class);
MultipleInputs.addInputPath(search, toMatchPath, TextInputFormat.class);
FileOutputFormat.setOutputPath(search, resultPath);
search.setNumReduceTasks(Integer.parseInt(args[2]));
search.setMapperClass(Map.class);
search.setReducerClass(Reduce.class);
search.setMapOutputKeyClass(ValuesGenerated.class);
search.setMapOutputValueClass(IntWritable.class);
//TODO
search.setOutputKeyClass(NullWritable.class);
search.setOutputValueClass(Text.class);
return search.waitForCompletion(true) ? 0 : 1;
Map merging (in cleanup phase):
@Override
public void cleanup(Context context) throws IOException, InterruptedException {
    // Find out which input file this mapper processed.
    InputSplit split = context.getInputSplit();
    Class<? extends InputSplit> splitClass = split.getClass();

    FileSplit fileSplit = null;
    if (splitClass.equals(FileSplit.class)) {
        fileSplit = (FileSplit) split;
    } else if (splitClass.getName().equals(
            "org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit")) {
        // MultipleInputs wraps the real split in a TaggedInputSplit, which is
        // not a public class, so unwrap it via reflection.
        try {
            Method getInputSplitMethod = splitClass.getDeclaredMethod("getInputSplit");
            getInputSplitMethod.setAccessible(true);
            fileSplit = (FileSplit) getInputSplitMethod.invoke(split);
        } catch (Exception e) {
            // wrap and re-throw error
            throw new IOException(e);
        }
    }
    if (fileSplit == null) {
        // Neither branch matched; fail clearly instead of with a NullPointerException.
        throw new IOException("Unexpected split type: " + splitClass.getName());
    }

    String filename = fileSplit.getPath().getName();

    /*
      The two input files are named dynamically:
      file0 has a name like "023901.txt";
      file1 is the output of a previous MR job and has a name like "chars-r-000000".
    */
    boolean isKnown = !filename.contains(".txt");

    // ValuesGenerated is the custom key class; this assumes it has a no-arg constructor.
    ValuesGenerated generated = new ValuesGenerated();
    if (isKnown) { // file1, known
        generated.setName(new Text(name)); // name is defined elsewhere in the mapper (not shown)
        //other values set
        //...
        context.write(generated, new IntWritable(1));
    } else { // file0, unknown
        generated.setName(new Text("unknown"));
        //other values set
        //...
        context.write(generated, new IntWritable(0));
    }
}
Reduce phase:
public static class Reduce extends Reducer<ValuesGenerated, IntWritable, NullWritable, Text> {
    @Override
    public void reduce(ValuesGenerated key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        ValuesGenerated known;
        ValuesGenerated unk;
        String toEmit = null;
        for (IntWritable value : values) {
            if (value.get() == 1) { // known
                known = key;
                toEmit = key.toString();
                toEmit += "\n " + value;
                context.write(NullWritable.get(), new Text(toEmit));
            } else { // unknown
                unk = key;
                toEmit = key.toString();
                toEmit += "\n " + value;
                context.write(NullWritable.get(), new Text(toEmit));
            }
        }
    } // end reduce
} // end Reduce class
I ran into another problem, but I worked around it with the solution from "hadoop MultipleInputs fails with ClassCastException".