The goal is to measure how similar a file X is to files y1, y2, ..., yn.

For each file I extract information and store it in data structures; let's say that from each file I compute a word count and store the results in a HashMap<String, Integer> wordCount (there are other structures storing other information).
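To make the setup concrete, here is a minimal sketch of how such a word count could be extracted; the class name, tokenization, and file-reading details are assumptions for illustration, not the actual code:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.stream.Stream;

public class WordCountExtractor {

    // Build a word -> frequency map from a plain-text file.
    public static HashMap<String, Integer> wordCount(String path) throws IOException {
        HashMap<String, Integer> counts = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(path))) {
            lines.flatMap(line -> Stream.of(line.toLowerCase().split("\\W+")))
                 .filter(w -> !w.isEmpty())
                 .forEach(w -> counts.merge(w, 1, Integer::sum));
        }
        return counts;
    }
}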

So I need to generate the wordCount of fileX; read the wordCount of each fileY (pre-generated and written to HDFS files); and calculate how similar these two word counts are (I cannot make a line-by-line difference; I need the similarity as a percentage).

FileX is fixed and needs to be compared against N fileYs.

So my idea was:

Job1: compute fileX's information and write it to HDFS.

Job2 (ChainMapper of map1-map2):

Map1: reads the HashMap<String, Integer> wordCount of fileX and passes the structures to Map2.

Map2: gets two inputs: the structures of fileX and the path to the directory of the fileYs.

Map2 calculates the similarity between HashMap<String, Integer> wordCountX and HashMap<String, Integer> wordCountY; the reducer gets all the similarity values and orders them.

I have read about MultipleInputs in Hadoop: The Definitive Guide by Tom White and online, but it is not about feeding two inputs to one mapper; it is about choosing different mappers based on the input. So I want to ask: how can I forward two values to a single mapper? I have considered the distributed cache, but it does not really help with this problem. And finally, how can I be sure that each mapper gets a different fileY?

I have tried updating a global HashMap<String, Integer> wordCount, but when a new job starts it cannot access that structure (or rather, it is empty).

public class Matching extends Configured implements Tool{

    private static HashMap<String, Integer> wordCountX;

    public static void main(String[] args) throws Exception {

        int res = ToolRunner.run(new Matching(), args);
        System.exit(res);

    } // end main

    public int run(String[] args) throws Exception {
        ...
    }

}
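Since each job runs in separate JVMs, a static field filled in one job is simply gone (or empty) in the next; the state has to travel through HDFS or the job Configuration. A minimal sketch of how the second job's Mapper could rebuild wordCountX from the file Job1 wrote, assuming one "word<TAB>count" pair per line and a made-up configuration key wordcount.x.path:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MatchingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final HashMap<String, Integer> wordCountX = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Re-read the word count that Job1 wrote to HDFS.
        Path xPath = new Path(context.getConfiguration().get("wordcount.x.path"));
        FileSystem fs = xPath.getFileSystem(context.getConfiguration());
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(xPath)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length == 2) {
                    wordCountX.put(parts[0], Integer.parseInt(parts[1]));
                }
            }
        }
    }
}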

EDIT:

behold's answer is a nice solution.

I am adding the resulting code snippets.

Launching the job:

// configuration and launch of the job
Job search = Job.getInstance(getConf(), "2. Merging and searching");
search.setJarByClass(this.getClass());

MultipleInputs.addInputPath(search, creationPath, TextInputFormat.class);
MultipleInputs.addInputPath(search, toMatchPath, TextInputFormat.class);

FileOutputFormat.setOutputPath(search, resultPath);
search.setNumReduceTasks(Integer.parseInt(args[2]));

search.setMapperClass(Map.class);
search.setReducerClass(Reduce.class);

search.setMapOutputKeyClass(ValuesGenerated.class);
search.setMapOutputValueClass(IntWritable.class);
//TODO
search.setOutputKeyClass(NullWritable.class);
search.setOutputValueClass(Text.class);

return search.waitForCompletion(true) ? 0 : 1;

Mapper merging (in the cleanup phase):

@Override
public void cleanup(Context context) throws IOException, InterruptedException {

    InputSplit split = context.getInputSplit();
    Class<? extends InputSplit> splitClass = split.getClass();

    FileSplit fileSplit = null;
    if (splitClass.equals(FileSplit.class)) {
        fileSplit = (FileSplit) split;
    } else if (splitClass.getName().equals(
            "org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit")) {
        // begin reflection hackery...

        try {
            Method getInputSplitMethod = splitClass
                    .getDeclaredMethod("getInputSplit");
            getInputSplitMethod.setAccessible(true);
            fileSplit = (FileSplit) getInputSplitMethod.invoke(split);
        } catch (Exception e) {
            // wrap and re-throw error
            throw new IOException(e);
        }

        // end reflection hackery
    }

    String filename = fileSplit.getPath().getName();

    /*
    The two input files are named dynamically:
    file0 has a name like "023901.txt",
    file1 is the output of a previous MR job and is
    something like "chars-r-000000".
    */
    boolean isKnown = !filename.contains(".txt");

    // valuesGenerated is this mapper's instance of the custom ValuesGenerated
    // writable; it and the 'name' field are set in code elided here.
    if (isKnown) { // file1, known

        valuesGenerated.setName(new Text(name));

        // other values set
        // ...

        context.write(valuesGenerated, new IntWritable(1));

    } else { // file0, unknown

        valuesGenerated.setName(new Text("unknown"));

        // other values set
        // ...

        context.write(valuesGenerated, new IntWritable(0));

    }
}
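For completeness, a minimal sketch of what the map() side of this same Mapper could look like, only accumulating counts in memory so that cleanup() above emits once per split; it shows just the raw-text case, and the field name counts is an assumption (the real aggregation and the ValuesGenerated fields are elided):

// inside the same Mapper class as the cleanup() above
private final HashMap<String, Integer> counts = new HashMap<>();

@Override
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // accumulate only; nothing is written until cleanup()
    for (String token : value.toString().toLowerCase().split("\\W+")) {
        if (!token.isEmpty()) {
            counts.merge(token, 1, Integer::sum);
        }
    }
}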

Reduce phase:

public static class Reduce extends Reducer<ValuesGenerated, IntWritable, NullWritable, Text> {

    @Override
    public void reduce(ValuesGenerated key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        for (IntWritable value : values) {
            // value 1 marks the known file (file1), value 0 the unknown one (file0)
            String toEmit = key.toString() + "\n " + value;
            context.write(NullWritable.get(), new Text(toEmit));
        }

    } // end reduce

} // end Reduce class

I encountered another problem, but I worked around it with this solution: hadoop MultipleInputs fails with ClassCastException.

– Pleasant94

2 Answers


You can have multiple files as input to the same Mapper simply by adding multiple input paths. You can then use the mapper's Context to identify which file split comes from which file location.

So basically,

Step 1: MR job

  • Read files 1 and 2

  • In the mapper, emit <word, [val1, val2]> (val1 is 1 if the file split comes from file1 and 0 otherwise; similarly for val2)

  • In the reducer, write a map <word, [file1_count, file2_count]>

Step 2: merge the shards (a word count can't be that large and should fit on a single machine) and use a simple Java job to compute a custom similarity metric (a sketch follows below).
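A minimal sketch of step 2, assuming the merged shards have been loaded back into two word-count maps; the metric here (histogram intersection over relative frequencies, expressed as a percentage) is just one possible choice, not something prescribed by the answer:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Similarity {

    // Percentage similarity of two word counts based on relative frequencies:
    // for each word take the overlapping share of its frequency in X and in Y,
    // then sum the overlaps (100 means identical distributions).
    public static double similarityPercent(Map<String, Integer> wordCountX,
                                           Map<String, Integer> wordCountY) {
        double totalX = wordCountX.values().stream().mapToInt(Integer::intValue).sum();
        double totalY = wordCountY.values().stream().mapToInt(Integer::intValue).sum();
        if (totalX == 0 || totalY == 0) {
            return 0.0;
        }

        Set<String> words = new HashSet<>(wordCountX.keySet());
        words.addAll(wordCountY.keySet());

        double overlap = 0.0;
        for (String w : words) {
            double freqX = wordCountX.getOrDefault(w, 0) / totalX;
            double freqY = wordCountY.getOrDefault(w, 0) / totalY;
            overlap += Math.min(freqX, freqY);
        }
        return overlap * 100.0;
    }
}

With a metric of this kind, two files with the same relative word frequencies score close to 100% even if their absolute counts differ, which matches the intent described in the question.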

– behold

Instead of the global variable, you could use a database or even write to a file.

Compare each word's frequency as a proportion of its HashMap's total and keep the words whose proportions match:

HashMap<String, Double> similarities = new HashMap<>();
int totalX = getTotal(wordCountX);
int totalY = getTotal(wordCountY);

wordCountX.forEach((k, v) -> {
    Integer count = wordCountY.get(k);
    // compare relative frequencies; cast to double, otherwise integer
    // division would truncate both sides to 0
    if (count != null && (double) count / totalY == (double) v / totalX) {
        similarities.put(k, (double) v / totalX);
    }
});
– nohnce
  • For the similarity, your solution is not correct because of the very structure of the word counts: two files almost always have different values for the same key. For example, txt1 has 100 words, 50% of them are "X"; txt2 has 200 words, 50% of them are "X"; I would say txt1 and txt2 are very similar with respect to "X". – Pleasant94 Apr 20 '19 at 12:02
  • I edited my answer. What kind of application are you making? – nohnce Apr 21 '19 at 16:59
  • The purpose is to get a percentage of how much two files have similar wordCounts. – Pleasant94 Apr 21 '19 at 19:36