3

I am trying to implement a MapReduce program that does word counts from 2 files, and then compares the word counts from these files to see what the most common words are...

I noticed that after doing the word count for file 1, the results go into the directory "/data/output1/", and there are 3 files inside:

- "_SUCCESS"
- "_logs"
- "part-r-00000"

The "part-r-00000" file contains the results from the file1 word count. How do I make my program read that particular file if the file name is generated at run time, without me knowing the filename beforehand?

Also, for the (key, value) pairs, I have added an identifier to the "value", so that I can tell which file and count each word belongs to.

public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
    // Tag the value so the reducer can tell this count came from file 2
    Text newValue = new Text();
    newValue.set(value.toString() + "_f2");
    context.write(key, newValue);
}

At a later stage, how do I "remove" the identifier so that I can get just the "value"?

Jon
  • 71
  • 1
  • 1
  • 5

2 Answers

3

Just point your next MR job at /data/output1/. It will read all three files as input, but _SUCCESS and _logs are both empty, so they'll have no effect on your program. They're only written so that you can tell that the MR job writing to the directory finished successfully.
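For example, here is a minimal driver sketch for the follow-up job. The class names WordCompareDriver, WordCompareMapper and WordCompareReducer are placeholders for classes you would write yourself; only the input/output paths come from the question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCompareDriver {   // hypothetical driver for the follow-up job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word compare");
        job.setJarByClass(WordCompareDriver.class);
        job.setMapperClass(WordCompareMapper.class);      // placeholder: your compare mapper
        job.setReducerClass(WordCompareReducer.class);    // placeholder: your compare reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Point the input at the first job's output *directory*; the generated
        // part-r-00000 file name never needs to be known in advance.
        FileInputFormat.addInputPath(job, new Path("/data/output1/"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output2/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}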

Chris Gerken
  • 16,221
  • 6
  • 44
  • 59
1

If you want to implement word count from 2 different files, you can use the MultipleInputs class, which lets you run a MapReduce program on both files simultaneously. Refer to this link for an example of how to implement it: http://www.hadooptpoint.com/hadoop-multiple-input-files-example-in-mapreduce/

There you define a separate mapper for each input file, so you can add a different identifier in each mapper. When the output reaches the reducer, it can identify which mapper each record came from and process it accordingly. You can remove the identifier the same way you added it: for example, if you add the prefix @ in mapper 1's output and # in mapper 2's output, then in the reducer you can tell which mapper the input came from by the prefix, and simply strip it off (see the sketch below).
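A minimal sketch of that approach, assuming a hypothetical driver class TwoFileWordCount and placeholder input paths /data/file1.txt and /data/file2.txt; the prefixes @ and # are just markers, any character not appearing in your data works:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoFileWordCount {                          // hypothetical driver class

    // Mapper for file 1: tag each word with the "@" prefix
    public static class File1Mapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), new Text("@1"));
                }
            }
        }
    }

    // Mapper for file 2: tag each word with the "#" prefix
    public static class File2Mapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), new Text("#1"));
                }
            }
        }
    }

    // Reducer: strip the prefix to recover the value, and keep a separate count per file
    public static class CompareReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int countFile1 = 0;
            int countFile2 = 0;
            for (Text value : values) {
                String tagged = value.toString();
                int n = Integer.parseInt(tagged.substring(1));   // remove the identifier
                if (tagged.charAt(0) == '@') {
                    countFile1 += n;
                } else {
                    countFile2 += n;
                }
            }
            context.write(key, new Text(countFile1 + "\t" + countFile2));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "two file word count");
        job.setJarByClass(TwoFileWordCount.class);
        job.setReducerClass(CompareReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // One mapper per input file, registered through MultipleInputs
        MultipleInputs.addInputPath(job, new Path("/data/file1.txt"), TextInputFormat.class, File1Mapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/file2.txt"), TextInputFormat.class, File2Mapper.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/output/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}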

Aside from that, regarding your other query about reading the output file: it is simple, because the output file names always follow a pattern. If you are using Hadoop 1.x, the results are stored in files named part-00000 and onwards, and with Hadoop 2.x the results are stored in part-r-00000; if there is further output to be written to the same output path, it goes into part-r-00001 and onwards. The other two files that are generated have no significance for the developer; they serve more as markers for Hadoop itself.
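If you do need to read the part files directly from your own code rather than feeding the whole directory to the next job, one option (not from the original answer, just a sketch using the standard HDFS FileSystem API and the /data/output1/ path from the question) is to glob for them:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWordCountOutput {                      // hypothetical helper class
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Glob for the reducer output files; the exact names (part-r-00000,
        // part-r-00001, ...) do not need to be known in advance.
        for (FileStatus status : fs.globStatus(new Path("/data/output1/part-r-*"))) {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);   // each line is "word<TAB>count"
                }
            }
        }
    }
}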

Hope this solves your query. Please comment if the answer is not clear.

siddhartha jain
  • 1,006
  • 10
  • 16