How to get the document ID in a Mapper with MultipleInputs

Question

I'm coding TF-IDF using Hadoop in Java (no Pig or Hive) for learning purposes. I'm going to split it into three rounds: word count, word count per doc, and finally docCount per word.

I believe the main chain of jobs is correct, but I have a question right at the beginning: in my first round, how do I get the document ID inside the mapper? I have the following multiple inputs:

    Path doc1 = new Path(System.getProperty("user.dir") + "/1.dat");
    Path doc2 = new Path(System.getProperty("user.dir") + "/2.dat");
    Path doc3 = new Path(System.getProperty("user.dir") + "/3.dat");
    MultipleInputs.addInputPath(job1, doc1, TextInputFormat.class, MapperRoundOne.class);
    MultipleInputs.addInputPath(job1, doc2, TextInputFormat.class, MapperRoundOne.class);
    MultipleInputs.addInputPath(job1, doc3, TextInputFormat.class, MapperRoundOne.class);

Round 1:

    Mapper  {docID => [words]}            --> {[word, docID] => 1}
    Reducer {[word, docID] => [1,1,...]}  --> {[word, docID] => wordCount}
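
To make the Round 1 data flow concrete, here is a minimal in-memory sketch in plain Java, without Hadoop. The class name `RoundOneSketch`, the `word@docID` key format, the lowercasing, and the `\W+` tokenization are all assumptions made for illustration. They are not part of the actual job.

```java
import java.util.Map;
import java.util.TreeMap;

final class RoundOneSketch {
    // Simulates Round 1 in memory: the "map" step emits ([word, docID], 1)
    // and the "reduce" step sums the ones. The composite key is encoded as
    // "word@docID", which is a hypothetical format chosen for this sketch.
    static Map<String, Integer> countWords(String docId, String text,
                                           Map<String, Integer> counts) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            counts.merge(word + "@" + docId, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        countWords("1", "to be or not to be", counts);
        countWords("2", "to do", counts);
        System.out.println(counts);
    }
}
```

In the real job the docID would not be passed in by hand. It has to be discovered inside the mapper, which is what the question is about.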

I could obviously assign each input its own mapper and hard-code the docID, but that is not generic. How can I do this generically?

score 2 · Accepted Answer · edited May 23 '17 at 10:25


You can get the path of the file being read, inside the mapper, from the input split:

    String name = ((FileSplit) context.getInputSplit()).getPath().toString();
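
The call above returns the full path string rather than a bare document ID. A minimal sketch of turning that string into an ID, assuming the ID is simply the file name without its extension (so `1.dat` becomes `1`), might look like this. The class `DocIdExample` and the method `docIdFromPath` are hypothetical names, and the sample paths are made up for illustration. Note also that with `MultipleInputs` the split handed to the mapper may be wrapped in another split type, in which case the direct cast to `FileSplit` can fail; the comment below links to the answer that worked for the asker.

```java
final class DocIdExample {
    // Hypothetical helper: derives a document ID from a path string such as
    // the one produced in the mapper by
    //   ((FileSplit) context.getInputSplit()).getPath().toString()
    // Assumption: the ID is the file name with its extension removed.
    static String docIdFromPath(String path) {
        // Keep only the last path segment, e.g. "1.dat".
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        // Drop the extension, unless the name starts with the dot.
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }

    public static void main(String[] args) {
        System.out.println(docIdFromPath("file:/home/user/1.dat"));
        System.out.println(docIdFromPath("hdfs://namenode:8020/data/2.dat"));
    }
}
```

Computing the ID once, for example in the mapper's `setup` method, avoids repeating the string work for every input record.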

See also the question "Hadoop searching words from one file in another file" and this blog post:

http://bigdataspeak.wordpress.com/2013/03/24/hadoop-how-to-get-the-file-path-of-the-input-record-being-read-in-mapper/

HTH


answered May 20 '13 at 13:43

Eswara Reddy Adapa


Thanks! What really worked for me was this answer: http://stackoverflow.com/a/11130420/363855 – Pedro Dusso May 20 '13 at 15:27
