
I have a scenario where I need to run two Hadoop jobs that compute n-gram statistics for two different corpora, and I need each n-gram (and its score) to be written to the same reducer in both jobs, so that later I can read the data locally and compare and contrast the two scores. For example, if job J1 executes one of its reducers on machine M and writes n-gram N locally, I would like job J2 to also write n-gram N to the same machine M.

I know how to compute n-gram statistics for a corpus (for reference, one can refer to this publication from Google). I have also defined a custom partitioner that hashes on the first two words of the n-gram. Now, how do I make sure that two different runs of the same program (on two different corpora) end up writing corresponding output to the same reducers?
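For what it's worth, a partitioner like the one described is already deterministic across runs, as long as both jobs use the same partitioner class and the same number of reducers: `String.hashCode()` is fixed by the Java language specification, so the same two-word prefix always yields the same partition index. A minimal sketch of such a partitioner (the class name and key/value types here are assumptions, not taken from the question):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical sketch: routes each n-gram by a hash of its first two words.
// Because String.hashCode() is stable across JVMs, two jobs configured with
// this partitioner and the same reducer count send a given n-gram to the
// same partition index in both runs.
public class NGramPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String[] words = key.toString().split(" ");
        String prefix = words.length >= 2 ? words[0] + " " + words[1] : words[0];
        // Mask the sign bit so the result is always non-negative.
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Note that a stable partition index only pins down *which reducer slot* gets the n-gram, not *which physical machine* runs that slot; the scheduler decides the latter.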

abhinavkulkarni

1 Answer


Check out MultipleInputs. By pointing two sibling mappers at the two sibling datasets within a single job, you can avoid running an identity map over the combined set before reducing.
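A driver under that approach might look like the sketch below. The mapper, reducer, and partitioner class names (`NGramMapperA`, `NGramMapperB`, `NGramReducer`, `NGramPartitioner`) are placeholders, not Hadoop classes; only `MultipleInputs`, `Job`, and the input/output format classes come from Hadoop itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NGramCompareDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ngram-compare");
        job.setJarByClass(NGramCompareDriver.class);

        // Feed both corpora into ONE job: each mapper can tag its records
        // with the corpus it came from, and a single set of reducers then
        // receives every n-gram from both sources in the same reduce call.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, NGramMapperA.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, NGramMapperB.class);

        job.setPartitionerClass(NGramPartitioner.class);
        job.setReducerClass(NGramReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With one job, co-location of the two scores for an n-gram is guaranteed by the shuffle itself, rather than arranged across two independent runs.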

Judge Mental