
I have a scenario where I need to run two Hadoop jobs that compute n-gram statistics for two different corpora, and I need each n-gram (and its score) to be written to the same reducer in both jobs, so that later I can read the data locally and compare and contrast the two scores. For example, if job J1 executes one of its reducers on machine M and writes n-gram N locally, I would like job J2 to also write n-gram N to the same machine M.

I know how to compute n-gram statistics for a corpus (for reference, one can refer to this publication from Google). I have also defined a custom partitioner that hashes on the first two words of the n-gram. Now, how do I make sure that two different runs of the same program (on two different corpora) end up writing corresponding output to the same reducers?
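For what it's worth, a partitioner like the one described is already deterministic across runs, as long as both jobs use the same partitioner class and the same number of reducers: `String.hashCode()` is fixed by the Java language specification, so the same two-word prefix always yields the same partition index. A minimal sketch of such a partitioner (the class name and key/value types here are assumptions, not taken from the question):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical sketch: routes each n-gram by a hash of its first two words.
// Because String.hashCode() is stable across JVMs, two jobs configured with
// this partitioner and the same reducer count send a given n-gram to the
// same partition index in both runs.
public class NGramPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String[] words = key.toString().split(" ");
        String prefix = words.length >= 2 ? words[0] + " " + words[1] : words[0];
        // Mask the sign bit so the result is always non-negative.
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Note that a stable partition index only pins down *which reducer slot* gets the n-gram, not *which physical machine* runs that slot; the scheduler decides the latter.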

abhinavkulkarni

1 Answer


Check out MultipleInputs. By pointing two sibling mappers at the two sibling datasets within a single job, you can avoid running an identity map over the combined set before reducing.
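A driver under that approach might look like the sketch below. The mapper, reducer, and partitioner class names (`NGramMapperA`, `NGramMapperB`, `NGramReducer`, `NGramPartitioner`) are placeholders, not Hadoop classes; only `MultipleInputs`, `Job`, and the input/output format classes come from Hadoop itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NGramCompareDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ngram-compare");
        job.setJarByClass(NGramCompareDriver.class);

        // Feed both corpora into ONE job: each mapper can tag its records
        // with the corpus it came from, and a single set of reducers then
        // receives every n-gram from both sources in the same reduce call.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, NGramMapperA.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, NGramMapperB.class);

        job.setPartitionerClass(NGramPartitioner.class);
        job.setReducerClass(NGramReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With one job, co-location of the two scores for an n-gram is guaranteed by the shuffle itself, rather than arranged across two independent runs.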

Judge Mental