I have a scenario where I need to run two Hadoop jobs that calculate n-gram statistics for two different corpora, and I need to make sure both jobs write each n-gram (and its score) to the same reducer, so that later I can read the data locally and compare the two scores from the two corpora. For example, if job J1 executes one of its reducers on machine M and writes n-gram N locally, I would like job J2 to also write n-gram N to the same machine M.
I know how to compute n-gram statistics for a single corpus (for reference, see this publication from Google). I have also defined a custom partitioner (hashing on the first two words of the n-gram). Now, how do I make sure that two different runs of the same program (on two different corpora) end up writing corresponding output to the same reducers?
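For concreteness, here is a sketch of the kind of prefix-based partition function I mean. The class and method names are illustrative, not my actual code; in the real job this logic would live in a subclass of `org.apache.hadoop.mapreduce.Partitioner<Text, IntWritable>` overriding `getPartition(key, value, numPartitions)`. I use CRC32 rather than `hashCode()` to make the cross-run determinism explicit:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative sketch: deterministic partitioning on the first two
// words of an n-gram, so the same prefix always maps to the same
// reducer index across separate job runs (given equal reducer counts).
public class NgramPartitioner {

    // Returns a reducer index in [0, numPartitions) for the given n-gram.
    public static int getPartition(String ngram, int numPartitions) {
        String[] words = ngram.trim().split("\\s+");
        String prefix = words.length >= 2
                ? words[0] + " " + words[1]   // hash only the first two words
                : words[0];
        // CRC32 over UTF-8 bytes is stable across JVMs, machines, and runs,
        // unlike identity-based hash codes. Its value is a non-negative long,
        // so the modulo is always a valid partition index.
        CRC32 crc = new CRC32();
        crc.update(prefix.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numPartitions);
    }
}
```

With this scheme, "new york times" and "new york city" land on the same reducer index because they share the prefix "new york" -- but note this only pins down the reducer *number*, not which physical machine Hadoop schedules that reducer on, which is the part of my question I'm unsure about.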