OK, you said you have experience implementing the EM algorithm. It's easy to explain.
EM algorithm
As you know, the EM algorithm is an iterative method for finding maximum likelihood estimates. One iteration consists of two steps: Expectation (E-step) and Maximization (M-step).
In the E-step, the likelihood of each sample is calculated under the previous model. If n is the number of samples, we get n likelihood values.
Each likelihood calculation is independent of the others, so this step can be parallelized in a multi-processor environment.
Assume we have 4 CPUs on a single machine: each CPU can calculate n/4 likelihoods, making this step roughly 4 times faster (ignoring I/O time).
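The single-machine parallel E-step above can be sketched with Python's `multiprocessing`. The Gaussian model and its parameters here are illustrative assumptions, not part of the original explanation:

```python
# Minimal sketch: parallel E-step on one machine with a process pool.
# The single-Gaussian "previous model" (MU, SIGMA) is an assumption.
import math
from multiprocessing import Pool

MU, SIGMA = 0.0, 1.0  # assumed previous-model parameters

def likelihood(x):
    # Gaussian density of sample x under the previous model.
    return (math.exp(-(x - MU) ** 2 / (2 * SIGMA ** 2))
            / math.sqrt(2 * math.pi * SIGMA ** 2))

if __name__ == "__main__":
    samples = [0.5, -1.2, 0.3, 2.0]
    with Pool(4) as pool:          # 4 CPUs -> roughly n/4 samples per worker
        likelihoods = pool.map(likelihood, samples)
    print(likelihoods)
```

`Pool.map` splits the sample list across the workers, which is exactly the n/4-per-CPU division described above.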
In the M-step, the new model is derived from those likelihoods.
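To make the two steps concrete, here is a minimal sketch of one full EM iteration for a two-component 1-D Gaussian mixture. The mixture model, data, and starting parameters are assumptions chosen for illustration:

```python
# One EM iteration for a 2-component 1-D Gaussian mixture (illustrative).
import math

def gauss(x, mu, sigma):
    # Gaussian density of x with mean mu and std-dev sigma.
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / math.sqrt(2 * math.pi * sigma ** 2))

def em_step(samples, params):
    # params: list of (weight, mu, sigma), one tuple per component.
    # E-step: responsibility of each component for each sample.
    resp = []
    for x in samples:
        lk = [w * gauss(x, mu, s) for (w, mu, s) in params]
        total = sum(lk)
        resp.append([v / total for v in lk])
    # M-step: re-estimate weight, mean, and variance per component.
    n = len(samples)
    new_params = []
    for k in range(len(params)):
        nk = sum(r[k] for r in resp)
        w = nk / n
        mu = sum(r[k] * x for r, x in zip(resp, samples)) / nk
        var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, samples)) / nk
        new_params.append((w, mu, max(math.sqrt(var), 1e-6)))
    return new_params
```

Running `em_step` repeatedly with its own output is exactly the iteration discussed later.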
EM on MapReduce
- The E-step can be extended to mapper tasks on MapReduce because the samples are independent of each other.
  - input
    - key: anything
    - value: a sample
  - output
    - key: anything
    - value: likelihood of the input sample
- The M-step can be extended to a reducer task (this could use multiple reducers, but for simplicity I recommend a single reducer).
  - input
    - key: anything
    - value: a likelihood
  - output
    - key: anything
    - value: the next model
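The mapper/reducer pair above can be sketched in Hadoop Streaming style, where each task reads and writes tab-separated text lines. The single-Gaussian model and the line formats are assumptions; the mapper also emits the sample itself alongside its likelihood so the reducer can re-estimate the model:

```python
# Hadoop-Streaming-style sketch of the E-step mapper and M-step reducer.
# Model (one Gaussian) and text formats are assumptions for illustration.
import math

def map_line(line, mu, sigma):
    # Input value: one sample per line; the input key is irrelevant.
    x = float(line.strip())
    lk = (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
          / math.sqrt(2 * math.pi * sigma ** 2))
    # Constant key so every record reaches the single reducer;
    # value carries the sample and its likelihood.
    return "model\t%f\t%f" % (x, lk)

def reduce_lines(lines):
    # Collect all samples and derive the next model (mean, std-dev).
    xs = []
    for line in lines:
        _, x, _ = line.split("\t")
        xs.append(float(x))
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return "model\t%f\t%f" % (mu, math.sqrt(var))
```

In a real streaming job, `map_line` would be wrapped in a script reading `sys.stdin` and printing each output line, and likewise for `reduce_lines`.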
data representation
- the training samples are located in a directory on HDFS
  - this directory will be the input of the MapReduce job
- the previous model is also stored on HDFS, but in a different directory from the training samples
- use the distributed cache to let the mappers know where the model is
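With Hadoop Streaming, a file shipped through the distributed cache (e.g. via the `-files` generic option) appears in the task's working directory, so a mapper can read it like a local file. The model file name and its one-line format here are assumptions:

```python
# Sketch: a streaming mapper loading the previous model from the
# distributed cache. Assumes the job was submitted with something like
#   hadoop jar hadoop-streaming.jar -files <model path> ...
# so the model file shows up in the task's working directory.

def load_model(path="model.txt"):
    # Assumed format: a single whitespace-separated "mu sigma" line.
    with open(path) as f:
        mu, sigma = f.read().split()
    return float(mu), float(sigma)
```

The mapper would call `load_model()` once at startup, then compute a likelihood for every input sample using those parameters.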
iteration
One MapReduce job is analogous to one iteration of the EM algorithm, so you need to run MapReduce jobs repeatedly until the model converges.
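The outer loop can be sketched as a small driver. `run_em_job` is a hypothetical function (not from the original) that submits one MapReduce job and returns the new model together with its log-likelihood; convergence is checked by how little the log-likelihood improves:

```python
# Sketch of the iteration driver. run_em_job is a hypothetical callable:
# it runs one MapReduce job (E-step + M-step) and returns
# (new_model, log_likelihood).
def run_until_converged(run_em_job, model, tol=1e-4, max_iter=100):
    prev_ll = float("-inf")
    for _ in range(max_iter):
        model, ll = run_em_job(model)   # one MapReduce job = one EM iteration
        if abs(ll - prev_ll) < tol:     # log-likelihood stopped improving
            break
        prev_ll = ll
    return model
```

In practice the driver would also copy the reducer's output model into the distributed cache location before submitting the next job.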
I have only explained this briefly; you will face many problems while implementing it.
I hope this helps.