OK, you said you have experience implementing the EM algorithm. It's easy to explain.
EM algorithm
As you know, the EM algorithm is an iterative method for finding maximum likelihood estimates. One iteration consists of two steps: Expectation (E-step) and Maximization (M-step).
In the E-step, the likelihood of each sample is calculated under the previous model. If n is the number of samples, we get n likelihood values.
Each likelihood calculation is independent of the others, so this step can be parallelized in a multi-processor environment.
Assume we have 4 CPUs on a single machine: each CPU can calculate n/4 likelihoods, making this step roughly 4 times faster (ignoring I/O time).
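The single-machine parallel E-step above can be sketched with Python's `multiprocessing`. The Gaussian model and its parameters here are illustrative assumptions, not part of the original explanation:

```python
# Minimal sketch: parallel E-step on one machine with a process pool.
# The single-Gaussian "previous model" (MU, SIGMA) is an assumption.
import math
from multiprocessing import Pool

MU, SIGMA = 0.0, 1.0  # assumed previous-model parameters

def likelihood(x):
    # Gaussian density of sample x under the previous model.
    return (math.exp(-(x - MU) ** 2 / (2 * SIGMA ** 2))
            / math.sqrt(2 * math.pi * SIGMA ** 2))

if __name__ == "__main__":
    samples = [0.5, -1.2, 0.3, 2.0]
    with Pool(4) as pool:          # 4 CPUs -> roughly n/4 samples per worker
        likelihoods = pool.map(likelihood, samples)
    print(likelihoods)
```

`Pool.map` splits the sample list across the workers, which is exactly the n/4-per-CPU division described above.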
In the M-step, the new model is derived from those likelihoods.
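To make the two steps concrete, here is a minimal sketch of one full EM iteration for a two-component 1-D Gaussian mixture. The mixture model, data, and starting parameters are assumptions chosen for illustration:

```python
# One EM iteration for a 2-component 1-D Gaussian mixture (illustrative).
import math

def gauss(x, mu, sigma):
    # Gaussian density of x with mean mu and std-dev sigma.
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / math.sqrt(2 * math.pi * sigma ** 2))

def em_step(samples, params):
    # params: list of (weight, mu, sigma), one tuple per component.
    # E-step: responsibility of each component for each sample.
    resp = []
    for x in samples:
        lk = [w * gauss(x, mu, s) for (w, mu, s) in params]
        total = sum(lk)
        resp.append([v / total for v in lk])
    # M-step: re-estimate weight, mean, and variance per component.
    n = len(samples)
    new_params = []
    for k in range(len(params)):
        nk = sum(r[k] for r in resp)
        w = nk / n
        mu = sum(r[k] * x for r, x in zip(resp, samples)) / nk
        var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, samples)) / nk
        new_params.append((w, mu, max(math.sqrt(var), 1e-6)))
    return new_params
```

Running `em_step` repeatedly with its own output is exactly the iteration discussed later.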
EM on MapReduce
- The E-step can be extended to mapper tasks on MapReduce because the samples are independent of each other.
  - input
    - key: anything
    - value: a sample
  - output
    - key: anything
    - value: likelihood of the input sample
- The M-step can be extended to a reducer task (this could use multiple reducers, but for simplicity I recommend a single reducer).
  - input
    - key: anything
    - value: a likelihood
  - output
    - key: anything
    - value: the next model
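The mapper/reducer pair above can be sketched in Hadoop Streaming style, where each task reads and writes tab-separated text lines. The single-Gaussian model and the line formats are assumptions; the mapper also emits the sample itself alongside its likelihood so the reducer can re-estimate the model:

```python
# Hadoop-Streaming-style sketch of the E-step mapper and M-step reducer.
# Model (one Gaussian) and text formats are assumptions for illustration.
import math

def map_line(line, mu, sigma):
    # Input value: one sample per line; the input key is irrelevant.
    x = float(line.strip())
    lk = (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
          / math.sqrt(2 * math.pi * sigma ** 2))
    # Constant key so every record reaches the single reducer;
    # value carries the sample and its likelihood.
    return "model\t%f\t%f" % (x, lk)

def reduce_lines(lines):
    # Collect all samples and derive the next model (mean, std-dev).
    xs = []
    for line in lines:
        _, x, _ = line.split("\t")
        xs.append(float(x))
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return "model\t%f\t%f" % (mu, math.sqrt(var))
```

In a real streaming job, `map_line` would be wrapped in a script reading `sys.stdin` and printing each output line, and likewise for `reduce_lines`.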
data representation
- the training samples are located in a directory on HDFS
  - this directory will be the input of the MapReduce job
- the previous model is also stored on HDFS, but in a different directory from the training samples
- use the distributed cache to let the mappers know where the model is
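With Hadoop Streaming, a file shipped through the distributed cache (e.g. via the `-files` generic option) appears in the task's working directory, so a mapper can read it like a local file. The model file name and its one-line format here are assumptions:

```python
# Sketch: a streaming mapper loading the previous model from the
# distributed cache. Assumes the job was submitted with something like
#   hadoop jar hadoop-streaming.jar -files <model path> ...
# so the model file shows up in the task's working directory.

def load_model(path="model.txt"):
    # Assumed format: a single whitespace-separated "mu sigma" line.
    with open(path) as f:
        mu, sigma = f.read().split()
    return float(mu), float(sigma)
```

The mapper would call `load_model()` once at startup, then compute a likelihood for every input sample using those parameters.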
iteration
One MapReduce job is analogous to one iteration of the EM algorithm, so you need to run MapReduce jobs repeatedly until the model converges.
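The outer loop can be sketched as a small driver. `run_em_job` is a hypothetical function (not from the original) that submits one MapReduce job and returns the new model together with its log-likelihood; convergence is checked by how little the log-likelihood improves:

```python
# Sketch of the iteration driver. run_em_job is a hypothetical callable:
# it runs one MapReduce job (E-step + M-step) and returns
# (new_model, log_likelihood).
def run_until_converged(run_em_job, model, tol=1e-4, max_iter=100):
    prev_ll = float("-inf")
    for _ in range(max_iter):
        model, ll = run_em_job(model)   # one MapReduce job = one EM iteration
        if abs(ll - prev_ll) < tol:     # log-likelihood stopped improving
            break
        prev_ll = ll
    return model
```

In practice the driver would also copy the reducer's output model into the distributed cache location before submitting the next job.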
I have only explained this briefly; you will face many problems while implementing it.
I hope this helps.