
I have my own implementation of the Expectation-Maximization (EM) algorithm, based on this paper, and I would like to compare its performance against another implementation. For the tests, I am using k centroids with 1 GB of text data, and I am measuring only the time it takes to compute the new centroids in one iteration. I tried an EM implementation in R, but I couldn't get a measurement: the result is plotted in a graph, and the run gets stuck with a large amount of text data. I was following the examples in here.

Does anybody know of an EM implementation whose runtime can be measured, or know how to do this with R?
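
A minimal sketch of the kind of measurement I am after, assuming R's mclust package (an assumption on my side; the file name, G = 5, and the model name are placeholders). Mclust() only plots when you explicitly call plot() on the result, and emControl() can cap the number of EM iterations:

```r
# Hedged sketch: time a single EM pass with R's mclust package.
library(mclust)

# Reading 1 GB with read.table is itself slow; data.table::fread is faster.
X <- as.matrix(read.table("data.txt"))     # placeholder path

t <- system.time(
  fit <- Mclust(X, G = 5,                  # G = number of clusters
                modelNames = "VVI",        # diagonal-covariance model
                control = emControl(itmax = c(1, 1)))  # cap at 1 iteration
)
print(t["elapsed"])                        # seconds; nothing is plotted
```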

Walter white
  • What do you mean, compare? In terms of time taken? Parameters will almost certainly be different since EM is sensitive to initialisation (only finds local optima). The standard way to check is to simulate according to the model and try to recover its parameters. – Ben Allison Jul 15 '14 at 12:52
  • Hello Ben. Yes, time performance is what I want to measure. – Walter white Jul 15 '14 at 12:58

1 Answer


Fair benchmarking of EM is hard. Very hard.

  1. The initialization usually involves randomness and can differ substantially between implementations. As far as I know, the R implementation by default uses hierarchical clustering to find the initial clusters, which comes at O(n^2) memory and most likely O(n^3) runtime cost. In my benchmarks, R would run out of memory because of this. I assume there is a way to specify initial cluster centers/models; a random-objects initialization will of course be much faster. In practice, k-means++ is probably a good way to choose the initial centers (see the sketch after this list).

  2. In theory, EM never terminates; at some point it just does not change much anymore, so you set a threshold and stop there. However, the exact definition of the stopping threshold varies between implementations.

  3. There exist all kinds of model variations. A method using only fuzzy assignments, such as fuzzy c-means, will of course be much faster than an implementation using multivariate Gaussian mixture models with full covariance matrices, in particular with higher dimensionality. Covariance matrices also need O(k * d^2) memory, inverting them takes O(k * d^3) time, and thus they are clearly not appropriate for text data.

  4. The data may or may not be appropriate. If you run EM on a data set that actually has Gaussian clusters, it will usually work much better than on a data set that does not provide a good fit at all. When there is no good fit, you will see a high variance in runtime even with the same implementation.
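
To illustrate point 1, here is a minimal k-means++ seeding sketch in R (kmeanspp_init is a hypothetical helper, not from any package): the first center is drawn uniformly at random, and every further center is drawn with probability proportional to the squared distance to the nearest center chosen so far.

```r
kmeanspp_init <- function(X, k) {
  n <- nrow(X); d <- ncol(X)
  centers <- matrix(NA_real_, nrow = k, ncol = d)
  centers[1, ] <- X[sample.int(n, 1), ]
  # d2[i] = squared distance of point i to its nearest chosen center
  d2 <- rowSums((X - matrix(centers[1, ], n, d, byrow = TRUE))^2)
  for (j in seq_len(k)[-1]) {
    centers[j, ] <- X[sample.int(n, 1, prob = d2), ]
    d2 <- pmin(d2, rowSums((X - matrix(centers[j, ], n, d, byrow = TRUE))^2))
  }
  centers
}
```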

For a start, try running your own algorithm several times with different initializations, and check the runtime for variance. How large is the variance compared to the total runtime?
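
A sketch of that check, with my_em() standing in as a hypothetical name for your own implementation:

```r
# Run the same fit 10 times with fresh random initializations and
# compare the spread of the elapsed times to their mean.
elapsed <- replicate(10, system.time(my_em(X, k = 5))["elapsed"])
cat("mean:", mean(elapsed), "sd:", sd(elapsed),
    "relative sd:", sd(elapsed) / mean(elapsed), "\n")
```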

You can try benchmarking against the EM implementation in ELKI. But I doubt it will work with sparse data such as text: that data just is not Gaussian, so it is not a proper benchmark, and most likely ELKI will not be able to process it at all. This is expected, and can be explained from theory. Try to find data sets that are dense and can be expected to have multiple Gaussian clusters (sorry, I can't give you many recommendations here; the classic Iris and Old Faithful data sets are too small to be useful for benchmarking).
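
For example, you could simulate such a data set in R (the sizes below are purely illustrative) and feed the resulting file to ELKI or to your own implementation:

```r
# Generate k well-separated Gaussian clusters in d dimensions.
set.seed(1)
k <- 5; d <- 10; n_per <- 100000
X <- do.call(rbind, lapply(seq_len(k), function(j) {
  mu <- rnorm(d, mean = 10 * j)   # cluster mean, shifted per cluster
  matrix(rnorm(n_per * d, mean = mu), ncol = d, byrow = TRUE)
}))
write.table(X, "gaussians.csv", row.names = FALSE, col.names = FALSE)
```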

Erich Schubert
  • Hello Erich, thanks a lot for your very nice explanation. At the moment I am just testing the speed of my implementation against another framework's implementation for one iteration. So for 1, I just chose some random clusters; for 2, I am stopping after one iteration, not by a threshold; for 3, my implementation of EM does not use a covariance matrix, since I assume all attributes (dimensions) are independent. Do all the EM implementations in R make use of a covariance matrix? If so, then I do not think I can use them for comparison. I'll check EM in ELKI now. Thanks! – Walter white Jul 16 '14 at 08:24