-1

I am using a dataset to make 2 clusters using EM and then K-means. I already have implemented K-means and EM Algorithm separately. Now I am trying to derive k-means from my implementation of EM Algorithm to do clustering. I have 2 questions in mind.

  1. K-means is viewed as a special case of the generalized EM algorithm. But what assumptions do we need to make to derive k-means from EM algorithm?

  2. Also, in coding perspective, what changes do we need to make in implementation of EM algorithm so that it starts behaving exactly like k-means algorithm? I assume that we need to share same co-variance matrix between both clusters. Is that right to assume?

This is what I am getting using k-means.

Clustering K-means

This is clustering using EM.

Clustering EM

TylerH
  • 20,799
  • 66
  • 75
  • 101
Haroon S.
  • 2,533
  • 6
  • 20
  • 39
  • I’m voting to close this question because it is not about programming but about data science/statistics. – TylerH Sep 10 '21 at 21:47

1 Answers1

1

K-means and EM clustering are very related, but not exactly the same. Two changes to EM would make it very, very similar to K-means:

  1. EM uses multi-dimensional distributions. Constrain the standard deviations of the distributions to be the same in all dimensions.
  2. Revise the output of EM to produce only the most probably cluster. EM produces soft-clusters (a separate probability that a point is in each cluster), whereas K-means produces hard clusters (a single cluster choice).

I don't know how these "fixes" translate into your particular code.

I'm not 100% sure that under all circumstances, such an EM approach would converge to exactly the same clusters as K-means. I am confident that the two methods would produce very comparable results under most circumstances.

Nimantha
  • 6,405
  • 6
  • 28
  • 69
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thank you for your response. Got it. If we use same Standard deviations (or co-variance) across all dimensions, it produces results similar to k-means. – Haroon S. Dec 11 '16 at 13:18
  • @SalA. . . . Yes. This happens to be a topic that I discuss in the clustering chapter of the third edition of "Data Mining Techniques". The "cross-section" of the clusters for EM are ellipses, and for K-means are circles. So enforcing the equality of the variances constrains the ellipses to be circles. – Gordon Linoff Dec 11 '16 at 13:44