I have been learning about the EM algorithm from Andrew Ng's Stanford notes (http://cs229.stanford.edu/notes/cs229-notes7b.pdf). I am now trying to understand an implementation of EM in a Python library, applied to the Old Faithful data set (https://mixem.readthedocs.io/en/latest/examples/old_faithful.html). This data set has 272 observations of two variables: the eruption time and the waiting time, i.e. the time between eruptions. I have a couple of questions about the following lines of code:

weights, distributions, ll = mixem.em(np.array(data), [
    mixem.distribution.MultivariateNormalDistribution(np.array((2, 50)), np.identity(2)),
    mixem.distribution.MultivariateNormalDistribution(np.array((4, 80)), np.identity(2)),
])

This relates to the following part of the notes: [image not available]. My questions are:

  • Why do I have to create two arrays for mu, and why those dimensions? I suppose that in the first one, (2, 50), the 2 refers to the number of variables (eruption and waiting), but why put 50 as the second dimension? Also, why do I need the array (4, 80) and the two identity arrays of dimension 2?
Little

1 Answer

You are trying to cluster the data points in your problem. So how do you do that when you don't know how many clusters there are, or which points belong to which cluster? This is where EM comes in.

You make a few assumptions to solve the clustering problem. You assume that there are 2 clusters. Each point has two dimensions (eruption, waiting), so you need a 2D Gaussian to describe a cluster of such points. Since you assumed 2 clusters, you create 2 multivariate Gaussians.

In the example, you create Gaussian1 with mean vector (2, 50) and the identity covariance matrix (given by np.identity(2)). Similarly, you create Gaussian2 with mean vector (4, 80) and the same identity covariance.

Why the values 2 and 50 for the first Gaussian? Note that (2, 50) is not a shape: it is the initial mean vector, with one entry per variable. So what you are saying is that the initial mean of the eruption coordinate is 2 and the initial mean of the waiting coordinate is 50. These starting values are somewhat arbitrary, but you normally choose something reasonable. If you look at the dataset, you will find that these are reasonable initial estimates.
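To make the shapes concrete, here is a small check using only NumPy. The variable names mu1, mu2 and sigma are just illustrative labels for the arguments in your snippet, not anything defined by mixem.

```python
import numpy as np

# Initial mean vectors: one entry per variable (eruption, waiting).
# These are values, not dimensions.
mu1 = np.array((2, 50))   # first cluster: eruption ~ 2, waiting ~ 50
mu2 = np.array((4, 80))   # second cluster: eruption ~ 4, waiting ~ 80

# Initial covariance matrix: 2x2 because there are two variables.
sigma = np.identity(2)

print(mu1.shape)    # (2,)
print(sigma.shape)  # (2, 2)
```

So the "2" that matches the number of variables is the length of each mean vector and the size of the identity matrix, not the first entry of (2, 50).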

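The rest is standard EM: alternate the E-step (compute how responsible each Gaussian is for each point) and the M-step (re-estimate the weights, means and covariances) until the log-likelihood stops improving.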
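As a rough illustration of what that standard EM loop does, here is a minimal NumPy sketch for a Gaussian mixture. It is not mixem's implementation: the function names log_gaussian and em_gmm are hypothetical, the equal starting weights and the fixed number of iterations are assumptions, and the data are synthetic points generated to loosely resemble two clusters, not the real Old Faithful measurements.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log-density of a multivariate normal at each row of x."""
    d = x.shape[1]
    diff = x - mean
    # Mahalanobis distance via a linear solve instead of an explicit inverse.
    solved = np.linalg.solve(cov, diff.T).T
    maha = np.sum(diff * solved, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def em_gmm(x, means, covs, weights, n_iter=50):
    """Plain EM for a Gaussian mixture, run for a fixed number of iterations."""
    means = [np.asarray(m, dtype=float) for m in means]
    covs = [np.asarray(c, dtype=float) for c in covs]
    weights = np.asarray(weights, dtype=float)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point,
        # computed in log-space to avoid numerical underflow.
        log_p = np.column_stack([
            np.log(w) + log_gaussian(x, m, c)
            for w, m, c in zip(weights, means, covs)
        ])
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means and covariances.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = [resp[:, k] @ x / nk[k] for k in range(len(nk))]
        covs = [
            ((x - means[k]).T * resp[:, k]) @ (x - means[k]) / nk[k]
            for k in range(len(nk))
        ]
    return weights, means, covs

# Synthetic stand-in data (NOT the real data set): two blobs of 2D points.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(2.0, 55.0), scale=(0.3, 6.0), size=(100, 2))
blob_b = rng.normal(loc=(4.3, 80.0), scale=(0.4, 6.0), size=(170, 2))
x = np.vstack([blob_a, blob_b])

# Same initial guesses as in the question: means (2, 50) and (4, 80),
# identity covariances, and (an assumption here) equal starting weights.
weights, means, covs = em_gmm(
    x,
    means=[(2, 50), (4, 80)],
    covs=[np.identity(2), np.identity(2)],
    weights=[0.5, 0.5],
)
print(weights)
print(means)
```

The initial means only need to be in the right neighbourhood. After the first M-step the identity covariances are replaced by estimates from the data, which is why a crude starting guess is usually good enough.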

dgumo