1

I am trying to implement Expectation Maximization algorithm(Gaussian Mixture Model) on a data set data=[[x,y],...]. I am using mv_norm.pdf(data, mean,cov) function to calculate cluster responsibilities. But after calculating new values of covariance (cov matrix) after 6-7 iterations, cov matrix is becoming singular i.e determinant of cov is 0 (very small value) and hence it is giving errors

ValueError: the input matrix must be positive semidefinite

and

raise np.linalg.LinAlgError('singular matrix')

Can someone suggest any solution for this?

#E-step: Compute cluster responsibilities, given cluster parameters
def calculate_cluster_responsibility(data,centroids,cov_m):
    pdfmain=[[] for i in range(0,len(data))]
    for i in range(0,len(data)):
        sum1=0
        pdfeach=[[] for m in range(0,len(centroids))]
        pdfeach[0]=1/3.*mv_norm.pdf(data[i], mean=centroids[0],cov=[[cov_m[0][0][0],cov_m[0][0][1]],[cov_m[0][1][0],cov_m[0][1][1]]])
        pdfeach[1]=1/3.*mv_norm.pdf(data[i], mean=centroids[1],cov=[[cov_m[1][0][0],cov_m[1][0][1]],[cov_m[1][1][0],cov_m[0][1][1]]])
        pdfeach[2]=1/3.*mv_norm.pdf(data[i], mean=centroids[2],cov=[[cov_m[2][0][0],cov_m[2][0][1]],[cov_m[2][1][0],cov_m[2][1][1]]])
        sum1+=pdfeach[0]+pdfeach[1]+pdfeach[2]
        pdfeach[:] = [x / sum1 for x in pdfeach]
        pdfmain[i]=pdfeach

    global old_pdfmain
    if old_pdfmain==pdfmain:
        return
    old_pdfmain=copy.deepcopy(pdfmain)
    softcounts=[sum(i) for i in zip(*pdfmain)]
    calculate_cluster_weights(data,centroids,pdfmain,soft counts)

Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.

Alex Riley
  • 169,130
  • 45
  • 262
  • 238
Madhura Raut
  • 51
  • 1
  • 5

1 Answers1

1

Can someone suggest any solution for this?

The problem is your data lies in some manifold of dimension strictly smaller than the input data. In other words for example your data lies on a circle, while you have 3 dimensional data. As a consequence when your method tries to estimate 3 dimensional ellipsoid (covariance matrix) that fits your data - it fails since the optimal one is a 2 dimensional ellipse (third dimension is 0).

How to fix it? You will need some regularization of your covariance estimator. There are many possible solutions, all in M step, not E step, the problem is with computing covariance:

  • Simple solution, instead of doing something like cov = np.cov(X) add some regularizing term, like cov = np.cov(X) + eps * np.identity(X.shape[1]) with small eps
  • Use nicer estimator like LedoitWolf estimator from scikit-learn.

Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.

This makes no sense, covariance matrix values has nothing to do with amount of clusters. You can initialize it with anything more or less resonable.

lejlot
  • 64,777
  • 8
  • 131
  • 164