I am trying to perform k-means clustering over 128-dimensional points (descriptors of interest points in an image). When I use the scipy.cluster.vq.kmeans2 function, I sometimes get the following error:

  File "main.py", line 21, in level_routine
    current.centroids, current.labels = cluster.vq.kmeans2(current.descriptors, k)
  File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 706, in kmeans2
    clusters = init(data, k)
  File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 593, in _krandinit
    return init_rankn(data)
  File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 586, in init_rankn
    x = np.dot(x, np.linalg.cholesky(cov).T) + mu
  File "/usr/lib/python2.7/dist-packages/numpy/linalg/linalg.py", line 603, in cholesky
    return wrap(gufunc(a, signature=signature, extobj=extobj).astype(result_t))
  File "/usr/lib/python2.7/dist-packages/numpy/linalg/linalg.py", line 93, in _raise_linalgerror_nonposdef
    raise LinAlgError("Matrix is not positive definite")
numpy.linalg.linalg.LinAlgError: Matrix is not positive definite

I know that this has something to do with the random initialization because on the same data and for the same k, I sometimes do not get this error.

My data is a NumPy matrix with 128 columns and a variable number of rows. I am not constructing the covariance matrix myself, and hence have no control over it. Is there a way of getting rid of this error?

Avikalp Gupta
  • Since you are getting the error `Matrix is not positive definite`, the answers to a similar question might help. Take a look at: http://stackoverflow.com/questions/21604498/numpy-cholesky-decomposition-linalgerror – vcp Mar 04 '16 at 05:29

1 Answer

Try changing the minit parameter to 'points':

from scipy.cluster.vq import kmeans2
centroids, labels = kmeans2(obs, k, minit='points')
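
As a minimal, self-contained sketch of this suggestion, assuming synthetic random data in place of the real descriptors (the array `obs`, its shape, the seed, and `k = 5` are made up for illustration):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Hypothetical stand-in for the real descriptors: 50 points in 128 dimensions.
rng = np.random.default_rng(0)
obs = rng.random((50, 128))
k = 5

# minit='points' picks k rows of obs as the initial centroids,
# so no covariance matrix has to be estimated or factorized.
centroids, labels = kmeans2(obs, k, minit='points')

print(centroids.shape)  # (5, 128)
print(labels.shape)     # (50,)
```

Each row of `centroids` is one cluster centre, and `labels[i]` is the index of the centroid assigned to `obs[i]`.
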
Alireza Afzal Aghaei
  • Do you know why this solves it? I.e., why does the default of `minit="random"` (as in scipy v1.5.3) raise this? – bricoletc Oct 23 '20 at 09:45
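  • @bricoletc The problem stems from how the initial centroids are generated. With `minit="random"`, the centroids are drawn from a Gaussian whose mean and covariance are estimated from the data, which requires a Cholesky factorization of the covariance matrix (the `np.linalg.cholesky(cov)` call in the traceback). If that matrix is not positive definite, for example when there are fewer observations than dimensions, the factorization fails. With `minit="points"`, the algorithm uses some of the data points as the initial centroids, so no covariance matrix is needed and the problem goes away. – Alireza Afzal Aghaei Oct 23 '20 at 23:26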
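
The traceback in the question shows the error being raised by `np.linalg.cholesky(cov)`. A sample covariance matrix computed from n observations has rank at most n - 1, so with fewer than 129 descriptors of 128 dimensions it cannot be positive definite. The sketch below illustrates this on made-up random data (the shape and seed are assumptions, and the `np.cov` call is only assumed to mirror how the initializer builds `cov`). Whether the factorization actually raises depends on rounding, so the code reports the outcome instead of presuming it:

```python
import numpy as np

# Hypothetical data: fewer observations (10) than dimensions (128).
rng = np.random.default_rng(0)
data = rng.random((10, 128))

# Covariance with one row per observation (assumed to mirror the initializer).
cov = np.atleast_2d(np.cov(data, rowvar=False))

# The rank is at most n_observations - 1, far below 128, so cov is singular.
rank = np.linalg.matrix_rank(cov)
print(cov.shape, rank)

try:
    np.linalg.cholesky(cov)
    factorized = True
except np.linalg.LinAlgError as exc:
    # Typical outcome for a singular covariance matrix.
    factorized = False
    print("Cholesky failed:", exc)
```

If this is what happens with the real descriptors, images that yield only a few interest points would be the ones that trigger the error.
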