
I’ve inherited some code that was written about a year ago, so I guess back then it was using numpy 1.13 (now v1.15.2), scipy 1.0.0rc (now v1.1.0), and sklearn 0.19 (now v0.20.0).

It implements Fisher’s LDA to reduce an n-dimensional space to a 1…(n−1)-dimensional space, which produces a numpy array of complex numbers as its result (due to floating-point imprecision). That array is then cherry-picked and fed into sklearn.cluster.MeanShift, which immediately throws an exception:

  File "/…/lib/python3.6/site-packages/sklearn/cluster/mean_shift_.py", line 416, in fit
    X = check_array(X)
  File "/…/lib/python3.6/site-packages/sklearn/utils/validation.py", line 531, in check_array
    _ensure_no_complex_data(array)
  File "/…/lib/python3.6/site-packages/sklearn/utils/validation.py", line 354, in _ensure_no_complex_data
    "{}\n".format(array))
ValueError: Complex data not supported
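The failure is easy to reproduce with a minimal sketch; the array values below are made up, modeled on the entries I see in `X` (e.g. `1.70078660e+00+0.j`):

```python
import numpy as np
from sklearn.cluster import MeanShift

# Tiny complex array standing in for the LDA output
# (illustrative values, not from the actual code).
X = np.array([[1.70078660 + 0.0j, 3.10415554 - 0.31921105j],
              [1.50000000 + 0.0j, 2.90000000 + 0.0j]])

try:
    MeanShift().fit(X)
except ValueError as e:
    print(e)  # "Complex data not supported" followed by the offending array
```

`check_array` rejects the input before any clustering happens, so even a single complex entry is enough to trigger this.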

I am still learning the mathematical details of what’s going on here, but it strikes me as odd that this code was declared “runnable”.

Am I missing something here? Have version changes brought about this regression, or is there a more fundamental code flaw? How would I go about fixing this issue?

Jens
  • @PaulPanzer, many values in `X` are `1.70078660e+00+0.j`, and a fairly small number of values is e.g. `3.10415554e+00-0.31921105j`. I could just call [`X.real`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.real.html) and proceed with that result, and see what happens… ? – Jens Oct 03 '18 at 00:27
  • The complex arrays are the Eigenvalues and Eigenvectors of the covariance matrix in LDA. In fact, there is a `warnings.simplefilter('ignore', np.ComplexWarning)` right after, which indicates some intent to deal with complex results. However, a few lines down the code then raises the assertion, rendering it IMO not “runnable”. (PS: `np.iscomplex(X)` returns an array of rather sparse `True` values.) – Jens Oct 03 '18 at 00:45
  • Well… `eigvals, eigvecs = np.linalg.eig(Y)` where `Y` here is real (`np.iscomplexobj(Y) → False`), and the results are then complex. – Jens Oct 03 '18 at 01:04
  • Hmm It seems that `np.allclose(Y, Y.T)` gives me `False` (according to [this answer](https://stackoverflow.com/questions/42908334/checking-if-a-matrix-is-symmetric-in-numpy#42913743)), i.e. `Y` isn’t symmetric. However, `eigh(Y)` does return a real matrix. – Jens Oct 03 '18 at 01:38
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/181190/discussion-between-jens-and-paul-panzer). – Jens Oct 03 '18 at 03:35
  • I've made an answer, since one doesn't get notified of new chat entries. – Paul Panzer Oct 06 '18 at 08:52
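The checks traded in the comments above can be reproduced with numpy alone (`Y` here is a random stand-in for the real, non-symmetric matrix from the code, not the actual covariance product):

```python
import numpy as np

# Stand-in for Y from the comments: real but non-symmetric.
rng = np.random.RandomState(1)
Y = rng.randn(3, 3)

print(np.iscomplexobj(Y))   # False -- the input itself is real
print(np.allclose(Y, Y.T))  # False -- Y is not symmetric

# eig on a real non-symmetric matrix may legitimately return
# complex conjugate eigenvalue pairs.
eigvals, eigvecs = np.linalg.eig(Y)

# eigh reads only one triangle of its input and always returns real
# results -- which is only meaningful if Y actually is symmetric.
w, V = np.linalg.eigh(Y)
print(np.iscomplexobj(w))   # False, but eigh silently ignored half of Y
```

So `eigh(Y)` returning a real result does not by itself prove anything about `Y`; it only decomposes the symmetric matrix implied by one triangle of `Y`.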

1 Answer


In comments/chat we have identified at least one problem which is that the numerical eigen decomposition of

(cov_w + I)^-1 @ cov_b                       (1)

is not real as it should be but returns significant imaginary components. Here @ is matrix multiplication, cov_w and cov_b are covariance matrices, and I is the identity matrix. This can be fixed by computing the matrix square root of (cov_w + I)^-1, let’s call it SQ, and then using the fact that (1) is similar to

SQ @ cov_b @ SQ                              (2)

hence has the same eigenvalues; and if V holds the eigenvectors of (2), then the (right) eigenvectors of (1) are SQ @ V.

What we have gained is that because (2) is a symmetric matrix, its eigen decomposition can be computed using numpy.linalg.eigh, which guarantees purely real results. eigh can also be used to compute SQ, see here. Be sure to bypass the inverse and apply eigh directly to cov_w + I, or even to cov_w.
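A numpy-only sketch of this recipe; cov_w and cov_b below are small made-up stand-ins for the actual covariance matrices:

```python
import numpy as np

# Made-up symmetric PSD matrices standing in for the LDA covariances.
rng = np.random.RandomState(0)
A, B = rng.randn(4, 4), rng.randn(4, 4)
cov_w = A @ A.T
cov_b = B @ B.T
I = np.eye(4)

# Naive route: eig on the non-symmetric product (1) may return
# spurious imaginary components.
naive_vals, naive_vecs = np.linalg.eig(np.linalg.inv(cov_w + I) @ cov_b)

# Stable route: build SQ = (cov_w + I)^(-1/2) from the eigh
# decomposition of cov_w + I, bypassing the explicit inverse.
w, U = np.linalg.eigh(cov_w + I)   # w >= 1 since cov_w is PSD
SQ = (U / np.sqrt(w)) @ U.T        # symmetric square root of the inverse

# (2) is symmetric, so eigh guarantees purely real results.
vals, V = np.linalg.eigh(SQ @ cov_b @ SQ)

# The (right) eigenvectors of (1) are SQ @ V.
vecs = SQ @ V

assert not np.iscomplexobj(vals)
# Sanity check: these really are eigenpairs of (1).
M = np.linalg.inv(cov_w + I) @ cov_b
assert np.allclose(M @ vecs, vecs * vals)
```

The key step is `(U / np.sqrt(w)) @ U.T`, which forms U diag(1/√w) Uᵀ directly from the eigh output, so neither the inverse nor the square root is ever computed on a non-symmetric matrix.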

Paul Panzer