
I am performing PCA using scikit-learn and GPflow, and I noticed that the outputs returned by the two libraries don't match.

Here is a sample code snippet:

import numpy as np
from gpflow.models import PCA_reduce
from sklearn.decomposition import PCA

X = np.random.random((100, 10))

for n in range(1, 6):
    X1 = PCA(n_components=n).fit_transform(X)
    X2 = PCA_reduce(X, n)
    print('[n=%d] allclose=%s' % (n, np.allclose(X1, X2)))

Below is the output:

[n=1] allclose=True
[n=2] allclose=False
[n=3] allclose=False
[n=4] allclose=False
[n=5] allclose=False

The outputs match only when the number of principal components is 1. Why is that?


1 Answer


Two different issues are conspiring here:

  1. The order of the eigenvalues is reversed between the two methods. In the sklearn implementation, eigenvectors are ordered by decreasing magnitude of their eigenvalues, while in the GPflow implementation, they are ordered by increasing magnitude. In particular, you should be comparing PCA(n).fit_transform(X) to PCA_reduce(X, n)[:, ::-1]. This also explains why you get what you expect when using only a single component. (A small sketch illustrating the ordering follows this list.)

  2. However, this in itself won't suffice: if $v$ is a unit-length eigenvector with a given eigenvalue, then so is $-v$, so you cannot simply use np.allclose to determine whether the results agree; you need to take the potential sign reversal into account. Instead, use a = np.all(np.isclose(X1, X2), 0) to compare the vectors directly, and b = np.all(np.isclose(X1, -X2), 0) (note the minus) to compare them with every vector in X2 sign-flipped. Then a | b is the condition that the vectors agree up to a sign, and np.all(a | b) checks that this holds for each eigenvector.
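To make the ordering difference in point 1 concrete, here is a minimal NumPy sketch (an illustration, not either library's actual code): np.linalg.eigh returns the eigenvalues of the covariance matrix in ascending order, while sklearn's PCA sorts components by decreasing explained variance.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.random((100, 10))

# np.linalg.eigh returns eigenvalues in ascending order,
# i.e. the smallest-variance direction comes first.
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
print(evals)

# sklearn orders components by decreasing explained variance instead,
# so this is the tail of evals, reversed.
print(PCA(n_components=5).fit(X).explained_variance_)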

Indeed, the following modification to your test spits out all trues:

for n in range(1, 6):
    X1 = PCA(n_components=n).fit_transform(X)
    X2 = PCA_reduce(X, n)[:, ::-1]
    print('[n=%d] allclose=%s' % (n, np.all(np.all(np.isclose(X1, X2), 0) | np.all(np.isclose(X1, -X2), 0))))

[n=1] allclose=True
[n=2] allclose=True
[n=3] allclose=True
[n=4] allclose=True
[n=5] allclose=True
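
If you need this comparison more than once, you can wrap it in a small helper; equal_up_to_column_sign below is a hypothetical name, not part of either library:

import numpy as np

def equal_up_to_column_sign(A, B, **isclose_kwargs):
    # True when every column of A matches the corresponding column of B,
    # allowing each column of B to be sign-flipped as a whole.
    same = np.all(np.isclose(A, B, **isclose_kwargs), axis=0)
    flipped = np.all(np.isclose(A, -B, **isclose_kwargs), axis=0)
    return bool(np.all(same | flipped))

With this, the loop body above reduces to print('[n=%d] allclose=%s' % (n, equal_up_to_column_sign(X1, X2))).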
  • Thanks, @fuglede, for the detailed explanation. I have a quick question: how did you come to know that the order of the eigenvalues is reversed for the two methods? – ravi Nov 04 '18 at 01:42
  • @RaviJoshi: I simply checked the outputs of both and noticed that they were reversed. For sklearn I knew what to expect. For GPflow, since it is open source, it was straightforward to [look up what the thing does](https://github.com/GPflow/GPflow/blob/20a006fa87954bd9d131be917ce0187bdf5fc9b5/gpflow/models/gplvm.py#L209) and thereby verify its behavior. – fuglede Nov 04 '18 at 20:23