
I am using scikit-learn. The nature of my application is such that I do the fitting offline, and can then only use the resulting coefficients online (on the fly) to manually calculate various objectives.

The transform is simple: it is just `data * pca.components_`, i.e. a simple dot product. However, I have no idea how to perform the inverse transform. Which field of the PCA object contains the relevant coefficients for the inverse transform? How do I calculate the inverse transform?

Specifically, I am referring to the PCA.inverse_transform() method of the sklearn.decomposition.PCA class: how can I manually reproduce its functionality using the coefficients computed by the PCA?

  • Inverse transform is present in the PCA module of scikit-learn, I just want to be able to run it manually. What it does is take a data point from the reduced space and map it back (with information loss, of course) to the original space. – Baron Yugovich Sep 23 '15 at 23:26
  • I don't think so. The matrix dimensions don't work out, to begin with. – Baron Yugovich Sep 24 '15 at 00:50

1 Answer


1) transform is not data * pca.components_.

Firstly, `*` is not the dot product for NumPy arrays; it is element-wise multiplication. To perform a dot product, you need to use np.dot.

Secondly, the shape of PCA.components_ is (n_components, n_features), while the shape of the data to transform is (n_samples, n_features), so you need to transpose PCA.components_ to perform the dot product.

Moreover, the first step of transform is to subtract the mean, so if you do it manually, you also need to subtract the mean first.

The correct way to transform is

data_reduced = np.dot(data - pca.mean_, pca.components_.T)
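As a quick sanity check (on randomly generated toy data, so the shapes here are just illustrative), the manual projection matches pca.transform exactly:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.rand(10, 5)  # 10 samples, 5 features

pca = PCA(n_components=3).fit(data)

# Manual transform: center the data, then project onto the components
manual = np.dot(data - pca.mean_, pca.components_.T)

print(np.allclose(manual, pca.transform(data)))  # True
```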

2) inverse_transform is just the inverse process of transform

data_original = np.dot(data_reduced, pca.components_) + pca.mean_
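Likewise, a minimal round-trip check (again on toy data) confirms this matches PCA.inverse_transform. Note that the reconstruction is lossy whenever n_components < n_features, but it agrees with the library's own inverse:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.rand(10, 5)

pca = PCA(n_components=3).fit(data)
data_reduced = pca.transform(data)

# Manual inverse transform: project back and add the mean
manual_original = np.dot(data_reduced, pca.components_) + pca.mean_

print(np.allclose(manual_original, pca.inverse_transform(data_reduced)))  # True
```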

If your data already has zero mean in each column, you can omit pca.mean_ above, for example:

import numpy as np
from sklearn.decomposition import PCA

# `data` is assumed to be an (n_samples, n_features) array
# whose columns already have zero mean
pca = PCA(n_components=3)
pca.fit(data)

data_reduced = np.dot(data, pca.components_.T)         # transform
data_original = np.dot(data_reduced, pca.components_)  # inverse_transform
  • When writing * above, I was not writing code, but pseudocode, i.e. writing the idea informally. As for subtracting the mean, that's understood, right? X, the input matrix, should already have each column with mean 0 and stdev 1, i.e. it is already standardized anyway, right? Thus, further tampering with the mean would not be necessary. However, if you're trying to express how to transform the original data, before standardization, can you please write it in a cleaner, more step-by-step process? Then I am ready to accept your answer. – Baron Yugovich Sep 24 '15 at 12:32
  • Yes, if your data already has each column with mean 0, you do not need to tamper with the mean. The steps are actually simple; I have provided a more complete example, please point out if you are unclear about any part. – yangjie Sep 24 '15 at 14:00
  • One more question: you address the mean, but how about the variance? You don't mention anything about ensuring that st.dev=1. – Baron Yugovich Sep 24 '15 at 14:46
  • It is not necessary to ensure std=1. The PCA implemented by scikit-learn only centers the data but does not scale it. You can check that by looking at the source code https://github.com/scikit-learn/scikit-learn/blob/a95203b/sklearn/decomposition/pca.py#L99 – yangjie Sep 24 '15 at 15:13
  • You can normalize the data as preprocessing, but that has nothing to do with the PCA transform itself. What `inverse_transform` returns is only the preprocessed data. – yangjie Sep 24 '15 at 15:21
  • For doing it manually, and truncating dimensions, `data_reduced = np.dot(data, pca.components_.T[:,:dim])`, and back `data_original = np.dot(data_reduced, pca.components_[:dim, :])` – Gulzar Nov 30 '19 at 20:38
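The truncation trick in the last comment can be checked against fitting a smaller PCA directly. A sketch on toy data (here `dim` is just the number of components to keep, and the data is pre-centered so the mean term drops out):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.rand(20, 6)
data = data - data.mean(axis=0)  # center so pca.mean_ can be ignored
dim = 2

pca_full = PCA().fit(data)

# Keep only the first `dim` components when projecting and reconstructing
data_reduced = np.dot(data, pca_full.components_.T[:, :dim])
data_original = np.dot(data_reduced, pca_full.components_[:dim, :])

# This gives the same reconstruction as fitting PCA(n_components=dim) directly
pca_small = PCA(n_components=dim).fit(data)
print(np.allclose(data_original,
                  pca_small.inverse_transform(pca_small.transform(data))))
```

The two agree because projecting onto the span of the top `dim` components is the same operation whether the remaining components were computed or not.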