I've been exploring different dimensionality-reduction algorithms, specifically PCA and t-SNE. I'm taking a small subset of the MNIST dataset (with ~780 dimensions) and attempting to reduce the raw data down to three dimensions to visualize as a scatter plot. t-SNE is described in great detail here.
I'm using PCA as an intermediate dimensionality-reduction step prior to t-SNE, as described by the original creators of t-SNE in the source code on their website.
I'm finding that t-SNE takes forever to run (10-15 minutes to go from a 2000 x 25 to a 2000 x 3 feature space), while PCA runs relatively quickly (a few seconds for 2000 x 780 => 2000 x 20).
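For reference, here is a minimal sketch of the pipeline I'm describing, written against scikit-learn rather than the reference tsne.py (load_digits is just a small stand-in for my 2000 x ~780 MNIST subset):

# Minimal sketch of the PCA -> t-SNE pipeline (scikit-learn, not the reference code;
# load_digits stands in for the MNIST subset).
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = load_digits().data                                            # (1797, 64)

t0 = time.perf_counter()
X_pca = PCA(n_components=20).fit_transform(X)                     # intermediate reduction, essentially instant
t1 = time.perf_counter()
X_3d = TSNE(n_components=3, init="random").fit_transform(X_pca)   # the slow step
t2 = time.perf_counter()

print(f"PCA:   {t1 - t0:.2f} s")
print(f"t-SNE: {t2 - t1:.2f} s")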
Why is this the case? My theory is that in the PCA implementation (taken directly from the primary author's Python source code), he relies on NumPy's vectorized dot products over X and X.T:
import numpy as Math  # NumPy aliased as "Math", as in the quoted source

def pca(X=Math.array([]), no_dims=50):
    """Runs PCA on the NxD array X in order to reduce its dimensionality to no_dims dimensions."""
    print("Preprocessing the data using PCA...")
    (n, d) = X.shape
    X = X - Math.tile(Math.mean(X, 0), (n, 1))    # center each column
    (l, M) = Math.linalg.eig(Math.dot(X.T, X))    # eigendecomposition of the d x d matrix X.T X
    Y = Math.dot(X, M[:, 0:no_dims])              # project onto the leading no_dims eigenvectors
    return Y
As far as I recall, this is significantly more efficient than scalar operations, and it also means that only 2N (where N is the number of rows) of data has to be loaded into memory at a time (you need one row of X and one column of X.T).
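To sanity-check the vectorized half of that claim, here is a toy comparison (my own, not from the t-SNE source) of the same X.T @ X product computed once with NumPy's dot product and once with explicit Python loops:

# Toy comparison: one BLAS-backed matrix product versus scalar-style loops.
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))

t0 = time.perf_counter()
C_fast = X.T @ X                                # one call into optimized BLAS
t1 = time.perf_counter()

n, d = X.shape
C_slow = np.zeros((d, d))
for i in range(d):                              # equivalent scalar-style loops
    for j in range(d):
        C_slow[i, j] = sum(X[k, i] * X[k, j] for k in range(n))
t2 = time.perf_counter()

print(f"vectorized: {(t1 - t0) * 1e3:.2f} ms, looped: {(t2 - t1) * 1e3:.0f} ms")
print("same result:", np.allclose(C_fast, C_slow))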
However, I don't think this is the root reason. t-SNE definitely also contains vector operations, for example when calculating the pairwise distances D:
D = Math.add(Math.add(-2 * Math.dot(X, X.T), sum_X).T, sum_X);
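That line is just the vectorized form of the identity ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i.x_j; here is a self-contained check (sum_X plays the same role as in the quoted code, the rest is my own):

# Check of the pairwise-squared-distance trick against a naive double loop.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 25))

sum_X = np.sum(np.square(X), axis=1)                        # ||x_i||^2 per row
D = np.add(np.add(-2 * np.dot(X, X.T), sum_X).T, sum_X)     # N x N squared distances

D_loop = np.array([[np.sum((a - b) ** 2) for b in X] for a in X])
print(np.allclose(D, D_loop))                               # True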
The same is true when calculating P (in the original high-dimensional space) and Q (in the reduced low-dimensional space). In t-SNE, however, you have to create two N x N matrices to store the pairwise distances between every pair of data points, one for the original high-dimensional representation and one for the reduced-dimensional one.
In computing the gradient, you also have to create another N x N matrix called PQ, which is P - Q.
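To make those allocations concrete, here is a rough sketch of a single gradient evaluation (my own simplification, not the reference tsne.py; it assumes P is the precomputed N x N high-dimensional affinity matrix and Y is the current low-dimensional embedding):

# Sketch of one t-SNE gradient evaluation; constant factors are folded into
# the learning rate, as is common in implementations.
import numpy as np

def tsne_gradient(P, Y):
    n = Y.shape[0]
    # Student-t affinities in the low-dimensional space: yet another N x N matrix
    sum_Y = np.sum(np.square(Y), axis=1)
    num = 1.0 / (1.0 + np.add(np.add(-2.0 * np.dot(Y, Y.T), sum_Y).T, sum_Y))
    np.fill_diagonal(num, 0.0)
    Q = np.maximum(num / np.sum(num), 1e-12)    # N x N
    PQ = P - Q                                  # a third N x N matrix
    dY = np.zeros_like(Y)
    for i in range(n):
        # weighted sum of (y_i - y_j) over all j, weights (p_ij - q_ij) * num_ij
        dY[i, :] = np.sum(((PQ[:, i] * num[:, i])[:, np.newaxis]) * (Y[i, :] - Y), axis=0)
    return dY

Each call materializes num, Q, and PQ, all N x N, and the optimization repeats this for hundreds of iterations, whereas the PCA step builds a single d x d matrix once.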
It seems to me that memory complexity is the bottleneck here. t-SNE requires on the order of 3N^2 values in memory. There is no way this fits in cache, so the algorithm suffers constant cache misses and has to go out to main memory to retrieve the values.
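To put rough numbers on that (my own back-of-the-envelope figures, assuming float64 storage):

# Rough memory footprint of the N x N matrices for N = 2000
N = 2000
one_matrix_mb = N * N * 8 / 1e6                 # 8 bytes per float64 entry -> 32 MB
print(one_matrix_mb, "MB per N x N matrix")
print(3 * one_matrix_mb, "MB for P, Q and PQ together")
# Typical CPU last-level caches are in the single-digit to tens-of-MB range,
# so these matrices spill to main memory and get re-read on every iteration.

PCA, by contrast, forms a single 780 x 780 matrix (roughly 5 MB), decomposes it once, and is done.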
Is this correct? And how do I explain to a client, or any reasonably non-technical person, why t-SNE is slower than PCA?
The co-author's Python implementation is found here.