
I've been exploring different dimensionality reduction algorithms, specifically PCA and t-SNE. I'm taking a small subset of the MNIST dataset (with ~780 dimensions) and attempting to reduce the raw data down to three dimensions so I can visualize it as a scatter plot. t-SNE is described in great detail here.

I'm using PCA as an intermediate dimensionality reduction step prior to t-SNE, as recommended by the original creators of t-SNE in the source code on their website.

I'm finding that t-SNE takes forever to run (10-15 minutes to go from a 2000 x 25 to a 2000 x 3 feature space), while PCA runs relatively quickly (a few seconds to go from 2000 x 780 to 2000 x 20).

Why is this the case? My theory is that in the PCA implementation (taken directly from the primary author's Python source code), he uses NumPy matrix operations (for example, the dot product of X.T and X) rather than element-wise Python loops:

import numpy as Math   # the reference code imports NumPy under the alias "Math"

def pca(X = Math.array([]), no_dims = 50):
    """Runs PCA on the NxD array X in order to reduce its dimensionality to no_dims dimensions."""

    print "Preprocessing the data using PCA..."
    (n, d) = X.shape;
    X = X - Math.tile(Math.mean(X, 0), (n, 1));     # centre each column
    (l, M) = Math.linalg.eig(Math.dot(X.T, X));     # eigendecomposition of the d x d scatter matrix
    Y = Math.dot(X, M[:, 0:no_dims]);               # project onto the first no_dims eigenvectors
    return Y;

As far as I recall, this is significantly more efficient than looping over individual elements, and it also means that relatively little data has to be held in memory at once, on the order of 2N values (where N is the number of rows), since you only need one row of X and one column of X.T at a time.
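
As a rough illustration of that gap, here is a self-contained timing sketch (it uses a smaller random array than in the question, and the exact timings will vary by machine):

import time
import numpy as np

X = np.random.rand(500, 100)

# Vectorized: a single BLAS-backed matrix product, as in the PCA code above.
t0 = time.time()
fast = np.dot(X.T, X)
print("np.dot:       %.4f s" % (time.time() - t0))

# The same computation written as explicit Python loops.
t0 = time.time()
d = X.shape[1]
slow = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        slow[i, j] = np.sum(X[:, i] * X[:, j])
print("Python loops: %.4f s" % (time.time() - t0))

print(np.allclose(fast, slow))  # True -- same result, very different cost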

However, I don't think this is the root cause. t-SNE certainly contains vectorized operations as well, for example when calculating the pairwise distances D:

D = Math.add(Math.add(-2 * Math.dot(X, X.T), sum_X).T, sum_X);

Or when calculating P (in the higher-dimensional space) and Q (in the lower-dimensional space). In t-SNE, however, you have to create two N x N matrices to store the pairwise distances between data points, one for the original high-dimensional representation and one for the reduced-dimensional representation.
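
To make that concrete, here is a rough sketch of that step (my own reconstruction, not the authors' code verbatim): sum_X holds the row-wise sums of squares, and the resulting matrix of squared Euclidean distances is N x N regardless of how many dimensions the points have.

import numpy as np

N = 2000
X = np.random.rand(N, 25)  # high-dimensional points (after the PCA step)
Y = np.random.rand(N, 3)   # low-dimensional embedding

def squared_distances(A):
    # ||a_i - a_j||^2 = ||a_i||^2 + ||a_j||^2 - 2 * a_i . a_j
    sum_A = np.sum(np.square(A), axis=1)
    return np.add(np.add(-2 * np.dot(A, A.T), sum_A).T, sum_A)

D_high = squared_distances(X)  # 2000 x 2000, used to build P
D_low = squared_distances(Y)   # 2000 x 2000, used to build Q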

In computing the gradient, you also have to create another N x N matrix called PQ, which is P - Q.

It seems to me that memory is the bottleneck here. t-SNE requires roughly 3N^2 values in memory. There is no way those matrices can fit in the CPU cache, so the algorithm suffers heavy cache misses and has to keep fetching values from main memory.
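
Working that out for the 2000-point example in the question (assuming 64-bit floats):

N = 2000
bytes_per_value = 8                      # float64
per_matrix = N * N * bytes_per_value     # 32,000,000 bytes = 32 MB
total = 3 * per_matrix                   # ~96 MB for P, Q and PQ together
print(per_matrix / 1e6, total / 1e6)     # 32.0 96.0 (MB)

That is far larger than a typical L2/L3 cache, although it still fits comfortably in RAM.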

Is this correct? How do I explain to a client or a reasonable non-technical person why t-SNE is slower than PCA?

The co-author's Python implementation is found here.

Yu Chen
  • From a cursory glance, it seems like `pca` is a series of vectorized ops and as such performant, while `t-sne` isn't. Probably off-topic, but we don't need that `tile` there, as NumPy broadcasting could be used at that step. – Divakar Aug 22 '17 at 18:48
  • Interesting. Regarding the `tile`, that was copied directly from the co-author's source code for PCA. `T-SNE` also has several vectorized operations. I'll edit the question to include some examples. – Yu Chen Aug 22 '17 at 18:53
  • My guess is that the core reason for t-SNE being slower is that there is no closed-form solution; it needs to iterate to approximate the answer. PCA is a straightforward calculation whose bottleneck (`eig`) has been optimised to the extreme. – Marijn van Vliet Aug 22 '17 at 19:04

2 Answers


t-SNE tries to lower the dimensionality while preserving the distributions of distances between elements.

This requires computing distances between all the points. The pairwise distance matrix has N^2 entries, where N is the number of examples.
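
For example, with the sizes from the question: N = 2000 points gives a 2000 x 2000 matrix with 4,000,000 entries, and the cost grows quadratically, so doubling N quadruples it. PCA, by contrast, works from the d x d scatter matrix (780 x 780 = 608,400 entries), whose size does not depend on N at all.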

Jakub Bartczuk

The main reason t-SNE is slower than PCA is that no analytical solution exists for the criterion being optimised. Instead, a solution must be approximated through gradient descent iterations.

In practice, this means lots of for loops. Not least the main iteration for loop in line 129, which runs up to max_iter=1000 times. Additionally, the x2p function iterates over all data points with a for loop.
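
To give a feel for why that matters, here is a stripped-down sketch of the per-iteration work (this is not the reference implementation: the real P affinities, momentum, gains and early exaggeration are all omitted; the point is only that every one of the max_iter iterations rebuilds several N x N matrices):

import numpy as np

N, max_iter = 2000, 1000                 # reduce these to make the sketch run quickly
P = np.full((N, N), 1.0 / (N * N))       # stand-in for the high-dimensional affinities
Y = 1e-4 * np.random.randn(N, 3)         # low-dimensional embedding being optimised

for it in range(max_iter):
    # N x N squared distances in the embedding space
    sum_Y = np.sum(np.square(Y), axis=1)
    num = 1.0 / (1.0 + np.add(np.add(-2.0 * np.dot(Y, Y.T), sum_Y).T, sum_Y))
    np.fill_diagonal(num, 0.0)
    Q = np.maximum(num / np.sum(num), 1e-12)   # another N x N matrix
    W = (P - Q) * num                          # and the "PQ"-style N x N term
    grad = 4.0 * (np.sum(W, axis=1)[:, None] * Y - np.dot(W, Y))
    Y = Y - 100.0 * grad                       # plain gradient-descent step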

The reference implementation is optimised for readability, not for computational speed. The authors link to an optimised Torch implementation as well, which should speed up the computation a lot. If you want to stay in pure Python, I recommend the implementation in Scikit-Learn, which should also be a lot faster.
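
For example, a PCA-then-t-SNE pipeline in scikit-learn might look like the following (a sketch assuming a reasonably recent scikit-learn; the parameter values are illustrative, not tuned):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(2000, 780)                        # stand-in for the MNIST subset

X25 = PCA(n_components=25).fit_transform(X)          # fast intermediate reduction
Y = TSNE(n_components=3, init="pca", random_state=0).fit_transform(X25)
print(Y.shape)                                       # (2000, 3)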

Marijn van Vliet
  • Thanks - this is very helpful and answers my question. Just out of curiosity, could you provide some thoughts on my question about memory complexity? Is it true that PCA in and of itself should only be `2N` memory complexity, whereas t-SNE appears to be `2N^2`? – Yu Chen Aug 23 '17 at 17:12