I'm working on a project where I need to reduce the dimensionality of my observations while still keeping a meaningful representation of them. The use of autoencoders was strongly suggested for many reasons, but I'm not quite sure it's the best approach.
I have 1400 samples of dimension ~60,000, which is far too high, and I'm trying to reduce their dimensionality to 10% of the original. I'm using Theano autoencoders [Link] and it seems that the cost stays around 30,000 (which is very high). I tried raising the number of epochs and lowering the learning rate, with no success. I'm not an expert on autoencoders, so I'm not sure how to proceed from here or when to just stop trying.
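To put the 30,000 figure in perspective, I've been sanity-checking it per pixel, roughly like this (this assumes the cost is a reconstruction error summed over the ~60,000 input features and averaged over samples; X and X_rec stand for my data matrix and its reconstruction):

```python
import numpy as np

# Assumption: the reported cost is summed over the ~60,000 features and
# averaged over samples. X is the (n_samples x n_features) data matrix in
# [0, 1]; X_rec is the autoencoder's reconstruction of X.
def per_pixel_cost(X, X_rec):
    summed = np.mean(np.sum((X - X_rec) ** 2, axis=1))  # cost on the "summed over features" scale
    return summed, summed / X.shape[1]                   # total and per-pixel cost
```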
There are other tests I can run but, before going any further, I'd like to get some input from you.
Do you think the dataset is too small (I can add another 600 samples for a total of ~2000)?
Do you think using stacked autoencoders could help?
Should I keep tweaking the parameters (epochs and learning rate)? See the sketch after these questions for the kind of sweep I have in mind.
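The sweep I'm considering is roughly the following (train_autoencoder is just a placeholder for my Theano training loop, assumed to return the final reconstruction cost):

```python
# Placeholder sketch: train_autoencoder stands for my Theano training loop
# and is assumed to return the final reconstruction cost on the training set.
learning_rates = [0.1, 0.01, 0.001]
epoch_counts = [100, 500, 1000]

results = {}
for lr in learning_rates:
    for n_epochs in epoch_counts:
        cost = train_autoencoder(X, learning_rate=lr, n_epochs=n_epochs)
        results[(lr, n_epochs)] = cost
        print('lr=%g, epochs=%d -> final cost %.1f' % (lr, n_epochs, cost))
```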
Since the dataset is an ensemble of pictures, I tried to visualize the reconstructions from the autoencoder, and all I got was the same output for every sample. Given an input, the autoencoder should rebuild that input, but instead I get (almost exactly) the same image for any input, which looks like an average of all the images in the dataset. This tells me the inner representation is not good enough, since the autoencoder can't reconstruct the image from it.
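For reference, this is roughly how I'm doing the visualization (HEIGHT and WIDTH are placeholders for the actual image shape; their product has to match the ~60,000-element feature vector):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder shape: HEIGHT * WIDTH must equal the feature vector length
# (240 * 250 = 60,000 here, just as an example).
HEIGHT, WIDTH = 240, 250

def show_pair(x, x_rec):
    """Plot an input vector next to its reconstruction as grayscale images."""
    fig, axes = plt.subplots(1, 2)
    axes[0].imshow(x.reshape(HEIGHT, WIDTH), cmap='gray', vmin=0, vmax=1)
    axes[0].set_title('input')
    axes[1].imshow(x_rec.reshape(HEIGHT, WIDTH), cmap='gray', vmin=0, vmax=1)
    axes[1].set_title('reconstruction')
    plt.show()
```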
The dataset: 1400-2000 images of scanned books (covers included) of around ~60,000 pixels each (which translates to a feature vector of 60,000 elements). Each feature vector has been normalized to [0,1]; the original values were in [0,255].
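The normalization itself is just a division by 255 (X_raw stands for my raw pixel matrix):

```python
import numpy as np

# X_raw: (n_samples, ~60000) array of grayscale pixel values in [0, 255]
X = X_raw.astype(np.float32) / 255.0   # normalized feature vectors in [0, 1]
```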
The problem: Reduce their dimensionality with Autoencoders (if possible)
If you need any extra info or if I missed something that might be useful to better understand the problem, please add a comment and I will happily help you help me =).
Note: I'm currently running a test with a higher number of epochs on the whole dataset and I will update my post with the results; it might take a while though.