I'm working on a project where I need to reduce the dimensionality of my observations while still keeping a meaningful representation of them. The use of autoencoders was strongly suggested for many reasons, but I'm not quite sure it's the best approach.
I have 1400 samples of dimension ~60,000, which is far too high, and I'm trying to reduce their dimensionality to 10% of the original. I'm using Theano autoencoders [Link] and it seems that the cost stays around 30,000 (which is very high). I tried raising the number of epochs and lowering the learning rate, with no success. I'm not an expert on autoencoders, so I'm not sure how to proceed from here or when to just stop trying.
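To put the 30,000 figure in perspective, I've been sanity-checking it per pixel, roughly like this (this assumes the cost is a reconstruction error summed over the ~60,000 input features and averaged over samples; X and X_rec stand for my data matrix and its reconstruction):

```python
import numpy as np

# Assumption: the reported cost is summed over the ~60,000 features and
# averaged over samples. X is the (n_samples x n_features) data matrix in
# [0, 1]; X_rec is the autoencoder's reconstruction of X.
def per_pixel_cost(X, X_rec):
    summed = np.mean(np.sum((X - X_rec) ** 2, axis=1))  # cost on the "summed over features" scale
    return summed, summed / X.shape[1]                   # total and per-pixel cost
```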
There are other tests I can run but, before going any further, I'd like to get some input from you.
Do you think the dataset is too small (I can add another 600 samples for a total of ~2000)?
Do you think using stacked autoencoders could help?
Should I keep tweaking the parameters (epochs and learning rate)? See the sketch after these questions for the kind of sweep I have in mind.
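The sweep I'm considering is roughly the following (train_autoencoder is just a placeholder for my Theano training loop, assumed to return the final reconstruction cost):

```python
# Placeholder sketch: train_autoencoder stands for my Theano training loop
# and is assumed to return the final reconstruction cost on the training set.
learning_rates = [0.1, 0.01, 0.001]
epoch_counts = [100, 500, 1000]

results = {}
for lr in learning_rates:
    for n_epochs in epoch_counts:
        cost = train_autoencoder(X, learning_rate=lr, n_epochs=n_epochs)
        results[(lr, n_epochs)] = cost
        print('lr=%g, epochs=%d -> final cost %.1f' % (lr, n_epochs, cost))
```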
Since the dataset is an ensemble of pictures, I tried to visualize the reconstructions from the autoencoder, and all I got was the same output for every sample. Given an input, the autoencoder should rebuild that input, but instead I get (almost exactly) the same image for any input, which looks like an average of all the images in the dataset. This tells me the inner representation is not good enough, since the autoencoder can't reconstruct the image from it.
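For reference, this is roughly how I'm doing the visualization (HEIGHT and WIDTH are placeholders for the actual image shape; their product has to match the ~60,000-element feature vector):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder shape: HEIGHT * WIDTH must equal the feature vector length
# (240 * 250 = 60,000 here, just as an example).
HEIGHT, WIDTH = 240, 250

def show_pair(x, x_rec):
    """Plot an input vector next to its reconstruction as grayscale images."""
    fig, axes = plt.subplots(1, 2)
    axes[0].imshow(x.reshape(HEIGHT, WIDTH), cmap='gray', vmin=0, vmax=1)
    axes[0].set_title('input')
    axes[1].imshow(x_rec.reshape(HEIGHT, WIDTH), cmap='gray', vmin=0, vmax=1)
    axes[1].set_title('reconstruction')
    plt.show()
```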
The dataset: 1400-2000 images of scanned books (covers included) of around ~60,000 pixels each (which translates to a feature vector of 60,000 elements). Each feature vector has been normalized to [0,1]; the original values were in [0,255].
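The normalization itself is just a division by 255 (X_raw stands for my raw pixel matrix):

```python
import numpy as np

# X_raw: (n_samples, ~60000) array of grayscale pixel values in [0, 255]
X = X_raw.astype(np.float32) / 255.0   # normalized feature vectors in [0, 1]
```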
The problem: Reduce their dimensionality with Autoencoders (if possible)
If you need any extra info or if I missed something that might be useful to better understand the problem, please add a comment and I will happily help you help me =).
Note: I'm currently running a test with a higher number of epochs on the whole dataset and I will update my post with the results; it might take a while though.