
I've recently implemented an autoencoder in numpy. I have checked all the gradients numerically and they seem correct, and the cost function also decreases at each iteration, as long as the learning rate is sufficiently small.

The problem:

As you know, an autoencoder gets an input x, and it tries to return something as close to x as possible.

Whenever my x is a row vector, it works very well. The cost function decreases to 0, and we get very good results. For example: when x = [[ 0.95023264 1. ]], the output I got after 10000 iterations was xhat = [[ 0.94972973 0.99932479]], and the cost function is about 10^-7.

However, when my x is not a row vector, even if it's a small 2-by-2 matrix, the output isn't close to the original x, and the cost function does not decrease to 0; instead, it plateaus.

Example:

When the input is x = [[ 0.37853141 1. ][ 0.59747807 1. ]], the output is xhat = [[ 0.48882265 0.9985147 ][ 0.48921648 0.99927143]]. You can see that the first column of xhat is not close to the first column of x; rather, it is close to the average of the first column of x. This seems to happen in all the tests that I ran. Also, the cost function plateaus at around 0.006; it will not reach 0.

Why does this happen, and how can I fix it? Again, the derivatives are correct. I'm not sure how to fix this.

My code

import numpy as np
import matplotlib.pyplot as plt

def g(x): #sigmoid activation function
    return 1/(1+np.exp(-x)) #same shape as x!

def gGradient(x): #gradient of sigmoid
    rows,cols = x.shape
    grad = np.zeros((cols, cols))
    for i in range(0, cols):
        grad[i, i] = g(x[0, i])*(1-g(x[0, i]))
    return grad

def cost(x, xhat): #mean squared error between x the data and xhat the output of the machine
    return ((x - xhat)**2).sum()/(2 * m)

m, n = 2, 1
trXNoBias = np.random.rand(m, n)
trX = np.ones((m, n+1))
trX[:, :n] = trXNoBias #add the bias, column of ones
n = n+1

k = 1 #num of neurons in the hidden layer of the autoencoder, shouldn't matter too much
numIter = 10000
learnRate = 0.001
x = trX
w1 = np.random.rand(n, k) #weights from input layer to hidden layer, shape (n, k)
w2 = np.random.rand(k, n) #weights from hidden layer to output layer of the autoencoder, shape (k, n)
w3 = np.random.rand(n, n) #weights from output layer of autoencoder to entire output of the machine, shape (n, n)

costArray = np.zeros((numIter, ))
for i in range(0, numIter):
    #Feed-Forward
    z1 = np.dot(x,w1) #output of the input layer, shape (m, k)
    h1 = g(z1) #input of hidden layer, shape (m, k)

    z2 = np.dot(h1, w2) #output of the hidden layer, shape (m, n)
    h2 = g(z2) #Output of the entire autoencoder. The output layer of the autoencoder. shape (m, n)

    xhat = np.dot(h2, w3) #the output of the machine, which hopefully resembles the original data x, shape (m, n)

    print(cost(x, xhat))
    costArray[i] = cost(x, xhat)

    #Backprop
    dSdxhat = (1/float(m)) * (xhat-x)
    dSdw3 = np.dot(h2.T, dSdxhat)
    dSdh2 = np.dot(dSdxhat, w3.T)
    dSdz2 = np.dot(dSdh2, gGradient(z2))
    dSdw2 = np.dot(h1.T,dSdz2)
    dSdh1 = np.dot(dSdz2, w2.T)
    dSdz1 = np.dot(dSdh1, gGradient(z1))
    dSdw1 = np.dot(x.T,dSdz1)

    w3 = w3 - learnRate * dSdw3
    w2 = w2 - learnRate * dSdw2
    w1 = w1 - learnRate * dSdw1

plt.plot(costArray)
plt.show()

print(x)
print(xhat)
Oria Gruber
  • First of all, why do you have **a single neuron** in the hidden layer? And the comment that it should not matter too much? It matters a lot. Furthermore, your model lacks biases in both the encoder (less important, since your data is augmented with "ones") and the decoder (more important). – lejlot Oct 25 '16 at 18:09
  • About the neuron: I said it shouldn't matter too much because even when I increase it to 5, 10, or even 100, it doesn't help with the convergence of the cost function. About the biases: could you explain? I added a column of ones to the data; I thought that takes care of the biases. – Oria Gruber Oct 25 '16 at 18:13
  • I'm also not sure if your gradients are at all correct. Why does the `gGradient` function return something of shape `(cols, cols)`? I think its output should be the same size as its input, something like `gGradient = lambda h: h*(1 - h)`. That's going to change the shapes in your backprop, suggesting to me that there's something wrong there, too. – hunse Oct 25 '16 at 18:13
  • This only works for linear models (adding 1s); it no longer applies to non-linear, multi-layer models. Each of your 3 layers should be of the form `h = w.dot(x) + b`, not just `h = w.dot(x)` (or, equivalently, you can append a column of 1s after each layer). In terms of gradients: please do not compute them by hand. This makes little sense when we have tools like autograd, which allows differentiation of numpy code; that way you can be sure there are no bugs. – lejlot Oct 25 '16 at 18:15
  • 1
    @lejiot: I think a lot can be learned by computing gradients by hand, and that's what Oria is going for here, so saying to not compute them by hand isn't very helpful in figuring out what's wrong with this code. – hunse Oct 25 '16 at 18:17
  • I agree with you completely about autograd, but this is an exercise; the entire point is to do it alone, which is proving rather difficult, but I think I got it correct. At least the differentiation part. About the biases, you may be correct. – Oria Gruber Oct 25 '16 at 18:18
  • One other thing to keep in mind: Single layer autoencoders often use only one set of weights. These weights connect the inputs to the hiddens, and then the transpose of those weights is used to get the outputs from the hiddens. This helps constrain the problem. Having two sets of weights can sometimes result in the autoencoder finding a local minimum and not learning anything useful. – hunse Oct 25 '16 at 18:22
  • This may seem odd, but I managed to solve it. Strangely, changing rand to randn seemed to do the trick. Is there any explanation as to why? I've checked several times now; it works correctly the vast majority of the time (sometimes I get stuck at a local minimum: the gradients are very close to zero, but xhat isn't very close to x). – Oria Gruber Oct 26 '16 at 01:04
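For reference, the elementwise sigmoid gradient that hunse suggests in the comments could be sketched as follows. This is an illustrative rewrite, not the asker's original code: `gGradient` here returns an array the same shape as its input (the original builds a `(cols, cols)` diagonal matrix from the first row only), and the `np.dot(..., gGradient(z))` steps in the backprop would become elementwise multiplications.

```python
import numpy as np

def g(z):
    """Sigmoid, applied elementwise."""
    return 1 / (1 + np.exp(-z))

def gGradient(z):
    """Elementwise sigmoid derivative: same shape as z, valid for any number of rows."""
    s = g(z)
    return s * (1 - s)

# In the backprop, the matrix products with gGradient would then become
# elementwise multiplications, e.g.:
#   dSdz2 = dSdh2 * gGradient(z2)
#   dSdz1 = dSdh1 * gGradient(z1)

z = np.array([[0.0, 1.0],
              [-1.0, 2.0]])
print(gGradient(z).shape)  # matches the input shape, (2, 2)
```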

0 Answers