I've recently implemented an autoencoder in numpy. I have checked all the gradients numerically and they seem correct, and the cost function also seems to decrease at each iteration, if the learning rate is sufficiently small.
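For context, the numerical check was a standard central-difference comparison, along these lines (a minimal sketch rather than my exact checking code; numGrad, f and eps are just illustrative names):

import numpy as np

def numGrad(f, w, eps=1e-5):
    #central-difference estimate of the gradient of the scalar f() w.r.t. w, same shape as w
    grad = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        old = w[idx]
        w[idx] = old + eps
        fPlus = f()          #cost with w[idx] nudged up
        w[idx] = old - eps
        fMinus = f()         #cost with w[idx] nudged down
        w[idx] = old         #restore the original weight
        grad[idx] = (fPlus - fMinus) / (2 * eps)
    return grad

#I compared this against the analytic gradients (dSdw1, dSdw2, dSdw3 below) with np.allclose.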
The problem:
As you know, an autoencoder gets an input x, and it tries to return something as close to x as possible.
Whenever my x is a row vector, it works very well. The cost function decreases to 0, and we get very good results. For example, when x = [[ 0.95023264 1. ]], the output I got after 10000 iterations was xhat = [[ 0.94972973 0.99932479]], and the cost function is about 10^-7.
However, when my x is not a row vector, even if it's a small 2-by-2 matrix, the output isn't close to the original x, and the cost function does not decrease to 0; instead, it plateaus.
Example:
When the input is x = [[ 0.37853141 1. ][ 0.59747807 1. ]], the output is xhat = [[ 0.48882265 0.9985147 ][ 0.48921648 0.99927143]]. You can see that the first column of xhat is not close to the first column of x; rather, both of its entries are close to the average of the first column of x. This seems to happen in all the tests I ran. Also, the cost function plateaus at around 0.006 and never reaches 0.
Why does this happen, and how can I fix it? Again, the derivatives are correct, so I'm not sure what is going wrong.
My code
import numpy as np
import matplotlib.pyplot as plt
def g(x): #sigmoid activation function
    return 1/(1+np.exp(-x)) #same shape as x!
def gGradient(x): #gradient of sigmoid
    rows, cols = x.shape
    grad = np.zeros((cols, cols))
    for i in range(0, cols):
        grad[i, i] = g(x[0, i])*(1-g(x[0, i]))
    return grad
def cost(x, xhat): #mean squared error between x the data and xhat the output of the machine
    return ((x - xhat)**2).sum()/(2 * m)
m, n = 2, 1
trXNoBias = np.random.rand(m, n)
trX = np.ones((m, n+1))
trX[:, :n] = trXNoBias #add the bias, column of ones
n = n+1
k = 1 #num of neurons in the hidden layer of the autoencoder, shouldn't matter too much
numIter = 10000
learnRate = 0.001
x = trX
w1 = np.random.rand(n, k) #weights from input layer to hidden layer, shape (n, k)
w2 = np.random.rand(k, n) #weights from hidden layer to output layer of the autoencoder, shape (k, n)
w3 = np.random.rand(n, n) #weights from output layer of autoencoder to entire output of the machine, shape (n, n)
costArray = np.zeros((numIter, ))
for i in range(0, numIter):
    #Feed-Forward
    z1 = np.dot(x, w1) #output of the input layer, shape (m, k)
    h1 = g(z1) #input of hidden layer, shape (m, k)
    z2 = np.dot(h1, w2) #output of the hidden layer, shape (m, n)
    h2 = g(z2) #Output of the entire autoencoder. The output layer of the autoencoder. shape (m, n)
    xhat = np.dot(h2, w3) #the output of the machine, which hopefully resembles the original data x, shape (m, n)
    print(cost(x, xhat))
    costArray[i] = cost(x, xhat)

    #Backprop
    dSdxhat = (1/float(m)) * (xhat-x)
    dSdw3 = np.dot(h2.T, dSdxhat)
    dSdh2 = np.dot(dSdxhat, w3.T)
    dSdz2 = np.dot(dSdh2, gGradient(z2))
    dSdw2 = np.dot(h1.T, dSdz2)
    dSdh1 = np.dot(dSdz2, w2.T)
    dSdz1 = np.dot(dSdh1, gGradient(z1))
    dSdw1 = np.dot(x.T, dSdz1)

    w3 = w3 - learnRate * dSdw3
    w2 = w2 - learnRate * dSdw2
    w1 = w1 - learnRate * dSdw1
plt.plot(costArray)
plt.show()
print(x)
print(xhat)