
I am doing some work with a Theano-based autoencoder with one hidden layer, giving it samples from a mixture of Gaussians as input. I expected the output to reproduce the input, but I am not achieving that. My implementation was inspired by this tutorial. Is an autoencoder with only one hidden layer sufficient to recover an exact replica of the input?
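
For reference, the kind of input I am describing can be generated with something like the following sketch (the component means, covariances, and sample counts are illustrative assumptions, not my exact settings):

    import numpy as np

    # Two-component mixture of Gaussians in 2-D (illustrative values only).
    rng = np.random.RandomState(0)
    n_per_component = 500
    component_a = rng.multivariate_normal(mean=[0.0, 0.0],
                                          cov=[[1.0, 0.0], [0.0, 1.0]],
                                          size=n_per_component)
    component_b = rng.multivariate_normal(mean=[5.0, 5.0],
                                          cov=[[1.0, 0.5], [0.5, 1.0]],
                                          size=n_per_component)
    data = np.vstack([component_a, component_b]).astype('float32')
    rng.shuffle(data)   # interleave the two components before training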

My code looks like this:

def train(self, n_epochs=100, mini_batch_size=1, learning_rate=0.01):
    index = T.lscalar()
    x=T.matrix('x')
    params = [self.W, self.b1, self.b2]
    hidden = self.activation_function(T.dot(x, self.W)+self.b1)
    output = T.dot(hidden,T.transpose(self.W))+self.b2
    output = self.output_function(output)


    # Use mean square error
    L = T.sum((x - output) ** 2)
    cost = L.mean()

    updates=[]

    #Return gradient with respect to W, b1, b2.
    gparams = T.grad(cost,params)

    #Create a list of 2 tuples for updates.
    for param, gparam in zip(params, gparams):
        updates.append((param, param-learning_rate*gparam))

    #Train given a mini-batch of the data.
    train = th.function(inputs=[index], outputs=cost, updates=updates,
                        givens={x:self.X[index:index+mini_batch_size,:]})
    import time
    start_time = time.clock()
    acc_cost = []
    for epoch in xrange(n_epochs):

        #print "Epoch:", epoch
        for row in xrange(0,self.m, mini_batch_size):
            cost = train(row)
        acc_cost.append(cost)

    plt.plot(range(n_epochs), acc_cost)
    plt.ylabel("cost")
    plt.xlabel("epochs")
    plt.show()

    # Format input data for plotable format
    norm_data = self.X.get_value()
    plot_var1 = []
    plot_var1.append(norm_data[:,0])
    plot_var2 = []
    plot_var2.append(norm_data[:,1])
    plt.plot(plot_var1, plot_var2, 'ro')

    # Hidden output
    x=T.dmatrix('x')
    hidden = self.activation_function(T.dot(x,self.W)+self.b1)
    transformed_data = th.function(inputs=[x], outputs=[hidden])
    hidden_data = transformed_data(self.X.get_value())
    #print "hidden_output ", hidden_data[0]

    # final output
    y=T.dmatrix('y')
    W = T.transpose(self.W)
    output = self.activation_function(T.dot(y,W) + self.b2)
    transformed_data = th.function(inputs=[y], outputs=[output])
    output_data = transformed_data(hidden_data[0])[0]
    print "decoded_output ", output_data

    # Format output data for plotable format
    plot_var1 = []
    plot_var1.append(output_data[:,0])
    plot_var2 = []
    plot_var2.append(output_data[:,1])
    plt.plot(plot_var1, plot_var2, 'bo')
    plt.show()



Shyamkkhadka
  • Theoretically, a 1-hidden-layer MLP is a universal function approximator, and using deeper layers might improve the final model loss and generalization. However, there might be some issue with your implementation. Could you provide the code you used? – Kh40tiK Nov 23 '16 at 10:44
  • I can't paste my code here as comments allow only a few characters. – Shyamkkhadka Nov 23 '16 at 10:56
  • What does the training loss curve look like? – Kh40tiK Nov 23 '16 at 11:39
  • Sorry, due to some internet problem your previous answer seems to have been deleted? My question was: what is the significance of using two different weights, and why do we need to transpose self.w2? I think we need only one weight and can transpose it in the other layer. The learning curve seems good, with the cost gradually decreasing and saturating around 0 after roughly 2000 epochs. Thanks. – Shyamkkhadka Nov 23 '16 at 11:42

1 Answer


In your code:

    params = [self.W, self.b1, self.b2]
    hidden = self.activation_function(T.dot(x, self.W)+self.b1)
    output = T.dot(hidden,T.transpose(self.W))+self.b2

You are using the same weight matrix for both the encoder and the decoder. What about:

    params = [self.W1, self.W2, self.b1, self.b2]
    hidden = self.activation_function(T.dot(x, self.W1)+self.b1)
    output = T.dot(hidden,self.W2)+self.b2
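
For completeness, the untied parameters could be initialized along these lines inside the autoencoder's constructor (the layer sizes and initialization range are assumptions, not values from your code):

    import numpy as np
    import theano

    # Sketch: untied encoder/decoder parameters inside __init__ (sizes assumed).
    n_visible, n_hidden = 2, 10
    rng = np.random.RandomState(0)

    self.W1 = theano.shared(
        rng.uniform(-0.1, 0.1, (n_visible, n_hidden)).astype(theano.config.floatX),
        name='W1')
    self.W2 = theano.shared(
        rng.uniform(-0.1, 0.1, (n_hidden, n_visible)).astype(theano.config.floatX),
        name='W2')
    self.b1 = theano.shared(
        np.zeros(n_hidden, dtype=theano.config.floatX), name='b1')
    self.b2 = theano.shared(
        np.zeros(n_visible, dtype=theano.config.floatX), name='b2')

Note that W2 already has shape (n_hidden, n_visible) here, so no transpose is needed in the decoder.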

An autoencoder isn't PCA. If you want to use tied weights, it may be a good idea to constrain the weight matrix to be orthogonal.
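
For example, using the variable names from the question's train() method, a soft orthogonality penalty could be added to the cost like this (a sketch only; the penalty weight is an assumed hyperparameter):

    # Encourage W^T W to stay close to the identity (tied-weight case, sketch).
    n_hidden = self.W.get_value().shape[1]
    gram = T.dot(T.transpose(self.W), self.W)
    ortho_penalty = T.sum((gram - T.eye(n_hidden)) ** 2)
    penalty_weight = 0.01          # assumed value
    cost = L.mean() + penalty_weight * ortho_penalty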

Otherwise, making the AE deeper may help. With only one independent weight matrix, the proposed model can hardly act as a universal function approximator the way a 3-layer MLP can.
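
A deeper AE could look roughly like this (a sketch; the extra weights W3, W4 and biases b3, b4 are hypothetical additions, not part of your current class):

    # Sketch of a deeper autoencoder forward pass with two extra layers.
    hidden1 = self.activation_function(T.dot(x, self.W1) + self.b1)
    code    = self.activation_function(T.dot(hidden1, self.W2) + self.b2)
    hidden2 = self.activation_function(T.dot(code, self.W3) + self.b3)
    output  = self.output_function(T.dot(hidden2, self.W4) + self.b4)
    params  = [self.W1, self.W2, self.W3, self.W4,
               self.b1, self.b2, self.b3, self.b4]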

Kh40tiK
  • Well, I suppose we need only one weight matrix, because in the other (output) layer I simply transpose it, as described [here](http://deeplearning.net/tutorial/dA.html#autoencoders): W and W' are tied weights and are transposes of each other. So in your answer, what is the significance of transposing self.w2? – Shyamkkhadka Nov 23 '16 at 11:34
  • @Shyamkkhadka My bad, there is no need to transpose as long as the shapes match (I was copying too fast). The point of having two independent weights is to make it a 3-layer MLP, which is a universal function approximator. If you have 4 weights in your AE, you can have two of them be transposes of each other. – Kh40tiK Nov 23 '16 at 12:01
  • So in your view, if I want to implement a simple autoencoder with 3 layers, I should use two different weights (w1, w2), and to me that looks like a 3-layer MLP. Is that in accordance with the principle of an autoencoder, which I thought should have tied weights (transposes of each other)? – Shyamkkhadka Nov 23 '16 at 12:05
  • @Shyamkkhadka Who said an autoencoder **must** use tied weights? Only linear PCA is required to have *orthogonal* weights that are transposes of each other, while using a linear activation function. AEs can be viewed as an extension of PCA. In practice we could insert convolutional, recurrent, adversarial, noisy ... basically anything into an AE. – Kh40tiK Nov 23 '16 at 12:10
  • OK, maybe I was wrong about the autoencoder. I will try with 2 different weights then. – Shyamkkhadka Nov 23 '16 at 12:18
  • I tried with different weight matrices W1 and W2, but the output and input are still not the same. Can I get the output to match the input with just one hidden layer? The learning curve looks good. Please look at my gist https://gist.github.com/shyamkkhadka/080565367412194e34cba1890c1f3bf3. – Shyamkkhadka Nov 23 '16 at 14:09
  • @Shyamkkhadka Yes, with a large enough hidden layer (more hidden units) it will do better. And NO, there's no point in making them "the same". If you eventually drive the training loss to 0 with a large enough layer, your net will basically be overfitting and useless in a real application. I suggest you learn some basics of ML before proceeding. – Kh40tiK Nov 23 '16 at 14:58
  • I also tried with 3 hidden layer units; the output is the same for 3 or 1. Actually, I am first trying to get an exact replica of the input at the output layer; after that I will use the hidden-layer data for another purpose. I don't know what is wrong with the above program. Can you suggest anything? – Shyamkkhadka Nov 27 '16 at 08:39