
The only example of a maxout implementation in Theano that I have found is at this link. My understanding is that I can use any activation function, and that maxout is then just a post-processing step applied to the hidden layer outputs.
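
To make sure I have the mechanics right, here is a tiny NumPy illustration of how I picture the pooling (the pool size of 2 and the numbers are assumptions on my part, not taken from the link):

import numpy as np

# toy "hidden layer output": 1 example, 6 units
h = np.array([[0.3, -0.1, 0.8, 0.2, -0.5, 0.4]])

# with a pool size of 2 the units are pooled in pairs and only the larger
# value of each pair is kept, so 6 units become 3 maxout outputs
pool_size = 2
pooled = np.maximum(h[:, 0::pool_size], h[:, 1::pool_size])
# pooled -> [[0.3, 0.8, 0.4]]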

I tried to apply this to my own HiddenLayer class. Below is the class before maxout:

class HiddenLayer(object):

    def __init__(self, rng, input, n_in, n_out, W=None, b=None, activation=T.tanh):
        '''
        Initialise the Hidden Layer
        Parameters:
        rng        - random number generator
        input      - input values from the preceding layer
        n_in       - number of input nodes (number of nodes of the preceding layer)
        n_out      - number of output nodes (number of nodes of this hidden layer)
        W          - the Weights of the layer
        b          - the bias of the layer
        activation - the activation function, e.g. T.tanh or relu
        '''
        self.input = input

        W, b = self.init_weights(rng, n_in, n_out, W, b, activation) # initialise the weights and bias of the hidden layer

        self.W = W; self.b = b;

        lin_output = T.dot(input, self.W) + self.b 

        self.output = (lin_output if activation is None else activation(lin_output))

        # parameters of the model
        self.params = [self.W, self.b]

If I understood the link correctly, the class with maxout implemented should look like the version below. Is this correct? If not, could you point out which part I misunderstood?

class HiddenLayer(object):

    def __init__(self, rng, input, n_in, n_out, W=None, b=None, activation=T.tanh, maxout=False):
        '''
        maxout     - whether to apply maxout after the activation function
        '''
        self.input = input

        W, b = self.init_weights(rng, n_in, n_out, W, b, activation) # initialise the weights and bias of the hidden layer

        self.W = W; self.b = b;

        lin_output = T.dot(input, self.W) + self.b 

        self.output = (lin_output if activation is None else activation(lin_output))

        if maxout:  # apply maxout on top of the activated hidden layer output
            maxout_out = None
            maxoutsize = n_out
            for i in xrange(maxoutsize):
                t = self.output[:, i::maxoutsize]
                if maxout_out is None:
                    maxout_out = t
                else:
                    maxout_out = T.maximum(maxout_out, t)
            self.output = maxout_out

        # parameters of the model
        self.params = [self.W, self.b]
Zhubarb
  • Looks like it could work. Where is the problem? Why don't you just reshape and do one single `T.maximum` operation over an appropriate axis (e.g. `T.max(output.reshape((output.shape[0], -1, maxoutsize)), axis=2)`)? – eickenberg Mar 01 '16 at 08:44
  • I am confused about the `maxoutsize`; in the code I set it to `n_out` (the number of output nodes, i.e. the number of nodes of this hidden layer). Is this right? And regarding the `T.maximum` operation, you are suggesting replacing the for loop with that, am I right? – Zhubarb Mar 01 '16 at 08:45
  • Well it looks to me more like the number of units you want to pool over. This number *times* the number of output nodes should give you the number of input nodes. I would start with `maxoutsize = 2` – eickenberg Mar 01 '16 at 08:46
  • @eickenberg ok, thank you. And am I right in understanding that `maxout` is a post-processing step and can be used with any activation function (that initially converts `lin_output` to `self.output` in the code)? – Zhubarb Mar 01 '16 at 08:50
  • Well it can actually replace an activation function. You can plug it directly onto a linear layer, and it will create the appropriate nonlinearity. If you take the max over two channels, it can act like a rectifier if necessary, or an absolute value, or something much more general that you can obtain by taking the max of two linear functions. And in the limit of a high `maxoutsize` you can approximate any convex activation function. – eickenberg Mar 01 '16 at 08:54
  • @eickenberg, thank you. What confused me in the first place was that in the linked code the author first calls `output = activation(T.dot(input,W) + b)` and then feeds this into `maxout`. My initial understanding from reading the paper was as you stated above in your comment. – Zhubarb Mar 01 '16 at 08:59
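
Based on eickenberg's comments above, here is a minimal, self-contained sketch of the single-reshape version (the pool size of 2, the variable names and the test values are assumptions for illustration, not taken from the code in the link):

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')   # (batch, n_units) linear layer output to pool over
pool_size = 2       # assumption: pool in pairs, so n_units must be divisible by pool_size

# reshape to (batch, n_units // pool_size, pool_size) and take the max over the
# last axis: a single T.max call instead of the Python loop in the question
maxout = T.max(x.reshape((x.shape[0], -1, pool_size)), axis=2)

f = theano.function([x], maxout)
print(f(np.array([[0.3, -0.1, 0.8, 0.2, -0.5, 0.4]], dtype=theano.config.floatX)))
# prints something like [[ 0.3  0.8  0.4]]

As the last comments note, this pooling can be applied directly to the linear output `T.dot(input, W) + b`, in which case it replaces the activation function rather than following it.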

0 Answers