
I have been playing around in TensorFlow and made a generic fully connected model.

At each layer I'm applying

sigmoid(WX + B)

which, as everybody knows, works well.

I then started messing around with the function that is applied at each layer and found that functions such as

sigmoid(U(X^2) + WX + B)

work just as well when they are optimized.
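
For concreteness, here is roughly what such a layer could look like written as a Keras-style layer (the name QuadraticDense and the exact calls are just for illustration, not my exact code; X^2 is taken element-wise):

    import tensorflow as tf

    class QuadraticDense(tf.keras.layers.Layer):
        """Computes sigmoid(U(X^2) + WX + B), with X^2 taken element-wise."""

        def __init__(self, units):
            super().__init__()
            self.units = units

        def build(self, input_shape):
            n_in = int(input_shape[-1])
            # U weights the element-wise square of the input, W the input itself.
            self.U = self.add_weight(shape=(n_in, self.units), initializer="glorot_uniform")
            self.W = self.add_weight(shape=(n_in, self.units), initializer="glorot_uniform")
            self.B = self.add_weight(shape=(self.units,), initializer="zeros")

        def call(self, x):
            return tf.sigmoid(tf.matmul(tf.square(x), self.U) + tf.matmul(x, self.W) + self.B)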

What does varying this inner function accomplish? Is there a practical application in which changing the inner function would improve the model's learning, or would any function that combines the input with some weights have the same learning capabilities, no matter what data is being learned?

I'm aware of many other kinds of neural nets (such as convolutional nets, recurrent nets, residual nets, etc.), so I'm not looking for an explanation of different kinds of nets (unless, of course, a certain type of net directly applies what I'm talking about). I'm mostly interested in the simple fully connected scenario.

Marcin Możejko
Michael Hackman
  • Interesting: https://en.wikipedia.org/wiki/Activation_function. There are a lot of different activation functions. https://stats.stackexchange.com/questions/115258/comprehensive-list-of-activation-functions-in-neural-networks-with-pros-cons – Thomas Wagenaar May 16 '17 at 09:07

2 Answers


In theory, both methods have exactly the same potential and can approximate any continuous target function, given enough layers and enough training time / data. Using sigmoid(U(X^2) + WX + B) makes each layer more expressive, but also harder to train well (especially without overfitting), so if you use it, you should put fewer layers in your network to avoid overfitting.

Overall, choosing between the first method and the second one with fewer layers is mainly a matter of experience: on your problems, one may work better than the other, but it is impossible to know which in theory. If your target function is close to polynomial, the second solution is probably better. Otherwise, if you don't want to train both versions with different network sizes, I would go for the first solution, for several reasons:

  • only linear (affine) functions are involved inside the activation, which gives gradients that are easier to compute, so it may be faster
  • research in recent years seems to indicate that, in practice, deep networks are often better than shallow ones with bigger layers (though not in all cases)
  • it's the common practice

In terms of total running time, I have no idea which would be better (considering that you would be using fewer layers with the second option).
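
As a rough illustration of that size trade-off (the layer widths below are invented, not taken from the question), you can compare a three-layer plain network with a two-layer "quadratic" one just by counting parameters:

    # Back-of-the-envelope parameter counts; layer sizes are made up for illustration.

    def dense_params(n_in, n_out):
        # W has n_in * n_out entries, B has n_out.
        return n_in * n_out + n_out

    def quadratic_dense_params(n_in, n_out):
        # U and W each have n_in * n_out entries, plus B.
        return 2 * n_in * n_out + n_out

    # Three plain sigmoid layers vs. two "quadratic" layers:
    plain = dense_params(784, 256) + dense_params(256, 256) + dense_params(256, 10)
    quad = quadratic_dense_params(784, 256) + quadratic_dense_params(256, 10)
    print(plain, quad)  # 269322 vs 406794 -- the same order of magnitude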

gdelab

So basically there are three important factors to consider for your problem:

  1. Computational complexity and stability: computing your function should, in theory, take more time, since at least two more operations are needed. In this case I don't think it's a problem, but as a comparison, the sigmoid, which requires both an exp and a division, is much more costly to compute than relu, which was one of the reasons relu became so popular. Moreover, since the square function diverges very quickly and the saturation of the sigmoid is a well-known problem, you might also suffer more severely from vanishing gradients and numerical over/underflow.
  2. Number of parameters: the second approach attaches an additional set of parameters to every unit. When your model is small this isn't a huge problem, but since neural nets are used for really memory- and time-consuming tasks, this can be a serious downside of the second activation. This is also part of the reason why very simple functions are favored in deep learning.
  3. Expressive power: this is where your second function could actually help, and not only because the square makes the function more complex. It also makes the activation asymptotically bell-shaped, which could make it better at capturing local dependencies. This can be a real weakness of both sigmoid and relu, since both of these functions make every unit have a global influence on your prediction, whereas bell-shaped functions tend to favor local dependencies without affecting data points that lie outside their regions of interest. In practice this problem is usually solved by applying a really deep and wide topology, which, given a huge dataset, usually balances out the influence of individual units (see the small sketch below).
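
As a tiny illustration of the third point (my own toy numbers, with the quadratic weight chosen negative so that the bell shape appears):

    import math

    def quad_sigmoid(x, u=-1.0, w=0.0, b=2.0):
        # sigmoid(u*x^2 + w*x + b) for a single scalar unit.
        return 1.0 / (1.0 + math.exp(-(u * x * x + w * x + b)))

    for x in [-4, -2, 0, 2, 4]:
        print(x, round(quad_sigmoid(x), 3))
    # Prints roughly 0.0, 0.119, 0.881, 0.119, 0.0 -- the unit responds only
    # near x = 0, unlike sigmoid(w*x + b), which stays high on one whole side
    # of the input space.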
Marcin Możejko