First, let's talk about the structure of the network regardless of activation function. For any activation y_i = f(w_i · x), the argument is the dot-product of the weight vector w_i and the input x, computed before the function is ever applied. So one convenient way to think about each layer is as a linear transform of its input vector: Wx. To go from a 20-dimensional feature vector x to a 10-dimensional output and then a 5-dimensional output, we need two matrices: a 10x20 (let's call Hidden Layer 1 W_1) and a 5x10 (let's call Hidden Layer 2 W_2). It follows that the input layer (W_0) would just be a 20x20 diagonal matrix, with the diagonal holding the weight applied to each individual input. So, in a sense, the 5x1 output Y can be thought of as Y = W_2 W_1 W_0 x. Framed this way, you can immediately read off the number of parameters needed (in this example, 20 + 200 + 50 = 270).
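As a quick sanity check on those shapes and the parameter count, here's a minimal NumPy sketch. The layer sizes are just the ones from the example above, and the weights are random placeholders rather than anything trained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Input layer as a 20x20 diagonal matrix: one weight per input feature.
W0 = np.diag(rng.standard_normal(20))
# Hidden Layer 1: 20-dim input -> 10-dim output.
W1 = rng.standard_normal((10, 20))
# Hidden Layer 2: 10-dim input -> 5-dim output.
W2 = rng.standard_normal((5, 10))

x = rng.standard_normal(20)      # a 20-dimensional feature vector
y = W2 @ W1 @ W0 @ x             # ignoring activations, the whole stack is one linear map

print(y.shape)                   # (5,)
# Parameter count: 20 diagonal entries + 10*20 + 5*10
print(20 + W1.size + W2.size)    # 270
```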
There's a lot of debate about which activation functions are superior, or at least there was when I first started researching ANNs. One thing to keep in mind is that every activation function comes with trade-offs: each has advantages given certain conditioning of the input vector, but also costs in overall computational complexity and in reduced sensitivity in the face of large-magnitude weights. For example, with tanh() as the activation function, if a single weight's magnitude is in excess of 100x the others, the back-propagation error delta will shift all of the other weights of that node drastically while having virtually no effect on that singular weight. This can be quite problematic, as you become susceptible to training into a local minimum. Additionally, d/dx[tanh(x)] isn't especially cheap to evaluate when considering GPGPU acceleration. But (for as much as I've knocked that function) it's actually quite effective when dealing with frequency-domain or exponentially-correlated features.
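To make the reduced-sensitivity point concrete, here's a small sketch with made-up weights, one of them roughly 100x the rest. It shows how tanh saturates and its derivative, 1 - tanh(z)^2, collapses once a single dominant weight pushes the node's pre-activation far from zero; every per-weight gradient in back-propagation gets scaled by that near-zero factor:

```python
import numpy as np

x = np.array([0.5, -0.3, 0.8, 0.2])             # inputs to one node (made-up values)
w_balanced = np.array([0.4, -0.7, 0.3, 0.5])    # weights of comparable magnitude
w_skewed   = np.array([0.4, -0.7, 0.3, 50.0])   # one weight ~100x the others

for name, w in (("balanced", w_balanced), ("skewed", w_skewed)):
    z = w @ x                         # pre-activation (dot product)
    dtanh = 1.0 - np.tanh(z) ** 2     # local derivative used by back-propagation
    # Every dE/dw_i for this node is (error delta) * dtanh * x_i, so a
    # saturated node passes back almost no useful gradient.
    print(f"{name:9s} z={z:+7.2f}  tanh(z)={np.tanh(z):+.4f}  dtanh/dz={dtanh:.2e}")
```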
So, what shapes will the weights take? This isn't an easy question to answer, because it's predicated on:
- Structure of your network
- Activation function employed
- Back-propagation heuristics and architectural constraints (e.g., a CNN's shared convolutional weights instead of general fully-connected BP)
- Underlying patterns in your training set
That final one is the most important, and fortunately it's easy to check for underlying shape/structure before training. As a best practice, consider running Principal Component Analysis (PCA) on your training set first. If you find that the vast majority of your set can be reasonably represented by a very small subset of principal components, there's a strong likelihood that a well-trained network will appear sparse (or even banded) in the earliest hidden layers.
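Here's a rough sketch of that check, using scikit-learn and a randomly generated low-rank stand-in for a real training set (the data, the 95% threshold, and the sizes are all illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in training set: 1000 samples of a 20-dimensional feature vector that
# secretly lives (mostly) in a 3-dimensional subspace, plus a little noise.
latent = rng.standard_normal((1000, 3))
mixing = rng.standard_normal((3, 20))
X = latent @ mixing + 0.05 * rng.standard_normal((1000, 20))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.95)) + 1   # components needed for 95% of the variance

print(f"{k} of {X.shape[1]} components explain 95% of the variance")
# If k is much smaller than the input dimension, expect the earliest hidden
# layers of a well-trained network to come out sparse (or even banded).
```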