First, let's talk about the structure of the network regardless of activation function. For any activation y_i = f(w_i · x), the argument is the dot-product of the weight vector w_i and the input x, computed before the function is ever applied. So one convenient way to think about each layer is as a linear transform of its input vector: Wx. To go from a 20-dimensional feature vector x to a 10-dimensional output and then a 5-dimensional output, we need two matrices: a 10x20 (let's call Hidden Layer 1 W_1) and a 5x10 (let's call Hidden Layer 2 W_2). It follows that the input layer (W_0) would just be a 20x20 diagonal matrix, with the diagonal holding the weight applied to each individual input. So, in a sense, the 5x1 output Y can be thought of as Y = W_2 W_1 W_0 x. Framed this way, you can immediately read off the number of parameters needed (in this example, 20 + 200 + 50 = 270).
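As a quick sanity check on those shapes and the parameter count, here's a minimal NumPy sketch. The layer sizes are just the ones from the example above, and the weights are random placeholders rather than anything trained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Input layer as a 20x20 diagonal matrix: one weight per input feature.
W0 = np.diag(rng.standard_normal(20))
# Hidden Layer 1: 20-dim input -> 10-dim output.
W1 = rng.standard_normal((10, 20))
# Hidden Layer 2: 10-dim input -> 5-dim output.
W2 = rng.standard_normal((5, 10))

x = rng.standard_normal(20)      # a 20-dimensional feature vector
y = W2 @ W1 @ W0 @ x             # ignoring activations, the whole stack is one linear map

print(y.shape)                   # (5,)
# Parameter count: 20 diagonal entries + 10*20 + 5*10
print(20 + W1.size + W2.size)    # 270
```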
There's a lot of debate about which activation functions are superior, or at least there was when I first started researching ANNs. One thing to keep in mind is that every activation function comes with trade-offs: each has advantages given certain conditioning of the input vector, but also costs in overall computational complexity and in reduced sensitivity in the face of large-magnitude weights. For example, with tanh() as the activation function, if a single weight's magnitude is in excess of 100x the others, the back-propagation error delta will shift all of the other weights of that node drastically while having virtually no effect on that singular weight. This can be quite problematic, as you become susceptible to training into a local minimum. Additionally, d/dx[tanh(x)] isn't especially cheap to evaluate when considering GPGPU acceleration. But (for as much as I've knocked that function) it's actually quite effective when dealing with frequency-domain or exponentially-correlated features.
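To make the reduced-sensitivity point concrete, here's a small sketch with made-up weights, one of them roughly 100x the rest. It shows how tanh saturates and its derivative, 1 - tanh(z)^2, collapses once a single dominant weight pushes the node's pre-activation far from zero; every per-weight gradient in back-propagation gets scaled by that near-zero factor:

```python
import numpy as np

x = np.array([0.5, -0.3, 0.8, 0.2])             # inputs to one node (made-up values)
w_balanced = np.array([0.4, -0.7, 0.3, 0.5])    # weights of comparable magnitude
w_skewed   = np.array([0.4, -0.7, 0.3, 50.0])   # one weight ~100x the others

for name, w in (("balanced", w_balanced), ("skewed", w_skewed)):
    z = w @ x                         # pre-activation (dot product)
    dtanh = 1.0 - np.tanh(z) ** 2     # local derivative used by back-propagation
    # Every dE/dw_i for this node is (error delta) * dtanh * x_i, so a
    # saturated node passes back almost no useful gradient.
    print(f"{name:9s} z={z:+7.2f}  tanh(z)={np.tanh(z):+.4f}  dtanh/dz={dtanh:.2e}")
```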
So, what shapes will the weights take? This isn't an easy question to answer, because it's predicated on:
- Structure of your network
- Activation function employed
- Back-propagation heuristics and architectural constraints (e.g., a CNN's shared convolutional weights instead of general fully-connected BP)
- Underlying patterns in your training set
That final one is the most important, and fortunately it's easy to check for underlying shape/structure before training. As a best practice, consider running Principal Component Analysis (PCA) on your training set first. If you find that the vast majority of your set can be reasonably represented by a very small subset of principal components, there's a strong likelihood that a well-trained network will appear sparse (or even banded) in the earliest hidden layers.
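Here's a rough sketch of that check, using scikit-learn and a randomly generated low-rank stand-in for a real training set (the data, the 95% threshold, and the sizes are all illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in training set: 1000 samples of a 20-dimensional feature vector that
# secretly lives (mostly) in a 3-dimensional subspace, plus a little noise.
latent = rng.standard_normal((1000, 3))
mixing = rng.standard_normal((3, 20))
X = latent @ mixing + 0.05 * rng.standard_normal((1000, 20))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.95)) + 1   # components needed for 95% of the variance

print(f"{k} of {X.shape[1]} components explain 95% of the variance")
# If k is much smaller than the input dimension, expect the earliest hidden
# layers of a well-trained network to come out sparse (or even banded).
```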