I am using several residual blocks followed by a head part in my model. I tried two different architectures and found that they differ only in where the layer normalization sits in the head.
This is my model structure:
input → residual blocks → residual blocks → head part;
Each residual block comprises the following operations (a code sketch follows the diagram):
→ conv1d → layernorm → conv1d → GELU → +
|______________________________________|^
skip connection
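A minimal PyTorch-style sketch of one residual block; the kernel size, padding, and the (batch, channels, length) layout here are illustrative assumptions, not necessarily my exact settings:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels: int, kernel_size: int = 3):
            super().__init__()
            # conv1d -> layernorm -> conv1d -> GELU, plus a skip connection
            self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
            self.norm = nn.LayerNorm(channels)
            self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
            self.act = nn.GELU()

        def forward(self, x):  # x: (batch, channels, length)
            y = self.conv1(x)
            # nn.LayerNorm normalizes the last dimension, so move channels last and back
            y = self.norm(y.transpose(1, 2)).transpose(1, 2)
            y = self.act(self.conv2(y))
            return x + y  # skip connection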
The following are the two different head parts of the model.
The first is:
max (along the sequence dimension) → LayerNorm → Linear(in_channels, num_classes)
The second is:
max (along the sequence dimension) → Linear(in_channels, in_channels) → LayerNorm → Linear(in_channels, num_classes)
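In the same sketch style, the two heads; num_classes and the max over the last dimension are the only assumptions here:

    import torch
    import torch.nn as nn

    class HeadV1(nn.Module):
        # max (along sequence) -> LayerNorm -> Linear(in_channels, num_classes)
        def __init__(self, in_channels: int, num_classes: int):
            super().__init__()
            self.norm = nn.LayerNorm(in_channels)
            self.fc = nn.Linear(in_channels, num_classes)

        def forward(self, x):  # x: (batch, channels, length)
            pooled = x.max(dim=-1).values  # (batch, channels)
            return self.fc(self.norm(pooled))

    class HeadV2(nn.Module):
        # max (along sequence) -> Linear(in_channels, in_channels) -> LayerNorm -> Linear(in_channels, num_classes)
        def __init__(self, in_channels: int, num_classes: int):
            super().__init__()
            self.fc1 = nn.Linear(in_channels, in_channels)
            self.norm = nn.LayerNorm(in_channels)
            self.fc2 = nn.Linear(in_channels, num_classes)

        def forward(self, x):  # x: (batch, channels, length)
            pooled = x.max(dim=-1).values
            return self.fc2(self.norm(self.fc1(pooled)))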
The result: the first model reaches 100% accuracy on the training set and 84% on the test set, while the second model stays at 0.1% accuracy on both the training and test sets.
I cannot understand why putting a linear layer between the max operation and the LayerNorm has such a strong negative effect on the model's performance.
In other words, if I put the LayerNorm directly after the max operation, the model always achieves a competitive result no matter how many Linear layers follow it, e.g. max (along the sequence dimension) → LayerNorm → Linear → LayerNorm → Linear → ... → Linear. However, if a linear transformation follows the max pooling operation directly, before any LayerNorm, the model can't learn anything.
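For concreteness, a sketch of the kind of head that does train for me; num_hidden is a placeholder I introduce for however many Linear/LayerNorm pairs are stacked:

    import torch.nn as nn

    class DeepHead(nn.Module):
        # max (along sequence) -> LayerNorm -> [Linear -> LayerNorm] * num_hidden -> Linear
        def __init__(self, in_channels: int, num_classes: int, num_hidden: int = 2):
            super().__init__()
            layers = [nn.LayerNorm(in_channels)]
            for _ in range(num_hidden):
                layers += [nn.Linear(in_channels, in_channels), nn.LayerNorm(in_channels)]
            layers.append(nn.Linear(in_channels, num_classes))
            self.head = nn.Sequential(*layers)

        def forward(self, x):  # x: (batch, channels, length)
            return self.head(x.max(dim=-1).values)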
Why must I keep this order (max, then LayerNorm) for the model to be trainable? Or does something about the operations inside the residual blocks affect it?
I hope somebody can explain the reason behind this.