I am using several residual blocks followed by a head part in my model. I tried two different architectures and found that they differ only in where the layer normalization sits in the head.
This is my model structure:
input → residual blocks → residual blocks → head part;
Each residual block comprises the following operations (a code sketch follows the diagram):
→ conv1d → layernorm → conv1d → GELU → +
|______________________________________|^
skip connection
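A minimal PyTorch-style sketch of one residual block; the kernel size, padding, and the (batch, channels, length) layout here are illustrative assumptions, not necessarily my exact settings:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels: int, kernel_size: int = 3):
            super().__init__()
            # conv1d -> layernorm -> conv1d -> GELU, plus a skip connection
            self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
            self.norm = nn.LayerNorm(channels)
            self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
            self.act = nn.GELU()

        def forward(self, x):  # x: (batch, channels, length)
            y = self.conv1(x)
            # nn.LayerNorm normalizes the last dimension, so move channels last and back
            y = self.norm(y.transpose(1, 2)).transpose(1, 2)
            y = self.act(self.conv2(y))
            return x + y  # skip connection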
The following are the two different head parts of the model.
The first is:
max (along the sequence dimension) → LayerNorm → Linear(in_channels, num_classes)
The second is:
max (along the sequence dimension) → Linear(in_channels, in_channels) → LayerNorm → Linear(in_channels, num_classes)
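In the same sketch style, the two heads; num_classes and the max over the last dimension are the only assumptions here:

    import torch
    import torch.nn as nn

    class HeadV1(nn.Module):
        # max (along sequence) -> LayerNorm -> Linear(in_channels, num_classes)
        def __init__(self, in_channels: int, num_classes: int):
            super().__init__()
            self.norm = nn.LayerNorm(in_channels)
            self.fc = nn.Linear(in_channels, num_classes)

        def forward(self, x):  # x: (batch, channels, length)
            pooled = x.max(dim=-1).values  # (batch, channels)
            return self.fc(self.norm(pooled))

    class HeadV2(nn.Module):
        # max (along sequence) -> Linear(in_channels, in_channels) -> LayerNorm -> Linear(in_channels, num_classes)
        def __init__(self, in_channels: int, num_classes: int):
            super().__init__()
            self.fc1 = nn.Linear(in_channels, in_channels)
            self.norm = nn.LayerNorm(in_channels)
            self.fc2 = nn.Linear(in_channels, num_classes)

        def forward(self, x):  # x: (batch, channels, length)
            pooled = x.max(dim=-1).values
            return self.fc2(self.norm(self.fc1(pooled)))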
The result: the first model reaches 100% accuracy on the training set and 84% on the test set, while the second model stays at 0.1% accuracy on both the training and test sets.
I cannot understand why putting a linear layer between the max operation and the LayerNorm has such a strong negative effect on the model's performance.
In other words, if I put the LayerNorm directly after the max operation, the model always achieves a competitive result no matter how many Linear layers follow it, e.g. max (along the sequence dimension) → LayerNorm → Linear → LayerNorm → Linear → ... → Linear. However, if a linear transformation follows the max pooling operation directly, before any LayerNorm, the model can't learn anything.
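For concreteness, a sketch of the kind of head that does train for me; num_hidden is a placeholder I introduce for however many Linear/LayerNorm pairs are stacked:

    import torch.nn as nn

    class DeepHead(nn.Module):
        # max (along sequence) -> LayerNorm -> [Linear -> LayerNorm] * num_hidden -> Linear
        def __init__(self, in_channels: int, num_classes: int, num_hidden: int = 2):
            super().__init__()
            layers = [nn.LayerNorm(in_channels)]
            for _ in range(num_hidden):
                layers += [nn.Linear(in_channels, in_channels), nn.LayerNorm(in_channels)]
            layers.append(nn.Linear(in_channels, num_classes))
            self.head = nn.Sequential(*layers)

        def forward(self, x):  # x: (batch, channels, length)
            return self.head(x.max(dim=-1).values)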
Why must I keep this order (max, then LayerNorm) for the model to be trainable? Or does something about the operations inside the residual blocks affect it?
I hope somebody can explain the reason behind this.