
I am using several residual blocks followed by a head part in my model. I tried two different architectures and found that the models behave very differently depending on where the layer normalization is placed in the head.

This is my model structure:

input → residual blocks → residual blocks → head part;

Each residual block comprises the following operations:

→ conv1d → layernorm → conv1d → GELU → +
|______________________________________|^
            skip connection
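
A minimal PyTorch sketch of what I mean by one residual block (the kernel size, padding, and the axis the LayerNorm normalizes over are illustrative choices on my part, not fixed above):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch: conv1d -> LayerNorm -> conv1d -> GELU -> + skip (details assumed)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2  # keep the sequence length unchanged
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.norm = nn.LayerNorm(channels)  # normalizes over the channel dimension
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.act = nn.GELU()

    def forward(self, x):  # x: (batch, channels, seq_len)
        y = self.conv1(x)
        # LayerNorm expects the normalized dim last, so transpose to (batch, seq_len, channels)
        y = self.norm(y.transpose(1, 2)).transpose(1, 2)
        y = self.conv2(y)
        y = self.act(y)
        return x + y  # skip connection
```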

The following are the two different head parts of the model (sketched in code right after the list):

  1. The first is:

    max (along the sequence dimension) → layerNorm → Linear (in channels, num of classes)

  2. The second is:

    max (along the sequence dimension) → Linear (in channels, in channels) → LayerNorm → Linear (in channels, number of classes)
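
A minimal PyTorch sketch of the two heads (the class names and tensor shapes are my assumptions; the input is assumed to be (batch, channels, seq_len)):

```python
import torch.nn as nn

class HeadV1(nn.Module):
    """Head 1: max over the sequence dim -> LayerNorm -> Linear(channels, num_classes)."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x):  # x: (batch, channels, seq_len)
        pooled = x.max(dim=-1).values  # (batch, channels)
        return self.fc(self.norm(pooled))


class HeadV2(nn.Module):
    """Head 2: max over the sequence dim -> Linear(channels, channels) -> LayerNorm -> Linear(channels, num_classes)."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels)
        self.norm = nn.LayerNorm(channels)
        self.fc2 = nn.Linear(channels, num_classes)

    def forward(self, x):  # x: (batch, channels, seq_len)
        pooled = x.max(dim=-1).values  # (batch, channels)
        return self.fc2(self.norm(self.fc1(pooled)))
```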

The result: the first model achieves 100% accuracy on the training set and 84% on the test set. The second model achieves only 0.1% on both the training and test sets.

I cannot understand why putting a Linear layer between the max operation and the LayerNorm has such a negative effect on the model's performance.

In other words, if I put the LayerNorm directly after the max operation, the model always achieves a competitive result, no matter how many Linear layers follow it.

E.g., max (along the sequence dimension) → LayerNorm → Linear → LayerNorm → Linear → ... → Linear works. However, if a Linear transformation follows the max-pooling operation without a LayerNorm in between, the model can't learn anything.
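
For illustration, a sketch of the kind of deeper head that still trains for me, with the LayerNorm kept right after the max pooling (the function name and depth knob are made up):

```python
import torch.nn as nn

def make_deep_head(channels: int, num_classes: int, num_hidden: int = 2) -> nn.Sequential:
    # Illustrative only: LayerNorm directly after max pooling, then alternating
    # Linear/LayerNorm pairs, ending with the classification Linear.
    layers = [nn.LayerNorm(channels)]
    for _ in range(num_hidden):
        layers += [nn.Linear(channels, channels), nn.LayerNorm(channels)]
    layers.append(nn.Linear(channels, num_classes))
    return nn.Sequential(*layers)

# usage on features of shape (batch, channels, seq_len):
# logits = make_deep_head(channels, num_classes)(features.max(dim=-1).values)
```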

Why do I have to keep this order (max, then LayerNorm) for the model to be trainable? Or do the operations inside the residual blocks perhaps affect it?

I hope somebody can explain the reason behind it.

