
I want to tie the weights of the embedding layer and the next-word prediction layer of the decoder. The embedding dimension is set to 300 and the hidden size of the decoder is set to 600. The vocabulary size of the target language in NMT is 50000, so the embedding weight matrix is 50000 x 300 and the weight of the linear layer that predicts the next word is 50000 x 600.

So, how can I tie them? What will be the best approach to achieve weight tying in this scenario?

Wasi Ahmad

3 Answers


Weight tying: sharing the weight matrix between the input-to-embedding layer and the output-to-softmax layer. That is, instead of using two weight matrices, we use only one. The intuition behind doing so is to combat the problem of overfitting, so weight tying can be considered a form of regularization.

This has been implemented in the word language model in the PyTorch examples.
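The relevant part of that example boils down to a dimension check plus a one-line weight assignment. Below is a minimal sketch along those lines; the argument names `ntoken`, `ninp`, `nhid` mirror the example, and the rest is simplified for illustration:

    import torch.nn as nn

    class RNNModel(nn.Module):
        """Minimal RNN language model with optional weight tying."""
        def __init__(self, ntoken, ninp, nhid, tie_weights=True):
            super().__init__()
            self.encoder = nn.Embedding(ntoken, ninp)        # input embedding: ntoken x ninp
            self.rnn = nn.LSTM(ninp, nhid, batch_first=True)
            self.decoder = nn.Linear(nhid, ntoken)           # output layer: ntoken x nhid
            if tie_weights:
                # tying is only possible when the two matrices have the same shape
                if nhid != ninp:
                    raise ValueError('When tying weights, nhid must be equal to ninp')
                self.decoder.weight = self.encoder.weight    # one shared ntoken x ninp matrix

        def forward(self, x, hidden=None):
            output, hidden = self.rnn(self.encoder(x), hidden)
            return self.decoder(output), hidden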

kmario23
  • I have seen that example and I know the things you mentioned. I want to know, particularly in the scenario I mentioned, what is the best approach to tie weights? Please note the shapes; tying in my case is not straightforward. – Wasi Ahmad Mar 15 '18 at 23:23
  • I think the real intuition is that they are theoretically the same, i.e. a projection from and to a one-hot representation. "In both matrices, we expect rows that correspond to similar words to be similar: for the input embedding, we would like the network to react similarly to synonyms, while in the output embedding, we would like the scores of words that are interchangeable to be similar" https://www.aclweb.org/anthology/E17-2025.pdf – Maverick Meerkat Nov 26 '19 at 16:39

You could use a linear layer to project the 600-dimensional decoder state down to 300 before applying the shared projection. This way you still get the advantage that the entire embedding (possibly) has a non-zero gradient for each mini-batch, but at the cost of slightly increasing the capacity of the network.
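A rough sketch of that approach for the shapes in the question (vocab 50000, embedding 300, decoder hidden 600); the class and attribute names here are made up for illustration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TiedGenerator(nn.Module):
        """Projects the decoder state to the embedding size,
        then reuses the embedding matrix as the output weights."""
        def __init__(self, vocab_size=50000, emb_dim=300, hidden_dim=600):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)  # 50000 x 300, shared with the output layer
            self.proj = nn.Linear(hidden_dim, emb_dim)           # 600 -> 300, the extra down-projection
            self.out_bias = nn.Parameter(torch.zeros(vocab_size))

        def forward(self, decoder_hidden):
            # decoder_hidden: (batch, 600) -> (batch, 300)
            projected = self.proj(decoder_hidden)
            # tied output layer: logits = projected @ embedding.weight^T + bias -> (batch, 50000)
            return F.linear(projected, self.embedding.weight, self.out_bias)

    logits = TiedGenerator()(torch.randn(8, 600))   # -> torch.Size([8, 50000])

Because `F.linear` uses `self.embedding.weight` directly, the embedding and the output layer remain a single shared parameter tensor; only the 600 x 300 projection is added on top.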

pups

Did you check the code that kmario23 shared? It raises an exception if the hidden size and the embedding size are not equal. So, if you really want to tie the weights, you should decrease the hidden size of your decoder to 300.

On the other hand, if you rethink your idea, what you really want is to drop the weight tying. Why? Because the transformation you need requires another matrix.
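As a back-of-the-envelope comparison using the shapes from the question: an untied output layer is 50000 x 600 = 30,000,000 parameters; tying through a 600 -> 300 projection reuses the 50000 x 300 = 15,000,000 embedding weights and adds only 600 x 300 = 180,000 projection weights; tying directly with a 300-dimensional decoder hidden state adds no extra weights at all.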

Kadir Gunel