
I’m trying to understand the Synthesizer paper (https://arxiv.org/pdf/2005.00743.pdf 1) and there’s a description of the dense synthesizer mechanism that should replace the traditional attention model as described in the Transformer architecture.

[Figure from the paper comparing standard Transformer attention with the Synthesizer model]

The Dense Synthesizer is described as such:

[Equations (1)–(3) from the paper: B_i = F(X_i), where F(.) is a two-layer feed-forward function with a ReLU that projects each X_i from d to l dimensions, and Y = Softmax(B) G(X)]

So I tried to implement the layer. It looks like this, but I'm not sure whether I'm getting it right:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizer(nn.Module):
    def __init__(self, l, d):
        super(DenseSynthesizer, self).__init__()
        self.linear1 = nn.Linear(d, l)
        self.linear2 = nn.Linear(l, l)

    def forward(self, x, v):
        # Equations (1) and (2): B = F(X)
        # Shape: l x l
        b = self.linear2(F.relu(self.linear1(x)))
        # Equation (3): Y = Softmax(B) G(X)
        # [l x l] x [l x d] -> [l x d]
        return torch.matmul(F.softmax(b, dim=-1), v)

Usage:

l, d = 4, 5

x, v = torch.rand(l, d), torch.rand(l, d)

synthesis = DenseSynthesizer(l, d)
synthesis(x, v) 

Example:

x and v are tensors:

x = tensor([[0.0844, 0.2683, 0.4299, 0.1827, 0.1188],
            [0.2793, 0.0389, 0.3834, 0.9897, 0.4197],
            [0.1420, 0.8051, 0.1601, 0.3299, 0.3340],
            [0.8908, 0.1066, 0.1140, 0.7145, 0.3619]])

v = tensor([[0.3806, 0.1775, 0.5457, 0.6746, 0.4505],
            [0.6309, 0.2790, 0.7215, 0.4283, 0.5853],
            [0.7548, 0.6887, 0.0426, 0.1057, 0.7895],
            [0.1881, 0.5334, 0.6834, 0.4845, 0.1960]])

Passing them through a forward pass of the dense synthesizer returns:

>>> synthesis = DenseSynthesizer(l, d)
>>> synthesis(x, v) 

tensor([[0.5371, 0.4528, 0.4560, 0.3735, 0.5492],
        [0.5426, 0.4434, 0.4625, 0.3770, 0.5536],
        [0.5362, 0.4477, 0.4658, 0.3769, 0.5468],
        [0.5430, 0.4461, 0.4559, 0.3755, 0.5551]], grad_fn=<MmBackward>)

Is the implementation and understanding of the dense synthesizer correct?

Theoretically, how is that different from a multi-layer perceptron that takes in two different inputs and makes use of them at different points in the forward propagation?

alvas
  • Also asked on https://discuss.pytorch.org/t/implementation-of-the-dense-synthesizer/79783 – alvas May 06 '20 at 08:34
  • It looks correct. And yes, it is just an MLP. Not sure if this is the kind of answer you expect? – BlackBear May 08 '20 at 12:16
  • The problem is in the notation in eq. 2. At least for me, it is not clear how they perform the linear transformation. It could be `(d, l)` and `(l, l)`, just like in your implementation, or `(d, d)` and `(d, l)`. I think the latter might work better. – Mohammad Arvan May 09 '20 at 19:56
  • Also, `(d, d)` and `(d, l)` is more aligned with `QK^t` attention. – Mohammad Arvan May 09 '20 at 20:19
  • @MohammadArvan I think you are right on the (d,d)->(d,l); this matches the number of parameters of the dense variant in Table 1. – BlackBear May 15 '20 at 09:27

1 Answer


Is the implementation and understanding of the dense synthesizer correct?

Not exactly: `linear1 = nn.Linear(d, d)` according to the paper, not `(d, l)`. Of course, written as plain matrix notation, a `(d, d)` weight cannot be multiplied with the whole matrix `X` of shape `(l, d)`; it has to act on each row.

This is because :

[Equations from the paper: B_i = F(X_i), where F(.) is a parameterized function projecting each input X_i from d dimensions to l dimensions]

So F(.) is applied to each X_i in X, for i in [1, l].

The resulting matrix B is then passed through the softmax and multiplied by G(X). So you'd have to modify your code so the input is processed one X_i at a time, and then use the resulting matrix to compute Y.
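As a rough sketch of that reading, here is what the layer could look like; the `(d, d)` followed by `(d, l)` shapes come from the discussion in the comments, so treat them as an assumption rather than something the paper spells out. Since `nn.Linear` acts on the last dimension, it already applies F(.) to each row X_i without an explicit loop:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizerSketch(nn.Module):
    def __init__(self, l, d):
        super().__init__()
        self.linear1 = nn.Linear(d, d)  # first projection of each X_i: d -> d
        self.linear2 = nn.Linear(d, l)  # second projection of each X_i: d -> l

    def forward(self, x, v):
        # B_i = F(X_i) = W2(ReLU(W1 X_i + b1)) + b2, computed row by row: (l, d) -> (l, l)
        b = self.linear2(F.relu(self.linear1(x)))
        # Y = Softmax(B) G(X), with v standing in for G(X): (l, l) x (l, d) -> (l, d)
        return torch.matmul(F.softmax(b, dim=-1), v)

l, d = 4, 5
x, v = torch.rand(l, d), torch.rand(l, d)
y = DenseSynthesizerSketch(l, d)(x, v)  # shape (l, d)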

How is that different from a multi-layer perceptron that takes in two different inputs and makes use of them at different points in the forward propagation?

To understand this, we need some context: the attention mechanism was first introduced in the encoder-decoder setting in https://arxiv.org/pdf/1409.0473.pdf

The core idea is to let the model control how the context vector from the encoder is built, using a neural network, instead of relying solely on the last encoded state:

[Figure: the decoder attending over the encoder hidden states to build the context vector]

see this post for more detail.
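To make that concrete, here is a minimal sketch of additive attention in the style of that paper; the module name, dimensions and variable names are mine, chosen purely for illustration:

import torch
import torch.nn as nn

class AdditiveAttentionSketch(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim)  # scores the encoder states
        self.W_s = nn.Linear(dec_dim, attn_dim)  # scores the current decoder state
        self.v = nn.Linear(attn_dim, 1)          # collapses each score to a scalar

    def forward(self, enc_states, dec_state):
        # e_j = v^T tanh(W_h h_j + W_s s): one score per source position
        scores = self.v(torch.tanh(self.W_h(enc_states) + self.W_s(dec_state))).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)    # attention weights over the source positions
        # the context vector is a weighted sum of all encoder states, not just the last one
        return torch.matmul(alpha, enc_states)

enc_states = torch.rand(6, 16)  # 6 source positions, encoder hidden size 16
dec_state = torch.rand(8)       # current decoder hidden state, size 8
context = AdditiveAttentionSketch(16, 8, 10)(enc_states, dec_state)  # shape (16,)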

The Transformer then introduced the idea of "Multi-Head Attention" (see the figures below) to reduce the computational burden and focus solely on the attention mechanism itself:

https://arxiv.org/pdf/1706.03762.pdf

[Figures from the Transformer paper: scaled dot-product attention and multi-head attention]
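For reference, a single-head version of the scaled dot-product attention those figures describe is sketched below (shapes chosen only for illustration):

import math
import torch

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # (l, l) token-token scores
    return torch.matmul(torch.softmax(scores, dim=-1), v)           # (l, d_v)

l, d_k, d_v = 4, 8, 5
q, k, v = torch.rand(l, d_k), torch.rand(l, d_k), torch.rand(l, d_v)
out = scaled_dot_product_attention(q, k, v)  # shape (l, d_v)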

So where does the Dense Synthesizer fit into all of this?

It simply replaces the dot product Q K^T (as illustrated in the first picture in your post) with F(X). If you replace what's inside the softmax with F(X), you get the equation for Y:

[Equation from the paper: Y = Softmax(B) G(X)]
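Put side by side (under the same hedged reading of F(.) as the sketch above), the only thing that changes is what gets fed to the softmax:

import math
import torch

l, d = 4, 5
x = torch.rand(l, d)
w_q, w_k, w_v = (torch.rand(d, d) for _ in range(3))

# Vanilla Transformer attention: the scores inside the softmax come from a token-token dot product.
q, k, v = x @ w_q, x @ w_k, x @ w_v
y_attention = torch.softmax(q @ k.T / math.sqrt(d), dim=-1) @ v

# Dense Synthesizer: the dot product is replaced by F(X); G(X) plays the role of V.
w1, b1, w2, b2 = torch.rand(d, d), torch.rand(d), torch.rand(d, l), torch.rand(l)
f_x = torch.relu(x @ w1 + b1) @ w2 + b2                  # F(X): an (l, l) map with no Q/K interaction
y_synthesizer = torch.softmax(f_x, dim=-1) @ (x @ w_v)   # Y = softmax(F(X)) G(X)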

Conclusion

This is an MLP, but one applied step-wise to the input in the context of sequence processing.

Thank you

Yoan B. M.Sc
  • I don't think the dense synthesizer attention mechanism is applied step-wise to the input. It's applied all at once for all states in the input within the encoder/decoder block of the Transformer architecture, and not explicitly across the encoder to the decoder. Could you explain the step-wise part a little? – alvas Dec 30 '20 at 04:19
  • @alvas, it's in the definition of `F` itself: "F(.), a parameterized function, for projecting input `Xi` from `d` dimensions to `l` dimensions". `F(.)` is applied to each token in `X`, and `F(Xi) = Bi`. Otherwise the matrix shapes are incompatible for multiplication. – Yoan B. M.Sc Jan 04 '21 at 13:28