
I am trying to understand sequence-to-sequence learning with an RNN. What I understand so far is that the output of the encoder is used to condition the decoder.

Yet, I have two sources which, in my opinion, do the conditioning differently, and I would like to know which way is valid (maybe both) or whether I am missing something.

Source: Neural Network Methods for Natural Language Processing by Yoav Goldberg

As far as I understand the author, the decoder operates at every step with a state vector AND the encoder output AND the next part of the sequence. Thus, the state vector of the decoder is kept separate from the result of the encoder.

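If I translate my reading into code, it would look roughly like this minimal Keras sketch (the sizes and the RepeatVector/Concatenate construction are my own guess at the scheme, not code from the book):

    # Hypothetical sketch: the encoder summary is fed into the decoder
    # at every step, so the decoder's own state stays separate from it.
    from tensorflow.keras.layers import (Input, LSTM, Dense, Embedding,
                                         RepeatVector, Concatenate)
    from tensorflow.keras.models import Model

    src_vocab, tgt_vocab, dim, max_tgt_len = 5000, 5000, 256, 20  # made-up sizes

    enc_in = Input(shape=(None,))
    enc_emb = Embedding(src_vocab, dim)(enc_in)
    _, enc_h, _ = LSTM(dim, return_state=True)(enc_emb)  # enc_h: encoder summary

    dec_in = Input(shape=(max_tgt_len,))
    dec_emb = Embedding(tgt_vocab, dim)(dec_in)
    enc_rep = RepeatVector(max_tgt_len)(enc_h)          # copy summary to every step
    dec_x = Concatenate()([dec_emb, enc_rep])           # step input: [token, summary]
    dec_seq = LSTM(dim, return_sequences=True)(dec_x)   # decoder state starts at zero
    probs = Dense(tgt_vocab, activation='softmax')(dec_seq)

    model = Model([enc_in, dec_in], probs)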

Source: A ten-minute introduction to sequence-to-sequence learning in Keras by Francois Chollet

As far as I understand the author and the source, the decoder is provided with the encoder state as initial state. Thus, the initial state vector of the decoder is the output of the encoder, and the decoder steps depend on the encoder output only through the state vector.

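Condensed from the tutorial's code (with illustrative dimensions), the conditioning happens exactly once, via initial_state:

    # Condensed from the tutorial's pattern: the encoder states seed the
    # decoder once, via initial_state; nothing else is passed across.
    from tensorflow.keras.layers import Input, LSTM, Dense
    from tensorflow.keras.models import Model

    num_enc_tokens, num_dec_tokens, latent_dim = 70, 90, 256  # illustrative sizes

    encoder_inputs = Input(shape=(None, num_enc_tokens))
    _, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
    encoder_states = [state_h, state_c]

    decoder_inputs = Input(shape=(None, num_dec_tokens))
    decoder_seq = LSTM(latent_dim, return_sequences=True)(
        decoder_inputs, initial_state=encoder_states)  # conditioning happens here
    decoder_outputs = Dense(num_dec_tokens, activation='softmax')(decoder_seq)

    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)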

lwi

1 Answer


There are many ways to feed the encoder output into the decoder:

  • continuously feeding it into the decoder as part of every step's input,
  • using it to initialize the decoder hidden state (either directly or after a shallow transformation), or
  • concatenating it with the decoder output just before the final output prediction (see Cho et al. '14).

Generally, each extra vector you feed into your decoder scales its computational complexity in rather unfavorable terms; if, for instance, you decide to feed the encoder output E as input at each step, you increase your input dimensionality from |X| to |X| + |E|, which translates into a parameter increase of |E|·H (in the simple RNN case, i.e. not considering gating), where H is your hidden size. This increases the network's capacity but also its tendency to overfit, yet it is sometimes necessary (e.g. when you are decoding into long output sequences, where the network needs to keep being 'reminded' of what it's working on).
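To make the arithmetic concrete, here is a quick back-of-the-envelope check (a plain-Python sketch; the sizes and the rnn_params helper are made up for illustration):

    # A vanilla RNN cell has (input_dim + hidden_dim + 1) * hidden_dim weights
    # (input weights, recurrent weights, bias); gating is not considered.
    def rnn_params(input_dim, hidden_dim):
        return (input_dim + hidden_dim + 1) * hidden_dim

    X, E, H = 256, 512, 512  # token input dim, encoder output dim, hidden size
    base = rnn_params(X, H)          # decoder without the encoder output
    fed  = rnn_params(X + E, H)      # encoder output concatenated at each step
    print(fed - base)                # 262144, i.e. exactly E * H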

In any case, the formalism remains the same; the decoder is always conditioned on the encoder output, so you will always be maximizing p(y_t | y_{t-1}, ..., y_0, X) -- the difference lies in how you decide to factor the input context into your model.
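Written out in full, both schemes maximize the same factorized conditional likelihood; they differ only in where X enters each factor (through every step's input in the first scheme, only through the initial hidden state in the second):

    p(y_0, ..., y_T | X) = ∏_{t=0}^{T} p(y_t | y_{t-1}, ..., y_0, X)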

KonstantinosKokos
  • Thanks, that clears things up a little. So that means both methods are valid: the first is permanently "reminding" the decoder what it is working on, whereas in the second case it will weight that information approximately proportionally with any other sequence information. Thus, the longer the sequence, the less influential the conditioning. Is that correct? – lwi Dec 05 '18 at 10:38
  • Intuitively, yes; in the second case, the encoder output is used exactly once, as the "seed" (i.e. the initial hidden state). As the sequence progresses and the hidden state changes, this initial seed may eventually be outweighed by the network's temporal dynamics. – KonstantinosKokos Dec 05 '18 at 10:46