
For some self-studying, I'm trying to implement a simple sequence-to-sequence model using Keras. While I get the basic idea and there are several tutorials available online, I still struggle with some basic concepts when looking at these tutorials:

  • Keras Tutorial: I've tried to adapt this tutorial. Unfortunately, it is for character sequences, while I'm aiming for word sequences. There is a block explaining what is required for word sequences, but this is currently throwing "wrong dimension" errors -- but that's OK, probably some data preparation errors on my side. More importantly, in this tutorial I can clearly see the 2 types of input and 1 type of output: encoder_input_data, decoder_input_data, decoder_target_data
  • MachineLearningMastery Tutorial: Here the network model looks very different, completely sequential with 1 input and 1 output. From what I can tell, here the decoder gets just the output of the encoder.

Is it correct to say that these are indeed two different approaches towards Seq2Seq? Which one is maybe better, and why? Or am I reading the 2nd tutorial wrongly? I already have an understanding of sequence classification and sequence labeling, but with sequence-to-sequence it hasn't properly clicked yet.


1 Answer


Yes, those two are different approaches, and there are other variations as well. MachineLearningMastery simplifies things a bit to make it accessible. I believe the Keras method might perform better, and it is what you will need if you want to advance to seq2seq with attention, which is almost always the case.

MachineLearningMastery has a hacky workaround that allows it to work without feeding in decoder inputs: it simply repeats the encoder's last hidden state and passes that as the decoder input at each timestep. This is not a flexible solution.

    model.add(RepeatVector(tar_timesteps))
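For reference, a minimal word-level sketch of that sequential setup could look like the following (the vocabulary sizes, sequence lengths and latent_dim are made-up placeholders, not values from the tutorial):

    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense

    src_vocab, tar_vocab = 5000, 5000       # assumed vocabulary sizes
    src_timesteps, tar_timesteps = 20, 20   # assumed (padded) sequence lengths
    latent_dim = 256

    model = Sequential()
    model.add(Embedding(src_vocab, latent_dim, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(latent_dim))                          # encoder: keeps only the final hidden state
    model.add(RepeatVector(tar_timesteps))               # repeat that state once per output timestep
    model.add(LSTM(latent_dim, return_sequences=True))   # decoder
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    model.compile(optimizer='adam', loss='categorical_crossentropy')

Note that this model never sees the target sequence as an input; the decoder only gets the repeated encoder state, which is exactly the inflexibility mentioned above.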

On the other hand, the Keras tutorial has several other concepts such as teacher forcing (using the targets as inputs to the decoder), embeddings (or rather the lack of them), and a lengthier inference process, but it should set you up for attention.
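To make the contrast concrete, here is a rough word-level sketch in the style of the Keras tutorial (the official tutorial is character-level with one-hot vectors; the Embedding layers and all sizes here are my assumptions):

    from keras.models import Model
    from keras.layers import Input, LSTM, Dense, Embedding

    num_encoder_tokens, num_decoder_tokens, latent_dim = 5000, 5000, 256  # assumed sizes

    # Encoder: read the source sequence, keep only the final LSTM states
    encoder_inputs = Input(shape=(None,))
    enc_emb = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
    _, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

    # Decoder: teacher forcing -- the (shifted) target sequence is a second input,
    # and the decoder LSTM is initialised with the encoder states
    decoder_inputs = Input(shape=(None,))
    dec_emb = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
    decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=[state_h, state_c])
    decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(decoder_outputs)

    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
    # model.fit([encoder_input_data, decoder_input_data], decoder_target_data, ...)

At inference time you would then build separate encoder and decoder models and feed the decoder its own previous prediction step by step, as in the tutorial's sampling loop.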

I would also recommend the PyTorch tutorial, which I feel uses the most appropriate method.

Edit: I don't know your task, but what you would want for word embeddings is

    x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)

Before that, you need to map every word in the vocabulary to an integer, turn every sentence into a sequence of integers, and pass that sequence of integers to the model (with an embedding layer of latent_dim, maybe 120). Each of your words is then represented by a vector of size 120. Also, your input sentences must all be of the same length, so find an appropriate maximum sentence length, trim every sentence to that length, and pad with zeros if a sentence is shorter than the max length, where 0 represents a null word perhaps.
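A rough sketch of that preparation with the Keras utilities (the variable names and toy sentences are just illustrations, not from your data):

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences

    sentences = ["a small example sentence", "another slightly longer example sentence here"]

    tokenizer = Tokenizer()                              # word indices start at 1, so 0 stays free for padding
    tokenizer.fit_on_texts(sentences)
    sequences = tokenizer.texts_to_sequences(sentences)  # every sentence -> list of integers

    max_len = max(len(s) for s in sequences)             # or a fixed maximum you pick yourself
    padded = pad_sequences(sequences, maxlen=max_len, padding='post')  # zero-pad shorter sentences

    num_encoder_tokens = len(tokenizer.word_index) + 1   # +1 because index 0 is the padding / null word
    print(padded.shape)                                  # (number_of_sentences, max_len)

The padded integer array is what you would then feed into the Embedding layer above.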

Littleone
  • Littleone, thanks! That helps! I didn't look into the PyTorch tutorial. It actually made me realize that using the (shifted) targets as input for the decoder is the "teacher forcing". While I'm also trying PyTorch, for this project/tutorial I need to stick with Keras. I just need to figure out how to adapt the Keras tutorial to word sequences -- the presented extension does use Embeddings, but I still struggle, probably because I'm not sure how the input data is supposed to be shaped. – Christian Feb 11 '18 at 01:41
  • Thanks for the edits. Yeah, I think (well, hope) I did the encoding and padding of the sequences correctly, since I used embeddings in previous tutorials, e.g., sequence classification and sequence labeling. I've tried for a couple more hours to no avail. [I therefore submitted another question](https://stackoverflow.com/questions/48728099/word-level-seq2seq-with-keras), much more specific to my problem. I show there how my input/target data and the models look. I really feel I'm missing something here. – Christian Feb 11 '18 at 03:54
  • In the 'Keras Tutorial', there is 'teacher forcing' using 'decoder_input_data', which is the same as 'target_data' offset by one timestep. This means the target is used as an input feature as well. In the 'MachineLearningMastery Tutorial', there does not seem to be this 'teacher forcing' element. Can you explain it a bit more? Thank you very much! – Jundong May 03 '18 at 23:47