I have a fully working seq2seq attention model with beam search, and it does give improved results. But inference takes over a minute per batch (batch size 1024) with k = 5 (k is the number of hypotheses I keep per step), because none of it is parallelised: everything happens one sample at a time.

Task (simplified)
The goal is sentence translation: 15 words of Lang A to 15 words of Lang B.

  • The encoder is an RNN that takes in a 15-word sentence and encodes a representation of it, giving out a [timestep, 512] matrix along with its final hidden state.
  • The decoder is another RNN that takes the encoder's final hidden state as its initial state, uses the [timestep, 512] matrix for attention, and outputs the translated words (for the whole batch) one timestep at a time. Up to this point everything is batched, so there is some parallelism.
  • During inference, beam search is used. At each decoder timestep, rather than taking the predicted word with the highest probability, I take the k best words and provide each of them as input to the next timestep so it can predict the following word (the rest of the algorithm is given below). This makes decoding less greedy, anticipating continuations with a higher total probability at later timesteps.
for each element in the test set
    calculate the initial k hypotheses (encoder + first decoder step)
    for range(timesteps - 1)
        for each of the previous k hypotheses
            get its hidden state
            obtain its best k next words
            save the hidden state
        find the new k from the k*k candidates
        # update hypotheses based on the newly found k
        for each of the new k
            copy the corresponding hidden state
            change the hypothesis if necessary
            append the new word to the hypothesis
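
In case it clarifies what I mean by parallelising: the two inner loops could in principle collapse into one batched decoder call by folding the k hypotheses into the batch dimension. A minimal sketch of a single step, assuming a PyTorch-style decoder (decoder, tokens and hidden are placeholder names, not my actual code):

    import torch

    def batched_decoder_step(decoder, tokens, hidden):
        # tokens: [batch * k] current input word ids, one per live hypothesis
        # hidden: [num_layers, batch * k, 512] decoder hidden states
        # One forward pass replaces the per-hypothesis Python loop.
        logits, hidden = decoder(tokens.unsqueeze(1), hidden)  # [batch * k, 1, vocab]
        log_probs = torch.log_softmax(logits.squeeze(1), dim=-1)
        return log_probs, hidden  # log_probs: [batch * k, vocab]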

There are 6 tensors and 2 lists to keep track of and handle state changes for. Is there any room for speedup or parallelisation here? Perhaps each of the k hypotheses can go through the encode-decode step simultaneously, with the k-from-k*k selection done by tensor ops along the lines of the sketch below? Any help is much appreciated.
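
Here is roughly what I mean by doing the selection with tensor ops; a sketch only, with illustrative shapes and names (beam_scores, log_probs, hidden are not from my real code):

    import torch

    def select_beams(beam_scores, log_probs, hidden, k):
        # beam_scores: [batch, k] cumulative log-prob of each live hypothesis
        # log_probs:   [batch, k, vocab] next-word log-probs from the decoder
        # hidden:      [num_layers, batch * k, hidden_size]
        batch, _, vocab = log_probs.shape
        # Add step scores to the running totals, then flatten so each sentence
        # has k * vocab candidate extensions (k*k if pre-pruned to the top k words).
        total = (beam_scores.unsqueeze(-1) + log_probs).view(batch, -1)
        top_scores, top_idx = total.topk(k, dim=-1)                  # [batch, k]
        beam_idx = torch.div(top_idx, vocab, rounding_mode="floor")  # parent hypothesis
        word_idx = top_idx % vocab                                   # next word id
        # Reorder hidden states so each surviving beam inherits its parent's
        # state, replacing the explicit "copy hidden state" step.
        h = hidden.view(hidden.size(0), batch, k, -1)
        idx = beam_idx[None, :, :, None].expand(h.size(0), batch, k, h.size(-1))
        h = h.gather(2, idx)
        return top_scores, word_idx, h.reshape(hidden.size(0), batch * k, -1)

If that is viable, the only per-timestep work left would be the topk and gather, and the whole test set could decode as a single [batch * k] mini-batch.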

Littleone
  • You need to give a lot more detail if you want people to understand your question. Like for example: what is k? Your encoder maps from what to what? What is your beam search searching for? etc. etc. – patapouf_ai Feb 22 '18 at 12:50
  • @patapouf_ai Sorry, I made a few assumptions about the terms. I have added a bit more about the task. It's a regular seq2seq attention model with beam search inside the decoder stage during inference. – Littleone Feb 22 '18 at 15:15
  • Your notation is still a bit confusing, no? k = 5, but if I understood correctly, k is also a word? – patapouf_ai Feb 23 '18 at 08:10
  • It is also not clear whether your search is exponential (first choose k hypotheses, then for each choose k hypotheses, then for each of those choose k, etc.) or quadratic (choose k hypotheses, for each choose another k, keep only the best k, then start over). – patapouf_ai Feb 23 '18 at 08:13
  • But in any case. Yes. Your algorithm would benefit greatly from parallelization. – patapouf_ai Feb 23 '18 at 08:14
  • @patapouf_ai k hypotheses means k possibilities. Say at the end of timestep 3, with k=3, the hypotheses would be "That is house", "That is a", "That house is" (assuming the task is LangX-to-English translation). The issue is that in the next timestep each of the 3 hypotheses gets its own best 3, which means I also need to track their corresponding hidden states (I can't parallelise easily because of this). From the 3x3 possible hypotheses, I choose 3 and send them to the next timestep, and so on, so it is not exponential. – Littleone Feb 24 '18 at 03:08
  • So in this case, yes, you should gain a 3x to 9x speedup by parallelizing. – patapouf_ai Feb 26 '18 at 08:07

0 Answers