In a machine translation seq2seq model (using an RNN/GRU/LSTM) we provide a sentence in a source language and train the model to map it to a sequence of words in a target language (e.g., English to German).
The idea is that at each step the decoder produces a score vector the size of the target vocabulary; a softmax is applied to this vector, followed by an argmax to get the index of the most probable word.
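To make the setup concrete, here is a minimal sketch of that output step, assuming PyTorch; `hidden_size`, `vocab_size`, and the tensor shapes are illustrative placeholders rather than values from any specific model:

```python
import torch
import torch.nn as nn

hidden_size = 512      # decoder hidden state size (assumed for illustration)
vocab_size = 50_000    # target vocabulary size (assumed for illustration)

# Projection from the decoder hidden state to one score per vocabulary word
projection = nn.Linear(hidden_size, vocab_size)

decoder_hidden = torch.randn(1, hidden_size)   # stand-in for one decoder time step
logits = projection(decoder_hidden)            # shape: (1, vocab_size)
probs = torch.softmax(logits, dim=-1)          # probability distribution over target words
next_word_index = torch.argmax(probs, dim=-1)  # index of the most probable word
```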
My question is: is there a practical upper limit on how large the target vocabulary should be, considering that:
- The performance should remain reasonable (a softmax over a larger vector takes more time; see the sketch after this list)
- The accuracy/correctness of the predictions stays acceptable
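For context on the first point, here is a rough timing sketch, again assuming PyTorch on CPU and with arbitrary vocabulary sizes; it illustrates how the per-step cost of the output projection plus softmax grows with the vocabulary:

```python
import time
import torch
import torch.nn as nn

hidden_size = 512
decoder_hidden = torch.randn(64, hidden_size)  # a batch of 64 decoder states

for vocab_size in (10_000, 50_000, 200_000):
    projection = nn.Linear(hidden_size, vocab_size)
    start = time.perf_counter()
    for _ in range(100):
        # One output step: project to vocabulary scores, then normalize
        probs = torch.softmax(projection(decoder_hidden), dim=-1)
    elapsed = time.perf_counter() - start
    print(f"vocab={vocab_size:>7}: {elapsed:.3f}s for 100 output steps")
```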