What's the general tradeoff between choosing BPE vs WordPiece Tokenization? When is one preferable to the other? Are there any differences in model performance between the two? I'm looking for a general overall answer, backed up with specific examples. Thanks!
- Does this answer your question? [How is WordPiece tokenization helpful to effectively deal with rare words problem in NLP?](https://stackoverflow.com/questions/55382596/how-is-wordpiece-tokenization-helpful-to-effectively-deal-with-rare-words-proble) – dennlinger Jun 02 '20 at 17:50
- The question is when to use which. How should one choose one over the other? – vgoklani Aug 16 '20 at 18:00
1 Answer
In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once it is added to the vocabulary. Maximizing the likelihood of the training data is equivalent to finding the symbol pair whose probability, divided by the product of the probabilities of its first and second symbols, is the greatest among all symbol pairs.
Intuitively, WordPiece differs slightly from BPE in that it evaluates what it loses by merging two symbols, to ensure the merge is worth it.
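To make the difference concrete, here is a minimal sketch (plain Python; the toy corpus and helper name are invented for illustration) of how the two criteria can pick different merges from the same pair statistics:

```python
from collections import Counter

def merge_statistics(word_freqs, splits):
    """Compute adjacent-pair and single-symbol frequencies over the current word splits."""
    pair_freqs, symbol_freqs = Counter(), Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for s in symbols:
            symbol_freqs[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freqs[(a, b)] += freq
    return pair_freqs, symbol_freqs

# Toy corpus: words with counts, each initially split into characters.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
splits = {w: list(w) for w in word_freqs}

pair_freqs, symbol_freqs = merge_statistics(word_freqs, splits)

# BPE: merge the most frequent pair.
best_bpe = max(pair_freqs, key=pair_freqs.get)

# WordPiece: merge the pair with the highest freq(pair) / (freq(first) * freq(second)).
best_wp = max(pair_freqs,
              key=lambda p: pair_freqs[p] / (symbol_freqs[p[0]] * symbol_freqs[p[1]]))

print("BPE would merge:      ", best_bpe)
print("WordPiece would merge:", best_wp)
```

On this toy data BPE merges the most frequent pair, while WordPiece prefers a rarer pair whose symbols almost always occur together, which is exactly the "is the merge worth it" intuition.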
So WordPiece is optimized for a given training corpus: it tends to produce a smaller vocabulary, and hence fewer embedding parameters to train, so convergence can be faster. However, this advantage may not hold once the training data changes.
If your training data is fixed or very similar to the new data, go for WordPiece.
If your training data changes substantially, go for BPE.
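In practice, both schemes are available in the Hugging Face tokenizers library, so it is easy to train each on your own corpus and compare; a minimal sketch, assuming a local corpus.txt file (the file name, vocab size, and special tokens below are just placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece
from tokenizers.trainers import BpeTrainer, WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# BPE: merges chosen by raw pair frequency.
bpe_tok = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tok.pre_tokenizer = Whitespace()
bpe_tok.train(["corpus.txt"], BpeTrainer(vocab_size=30000, special_tokens=["[UNK]"]))

# WordPiece: merges chosen by the likelihood-based score described above.
wp_tok = Tokenizer(WordPiece(unk_token="[UNK]"))
wp_tok.pre_tokenizer = Whitespace()
wp_tok.train(["corpus.txt"], WordPieceTrainer(vocab_size=30000, special_tokens=["[UNK]"]))

print(bpe_tok.encode("tokenization tradeoffs").tokens)
print(wp_tok.encode("tokenization tradeoffs").tokens)
```

Comparing the resulting vocabularies and the average number of tokens per word on held-out text is a reasonable way to decide between the two for a specific dataset.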
