
At each iteration, the WordPiece algorithm for subword tokenization merges the pair of symbols that increases the likelihood the most. In the literature it is only mentioned that this is the likelihood of a language model (e.g., the same likelihood used during decoding, in the case of NMT). Does anyone know which likelihood was used for the pre-processing of BERT?
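
To make the question concrete, here is a toy sketch of the merge criterion as it is commonly summarized for WordPiece (e.g., in the Hugging Face tokenizers course): the pair score is count(ab) / (count(a) · count(b)), i.e., the merge that raises the unigram language-model likelihood of the training data the most. This is not BERT's actual preprocessing code, just an illustration of the criterion I am asking about; the function name and corpus are made up.

```python
from collections import Counter

def best_merge(corpus_symbols):
    """Pick the pair whose merge most increases the unigram-LM likelihood.

    corpus_symbols: list of symbol sequences (one per word occurrence).
    Toy illustration only, not the actual BERT/WordPiece preprocessing code.
    """
    unigram = Counter()
    pair = Counter()
    for seq in corpus_symbols:
        unigram.update(seq)                 # symbol counts
        pair.update(zip(seq, seq[1:]))      # adjacent-pair counts

    # Commonly cited WordPiece score:
    #   score(a, b) = count(ab) / (count(a) * count(b))
    # which corresponds to choosing the merge that maximizes the
    # unigram language-model likelihood of the training corpus.
    def score(p):
        a, b = p
        return pair[p] / (unigram[a] * unigram[b])

    return max(pair, key=score)

# Toy corpus of pre-split words
corpus = [
    ["l", "o", "w"],
    ["l", "o", "w"],
    ["l", "o", "w", "e", "r"],
    ["n", "e", "w"],
]
print(best_merge(corpus))  # ('e', 'r') under this score, not the most frequent pair ('l', 'o')
```

Note that a plain BPE frequency criterion would pick the most frequent pair ("l", "o") here, whereas the likelihood-based score prefers a rarer pair whose parts are themselves rare. My question is which likelihood (and over which corpus) was actually used when building BERT's WordPiece vocabulary.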

SweetSpot

0 Answers