
With BPE or WordPiece there might be multiple ways to encode a word. For instance, assume (for simplicity) the token vocabulary contains all letters as well as the merged symbols ("to", "ke", "en"). Then the word "token" could be encoded as ("to", "ke", "n") or ("to", "k", "en"). Such ambiguous encodings are also mentioned in this tutorial https://blog.floydhub.com/tokenization-nlp/

However, the Hugging Face tutorial mentions that "BPE and WordPiece [...] work out rules in a certain order that you can then apply in the same order when tokenizing new text"; see https://huggingface.co/transformers/master/tokenizer_summary.html.

How exactly are these rules stored and applied when using BPE/WordPiece, e.g., in my example above, how is it determined which tokenization to use?

SweetSpot
  • It just means you may use a BPE or a WordPiece (or SentencePiece) model to encode some text and then decode to obtain the original text. If you are training from scratch, choose any; when you train incrementally, you will need to apply the same tokenization scheme. – Wiktor Stribiżew Aug 05 '20 at 11:12
  • OK, thanks, but let's say I have used BPE/WordPiece for pre-processing and then trained a language model like GPT or BERT. Now I apply the trained model to a new text, which contains an ambiguous word ("token" in my example). It obviously makes a difference how this word is processed, since it affects the prediction made by the model. So how is the encoding of the word determined? – SweetSpot Aug 05 '20 at 11:21
  • My guess is that BPE/WordPiece always use the largest units possible. However, sometimes all possible subword tokenizations might have the same length (e.g., as in my example) – SweetSpot Aug 05 '20 at 11:44
  • I do not think you can have the same word tokenized in a different way. Even if it is, it should not be a problem. – Wiktor Stribiżew Aug 05 '20 at 11:49

2 Answers


In the parsing step of BPE, the merging order matters. For instance, if the merging order is

(p, e), (pe, n), (pen, _), (a, p), (ap, p), (app, l), (appl, e), (apple, _), (pen, apple_)

then, given k = 2, "Applepen PenapplePen" (lowercased) is segmented as [a, p, p, l, e, pen, pen, a, p, p, l, e, pen]: we just use the first two merges, (p, e) and (pe, n), for parsing. Since the merging order is fixed, the result is deterministic on test data for any k; you simply apply the first k merges, in order, in the parsing step.
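
As a minimal Python sketch of this parsing step (illustrative only, not taken from any library; apply_merges is a made-up helper, and the end-of-word marker "_" plays no role here because only the first k = 2 merges are applied):

def apply_merges(word, merges, k):
    """Segment one word using only the first k merges, applied in the learned order."""
    symbols = list(word)
    for left, right in merges[:k]:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # merge this adjacent pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

merges = [("p", "e"), ("pe", "n"), ("pen", "_"), ("a", "p"), ("ap", "p"),
          ("app", "l"), ("appl", "e"), ("apple", "_"), ("pen", "apple_")]

print([tok for word in "applepen penapplepen".split()
       for tok in apply_merges(word, merges, k=2)])
# ['a', 'p', 'p', 'l', 'e', 'pen', 'pen', 'a', 'p', 'p', 'l', 'e', 'pen']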

For the details please refer to my answer to the question: Explain bpe (Byte Pair Encoding) with examples?

Lerner Zhang

The BPE algorithm learns the merge rules in a particular order, based on the frequencies of adjacent subtoken pairs. This ordering is then used greedily during the encoding of new text.

Considering the example above, let's say the pair (e, n) appears 10 times in your training corpus, (t, o) appears 6 times, and (k, e) appears 4 times. The BPE algorithm will learn 3 rules and apply them in the following order (a small training sketch is given after the list):

1. e, n -> en
2. t, o -> to
3. k, e -> ke
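
A rough training sketch of how such rules emerge (illustrative only; learn_bpe and the toy word counts below are made up, chosen just so the pair frequencies come out as 10, 6, and 4):

from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn merge rules by repeatedly merging the most frequent adjacent pair."""
    splits = {word: list(word) for word in word_freqs}  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair becomes the next rule
        merges.append(best)
        for word, symbols in splits.items():  # apply the new rule everywhere
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged
    return merges

# Hypothetical corpus counts: (e, n) occurs 3 + 7 = 10 times, (t, o) 6 times, (k, e) 4 times.
word_freqs = {"ten": 3, "en": 7, "to": 6, "ke": 4}
print(learn_bpe(word_freqs, num_merges=3))
# [('e', 'n'), ('t', 'o'), ('k', 'e')]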

Encoding the new text "token" doesn't proceed from left to right; instead, the rules are applied one at a time, in their learned order. Therefore, the text will be encoded as follows:

Rule 1: t o k e n -> t o k en
Rule 2: t o k en -> to k en
Rule 3: to k en -> to k en (no adjacent (k, e) pair remains, so nothing changes)
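
A small sketch that replays this trace (illustrative; apply_rule is a made-up helper, not the actual tokenizer code):

def apply_rule(symbols, pair):
    """Merge every adjacent occurrence of `pair` in the symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

rules = [("e", "n"), ("t", "o"), ("k", "e")]  # learned order from the example above
symbols = list("token")
for n, rule in enumerate(rules, start=1):
    symbols = apply_rule(symbols, rule)
    print(f"Rule {n}: {' '.join(symbols)}")
# Rule 1: t o k en
# Rule 2: to k en
# Rule 3: to k en

Under these (assumed) frequencies, the ambiguity from the question is thus resolved in favour of ("to", "k", "en").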

The Byte-Pair Encoding tokenization chapter of the Hugging Face course provides a reference implementation.

Paul Baltescu