
In most BPE (Byte-Pair Encoding) tutorials, it is said that `</w>` should be appended to each word. The function of this marker is to distinguish whether a subword is a prefix of a word or a suffix of a word.

We know that the input of the model is a sequence of subwords (usually represented by IDs), and the output of the model is naturally also a sequence of subwords. But this sequence is obviously not very readable; we still need to merge these subwords to get back a normal word sequence. The role of the `</w>` marker is to allow merging subwords into words: without `</w>`, we have no way of knowing where a word's boundary is.
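For example, here is a minimal sketch (not taken from any particular implementation; the subword values are made up) of how an explicit `</w>` marker lets us rebuild words during decoding:

```python
# Hypothetical BPE output: "</w>" marks the end of a word.
subwords = ["lo", "w", "er</w>", "ne", "west</w>"]

# Concatenate the subwords and turn each end-of-word marker into a space.
text = "".join(subwords).replace("</w>", " ").strip()
print(text)  # -> "lower newest"
```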

HuggingFace's BPE implementation is based on the source code of OpenAI's GPT-2. I checked their source code carefully and found no marker like `</w>`, so how do we get a normal sequence during the decoding process?

  • None of the BPE tutorials or implementations I have seen, including the [original article](https://aclanthology.org/P16-1162/), uses an end-of-word marker. Can you provide some examples? – noe Aug 13 '23 at 09:40
  • @noe The `</w>` in Algorithm 1 of the paper you mentioned represents the end of a word. – korangar leo Aug 13 '23 at 11:28
  • Ahh, I see the problem now. That marker is part of the token, not a token per se. Once the BPE vocabulary creation is finished, you normally invert the mark: you mark the tokens that lack the end-of-word marker. In the [original implementation](https://github.com/rsennrich/subword-nmt), this was marked with `@@`. That's why, to restore the original text, you simply had to remove the occurrences of "@@ ", so that the tokens belonging to the same word were joined together. That's why you won't see any end-of-word marker in BPE vocabularies. – noe Aug 13 '23 at 13:33
  • I have added an answer with this information. – noe Aug 13 '23 at 13:35

1 Answer


The end-of-word marker `</w>` is part of the tokens during the creation of the vocabulary, not a token per se.

Once the BPE vocabulary creation is finished, you normally invert the mark: you mark the tokens that lack the end-of-word marker. In the original implementation, the lack of the end-of-word marker was expressed as `@@`. That's why, to restore the original text, you simply had to remove the occurrences of "@@ ", so that the tokens belonging to the same word were joined together.
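As a minimal sketch (with made-up tokens, not the actual subword-nmt code), decoding then amounts to deleting the occurrences of "@@ ":

```python
# Hypothetical subword-nmt-style output: a trailing "@@" marks a token
# that does NOT end a word.
tokens = ["un@@", "relat@@", "ed", "words"]

# Join with spaces and delete "@@ " to glue word-internal pieces together.
text = " ".join(tokens).replace("@@ ", "")
print(text)  # -> "unrelated words"
```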

In the HuggingFace implementation, they mimic OpenAI's implementation and use a slightly different approach, representing the space as part of the tokens themselves. For this, they use the `\u0120` marker ("Ġ"), which you can see at the beginning of many tokens in the GPT-2 vocabulary. You can see details about this in this GitHub issue, and this HuggingFace discussion shares some context on it.
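For illustration, here is a minimal sketch (with made-up tokens, not the actual GPT-2 code) of how this convention is undone at decoding time:

```python
# Hypothetical GPT-2-style output: the space byte 0x20 is remapped to the
# printable character "\u0120" ("Ġ"), so a token that starts a new word
# begins with "Ġ".
tokens = ["Hello", "Ġworld", "!"]

# Concatenate and map "Ġ" back to a space.
text = "".join(tokens).replace("\u0120", " ")
print(text)  # -> "Hello world!"
```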

That's why you won't see any end-of-word marker in BPE vocabularies.

noe
  • Thank you for your detailed answer. I still have a doubt: for a subword algorithm like this, is it necessary to add some marker that determines word boundaries? Otherwise it would not be possible to decode the subword sequence back into a word sequence. With BPE, some people like to add it in front of the word and some people like to add it after the word; either way determines the word boundary. The way WordPiece marks the boundary seems very consistent: it only marks the subwords that are not at the beginning of a word, i.e. it prepends `##` to the subword (see the sketch after these comments). – korangar leo Aug 13 '23 at 15:31
  • Whether or not spaces are needed depends on the specific implementation. In principle, BPE does not need to know word boundaries, and it is possible to use it with text that has no explicit separation between words, like Chinese. – noe Aug 15 '23 at 07:59
  • Nevertheless, it is usual that the implementation is optimized for using word boundaries (see [this](https://github.com/rsennrich/subword-nmt/issues/53#issuecomment-405822372) and [this](https://github.com/google/sentencepiece#whitespace-is-treated-as-a-basic-symbol)) and, for that, it is also usual to have a previous step to split text into words, separating them with spaces (e.g. with [sacremoses](https://github.com/alvations/sacremoses) in English and Romance languages, [Jieba](https://github.com/fxsjy/jieba) in Chinese). – noe Aug 15 '23 at 07:59
  • @korangarleo if you find the answer useful, please consider upvoting it. Also, please consider accepting it (with the tick mark ✓ next to it) if you deem it correct or, alternatively, please describe in a comment why you consider it incorrect or not clear enough. – noe Aug 29 '23 at 11:34
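For completeness, a minimal sketch (with made-up tokens) of the WordPiece-style decoding mentioned in the comments, where a `##` prefix marks a subword that continues the previous one:

```python
# Hypothetical WordPiece output: "##" marks a continuation subword.
tokens = ["play", "##ing", "foot", "##ball"]

# Join with spaces, then glue continuation pieces onto the previous token.
text = " ".join(tokens).replace(" ##", "")
print(text)  # -> "playing football"
```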