Most BPE (Byte-Pair Encoding) tutorials mention appending a </w> marker to the end of each word. The purpose of this marker is to distinguish whether a subword is a prefix or a suffix of a word.
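As a rough sketch of this preprocessing step (a toy example, not code from any particular library), each word is split into characters and the last one gets the </w> marker attached, so a word-final piece like er</w> becomes a different symbol from a word-internal er:

```python
# Toy sketch of the classic word-level BPE setup: split each word into
# characters and attach </w> to the last one, so that pieces learned at
# the end of a word are distinct symbols from pieces learned inside it.

def to_symbols(word: str) -> list[str]:
    return list(word[:-1]) + [word[-1] + "</w>"]

print(to_symbols("lower"))      # ['l', 'o', 'w', 'e', 'r</w>']
print(to_symbols("lowercase"))  # ['l', 'o', 'w', 'e', 'r', 'c', 'a', 's', 'e</w>']
# After BPE merges, 'er</w>' can only be produced at the end of a word,
# while a plain 'er' can only appear word-internally.
```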
We know that the input to the model is a sequence of subwords (usually represented by IDs), and the output of the model is likewise a sequence of subwords. Such a sequence is not very readable on its own, so we still need to merge the subwords back together to recover normal text. This is where the </w> marker plays its role: it lets us join subwords back into words. Without </w>, we would not know where the word boundaries are.
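Concretely, decoding under this scheme can be sketched as follows (the token list below is made up purely for illustration): concatenate the subwords and treat every </w> as a word boundary, i.e. a place to insert a space.

```python
# Sketch of decoding with </w> markers: concatenate the subwords,
# then turn every </w> into a space (a word boundary).

def decode(subwords: list[str]) -> str:
    text = "".join(subwords)            # e.g. 'the</w>lower</w>price</w>'
    return text.replace("</w>", " ").strip()

tokens = ["the</w>", "low", "er</w>", "price</w>"]
print(decode(tokens))  # 'the lower price'
```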
Hugging Face's BPE implementation is based on the source code of OpenAI's GPT-2. I read through that source code carefully and found no marker like </w>, so how do we recover a normal sequence during decoding?
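For reference, the observation is easy to reproduce with the transformers library (assuming it is installed): the tokens GPT-2 produces contain no </w>, yet decoding still recovers the original text.

```python
# Quick check with Hugging Face transformers: GPT-2's tokenizer emits
# no </w> markers, yet round-tripping through encode/decode still
# restores the spaces between words.
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
print(tok.tokenize("hello world"))
# -> ['hello', 'Ġworld']   (no </w> anywhere)
print(tok.decode(tok.encode("hello world")))
# -> 'hello world'
```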