Let's say we train a BPE tokenizer on this string: D C B B A B C D C B A B C D
As I understand it, BPE merges the most frequent pair of symbols at each step. But what will the algorithm merge here first: DC, CB, BA, AB, BC, or CD? All of them occur twice in this dummy corpus (only BB occurs once).
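To make the tie concrete, here is a quick way to count the first-round pair frequencies in plain Python. I'm assuming the effective training string is the lowercased, space-free `dcbbabcdcbabcd` that the tokenizer output below implies:

```python
from collections import Counter

corpus = "dcbbabcdcbabcd"  # assumed form of the dummy string

# Count adjacent symbol pairs, as the first BPE training step does
pairs = Counter(zip(corpus, corpus[1:]))
print(pairs)
# Counter({('d', 'c'): 2, ('c', 'b'): 2, ('b', 'a'): 2, ('a', 'b'): 2,
#          ('b', 'c'): 2, ('c', 'd'): 2, ('b', 'b'): 1})
```

Six pairs are tied at count 2, so the frequency criterion alone can't decide the first merge.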
This seems important because the first merge shapes the final vocabulary. So how does BPE handle such ties? And do the tie-breaking rules vary across implementations?
For the full picture, I'm using the Hugging Face implementation, `ByteLevelBPETokenizer`. It always tokenizes the string like this: `['d', 'c', 'b', 'babcd', 'c', 'babcd']`. However, I don't understand why it seems to start merging from BA, or what the underlying rules are.
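For reference, here is roughly how I'm producing that output. The `vocab_size` and `min_frequency` values are arbitrary choices on my part, not anything prescribed:

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Train on the single dummy string; hyperparameters are arbitrary
tokenizer.train_from_iterator(["dcbbabcdcbabcd"], vocab_size=300, min_frequency=2)

print(tokenizer.encode("dcbbabcdcbabcd").tokens)
# prints ['d', 'c', 'b', 'babcd', 'c', 'babcd'] as described above

# Writes vocab.json and merges.txt; merges.txt lists the learned merges in order
tokenizer.save_model(".")
```

Inspecting `merges.txt` shows the order in which merges were learned, but that still doesn't tell me what rule produced that order.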