Let's say we train a BPE tokenizer on this string: D C B B A B C D C B A B C D
As I understand it, BPE merges the most frequent pair of symbols at each step. But what will the algorithm merge here first: DC, CB, BA, AB, BC, or CD? All of them occur twice in this dummy corpus (only BB occurs once).
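To make the tie concrete, here is a quick way to count the first-round pair frequencies in plain Python. I'm assuming the effective training string is the lowercased, space-free `dcbbabcdcbabcd` that the tokenizer output below implies:

```python
from collections import Counter

corpus = "dcbbabcdcbabcd"  # assumed form of the dummy string

# Count adjacent symbol pairs, as the first BPE training step does
pairs = Counter(zip(corpus, corpus[1:]))
print(pairs)
# Counter({('d', 'c'): 2, ('c', 'b'): 2, ('b', 'a'): 2, ('a', 'b'): 2,
#          ('b', 'c'): 2, ('c', 'd'): 2, ('b', 'b'): 1})
```

Six pairs are tied at count 2, so the frequency criterion alone can't decide the first merge.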
This seems important because the first merge shapes the final vocabulary. So how does BPE handle such ties? And do the tie-breaking rules vary across implementations?
For the full picture, I'm using the Hugging Face implementation, `ByteLevelBPETokenizer`. It always tokenizes the string like this: `['d', 'c', 'b', 'babcd', 'c', 'babcd']`. However, I don't understand why it seems to start merging from BA, or what the underlying rules are.
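For reference, here is roughly how I'm producing that output. The `vocab_size` and `min_frequency` values are arbitrary choices on my part, not anything prescribed:

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Train on the single dummy string; hyperparameters are arbitrary
tokenizer.train_from_iterator(["dcbbabcdcbabcd"], vocab_size=300, min_frequency=2)

print(tokenizer.encode("dcbbabcdcbabcd").tokens)
# prints ['d', 'c', 'b', 'babcd', 'c', 'babcd'] as described above

# Writes vocab.json and merges.txt; merges.txt lists the learned merges in order
tokenizer.save_model(".")
```

Inspecting `merges.txt` shows the order in which merges were learned, but that still doesn't tell me what rule produced that order.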