I am trying to apply BPE to a piece of text that is UTF-8 encoded.
Here is the code:
import io
from tokenizers import ByteLevelBPETokenizer
from tokenizers.decoders import ByteLevel

decoder = ByteLevel()
# list of the paths of your txt files
paths = ['my_corpus.txt']
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
# read one line of UTF-8 text from the corpus and tokenize it
with io.open(paths[0], encoding='utf-8') as f:
    line = f.readline()
tokens = tokenizer.encode(line)
print(tokens.tokens[1])
The problem is that, because my_corpus.txt uses UTF-8, the string I get back is garbled.
For example:
ítwa
changes to:
i Ìģ twa
You can see that the character encoding somehow changes (perhaps because the BPE is done at the byte level?).
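To check my suspicion, here is a minimal sketch (plain Python, no tokenizers involved) showing that the accent spans several UTF-8 bytes either way, which a byte-level tokenizer could split apart; whether my corpus stores the text precomposed or decomposed is an assumption on my part:

# UTF-8 bytes of the example text, in both Unicode normalization forms;
# the accent takes extra bytes either way, so byte-level BPE could split it.
print("\u00edtwa".encode("utf-8"))   # precomposed "í": b'\xc3\xadtwa'
print("i\u0301twa".encode("utf-8"))  # "i" + combining accent: b'i\xcc\x81twa'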
I thought this would help:
print(decoder.decode(tokens.tokens[1]))
but I get the same output.
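If I read the tokenizers docs correctly, the ByteLevel decoder's decode() takes a list of token strings rather than a single token, so I assume the call that should undo the byte-to-character mapping would look roughly like this (untested guess on my part):

# decode() joins the token strings and reverses the byte-level mapping
print(decoder.decode(tokens.tokens))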
Is there a way to run BPE where the atomic symbol is a UTF-8 character rather than a byte (if what I suspect is happening is correct)?