I am trying to apply BPE to a piece of UTF-8 encoded text.

Here is the code:

from tokenizers import ByteLevelBPETokenizer
from tokenizers.decoders import ByteLevel

decoder = ByteLevel()

# list of the paths of your txt files
paths = ['my_corpus.txt']

tokenizer = ByteLevelBPETokenizer()

tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# `line` is assumed to be one line of the corpus (its definition is not shown)
with open(paths[0], encoding="utf-8") as f:
    line = f.readline().strip()

tokens = tokenizer.encode(line)
print(tokens.tokens[1])  # second token of the encoded line

The problem is that, because my_corpus.txt is UTF-8 encoded, the tokens I get back look garbled.

For example:

ítwa

changes to:

i Ìģ twa

You can see that the character encoding somehow changes (perhaps because BPE is done at the byte level?).

I thought that this would help:

print(decoder.decode(tokens.tokens[1]))

but I get the same output.
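
To make the suspicion concrete, here is a minimal sketch of the byte-to-character table used by GPT-2 style byte-level BPE, written in plain Python. It assumes that ByteLevelBPETokenizer displays its tokens with an equivalent table, and that the accent in the corpus is stored as "i" followed by U+0301 (combining acute accent); both are assumptions, not something confirmed above. The helper names `to_visible` and `from_visible` are made up for illustration. Under those assumptions the bytes 0xCC 0x81 of the accent show up as "Ìģ", which matches the output in the question.

```python
# Sketch of a GPT-2 style byte-to-character table (assumed to match what
# byte-level BPE uses when it prints tokens).

def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own code point.
    keep = (
        list(range(0x21, 0x7F))     # "!" .. "~"
        + list(range(0xA1, 0xAD))   # "¡" .. "¬"
        + list(range(0xAE, 0x100))  # "®" .. "ÿ"
    )
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    # Every remaining byte is moved to a code point starting at 256.
    for b in range(256):
        if b not in keep:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_visible(text):
    """Show the UTF-8 bytes of `text` the way byte-level tokens are printed."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_visible(visible):
    """Reverse the table and decode the recovered bytes as UTF-8."""
    return bytes(CHAR_TO_BYTE[c] for c in visible).decode("utf-8")


original = "i\u0301twa"          # "i" + COMBINING ACUTE ACCENT + "twa"
visible = to_visible(original)   # what the printed tokens would spell out
restored = from_visible(visible)

print(visible)
print(restored == original)
```

If this is what is happening, the text is not corrupted: the printed tokens are a byte-level rendering that can be mapped back. The reverse step needs all bytes of a character, so decoding the whole sequence (for example with `tokenizer.decode(tokens.ids)`) seems more likely to give readable text than decoding a single token.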

Is there a way to run BPE where the atomic symbol is a UTF-8 character rather than a byte (if what I suspect is happening is correct)?

kloop