
I use the roberta-base tokenizer, `tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', add_prefix_space=True)`, which is trained on English data, to tokenize Bengali, just to see how it behaves. When I try to encode a Bengali character, `tokenizer.encode('বা')`, I get `[0, 1437, 35861, 11582, 35861, 4726, 2]`, which means that it finds some tokens in its vocabulary that match Bengali characters, even though it was trained on English. On further exploration I find these are all special characters: `['<s>', 'Ġ', 'à¦', '¬', 'à¦', '¾', '</s>']`. My question is: why does this happen? Isn't it supposed to output unknown tokens when applied to a new language? Any help greatly appreciated.
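
For reference, a runnable reproduction of the above (assuming the `transformers` library and the `roberta-base` checkpoint are available):

```python
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', add_prefix_space=True)

ids = tokenizer.encode('বা')
print(ids)
# [0, 1437, 35861, 11582, 35861, 4726, 2]
print(tokenizer.convert_ids_to_tokens(ids))
# ['<s>', 'Ġ', 'à¦', '¬', 'à¦', '¾', '</s>']
```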

Soumya
  • Roberta uses a [byte level BPE](https://huggingface.co/transformers/tokenizer_summary.html). – cronoik Sep 07 '21 at 16:00
  • As mentioned before, RoBERTa uses a byte-level BPE tokenizer. That means the input is tokenized on the byte level, and the tokenizer therefore finds tokens that represent `বা`. – cronoik Oct 17 '21 at 08:49

1 Answer


As mentioned in the comments, the reason is that the RoBERTa tokenizer is byte-based, and not character-based.

In UTF-8, characters are represented by different numbers of bytes, and the encoding is heavily skewed towards the Latin alphabet: ASCII characters take a single byte, while the "longest" characters take four bytes. An example from Wikipedia:

Char | Code point | UTF-8 bytes
-----|------------|-------------
$    | U+0024     | 24
¢    | U+00A2     | C2 A2
ह    | U+0939     | E0 A4 B9
€    | U+20AC     | E2 82 AC
𐍈    | U+10348    | F0 90 8D 88
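
The table is easy to verify in plain Python, since `str.encode('utf-8')` gives the raw bytes for each character:

```python
# Print the code point and UTF-8 bytes for each character in the table above.
for ch in ["$", "¢", "ह", "€", "𐍈"]:
    print(ch, f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" ").upper())
# $ U+0024 24
# ¢ U+00A2 C2 A2
# ह U+0939 E0 A4 B9
# € U+20AC E2 82 AC
# 𐍈 U+10348 F0 90 8D 88
```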

The byte-level BPE tokenizer used by RoBERTa thus first segments the text into bytes. This is always possible, and since there are only 256 distinct byte values, nothing is ever OOV. Known groups of bytes are then merged into tokens from the vocabulary.
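
For বা concretely (plain Python, no tokenizer needed):

```python
# 'বা' is two code points, ব (U+09AC) and া (U+09BE), each of which is
# three bytes in UTF-8. Byte-level BPE starts from these six bytes, and
# since all 256 possible byte values are in the vocabulary, nothing is OOV.
raw = 'বা'.encode('utf-8')
print(raw.hex(' ').upper())  # E0 A6 AC E0 A6 BE
print(len(raw))              # 6
```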

The tokenizer also does special handling of spaces and special characters. First, it segments the text on special characters and spaces, and replaces the spaces with a special character: in the original SentencePiece implementation, it is the underscore-like character ▁ (U+2581); in the Hugging Face byte-level BPE implementation, it is Ġ. This special character is also prepended to the very beginning of the sentence, so words are consistently represented when they are at the beginning or in the middle of a sentence (with `RobertaTokenizerFast`, this prepending only happens when the tokenizer is created with `add_prefix_space=True`, as in the question).
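
A quick way to see the Ġ marker and the effect of `add_prefix_space` (again assuming the `roberta-base` checkpoint):

```python
from transformers import RobertaTokenizerFast

# With add_prefix_space=True, a space is prepended, so even the first
# word carries the Ġ marker.
tok = RobertaTokenizerFast.from_pretrained('roberta-base', add_prefix_space=True)
print(tok.tokenize('hello world'))        # ['Ġhello', 'Ġworld']

# Without it, only words preceded by an actual space are marked.
tok_plain = RobertaTokenizerFast.from_pretrained('roberta-base')
print(tok_plain.tokenize('hello world'))  # ['hello', 'Ġworld']
```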

So, the output that you see is basically:

  1. The special space symbol Ġ, prepended because of `add_prefix_space=True`, and
  2. Four byte-level tokens covering the six UTF-8 bytes of the two characters ব (E0 A6 AC) and া (E0 A6 BE); BPE merged the first two bytes of each character into the single token à¦, and the byte-level vocabulary displays raw bytes as printable characters (here the Latin-1 characters for E0, A6, AC, and BE),

which means that the string বা is not in the vocabulary as a whole, so it ends up being represented at the byte level, and bytes are always known.
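
And because the representation is byte-level, no information is lost; decoding the ids reassembles the original bytes back into বা:

```python
from transformers import RobertaTokenizerFast

tok = RobertaTokenizerFast.from_pretrained('roberta-base', add_prefix_space=True)
ids = tok.encode('বা')
print(tok.decode(ids, skip_special_tokens=True))
# ' বা' (the leading space comes from add_prefix_space=True)
```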

Jindřich
  • `This special character is also prepended to the very beginning of the sentence, so words are consistently represented when they are at the beginning or in the middle of a sentence.` Afaik this is not the case as long as you do not force the tokenizer to do that with a parameter: `t.tokenize('hello world hello world') -> ['hello', 'Ġworld', 'Ġhello', 'Ġworld']` – cronoik Oct 23 '21 at 14:54