
I have downloaded the Norwegian BERT model from https://github.com/botxo/nordic_bert and loaded it using:

import transformers as t

model_class = t.BertModel
tokenizer_class = t.BertTokenizer

tokenizer = tokenizer_class.from_pretrained("/PATH/TO/MODEL/FOLDER")
model = model_class.from_pretrained("/PATH/TO/MODEL")
model.eval()

This works very well; however, when I try to tokenize a given sentence, some Nordic characters such as "ø" and "æ" are kept, whereas the character "å" is replaced with "a" in every word that contains it. For instance:

s = "æ ø å løpe få ærfugl"
print(tokenizer.tokenize(s))

Yields:

['æ', 'ø', 'a', 'løp', '##e', 'fa', 'ær', '##fugl']

Thanks

  • When you check the vocab.txt, you will see that `å` is not a token. Therefore the tokenizer can't produce it. Is `å` a single word? Because `å` is part of other tokens. – cronoik Jul 29 '20 at 12:53
  • I managed to solve it; å was indeed in vocab.txt, so the problem wasn't there. It worked by using BertTokenizerFast and setting strip_accents=False. It appears the error was in unicodedata.normalize in the strip-accents function :) – Christian Vennerød Jul 30 '20 at 13:12
  • When I run `grep ^å vocab.txt` it returns nothing. That means `å` is not in the vocab.txt. – cronoik Jul 30 '20 at 15:36
  • Have you downloaded and used the Norwegian model? It's based on Norwegian words and characters. I pressed Ctrl+F and found å in the vocab, weirdly enough. – Christian Vennerød Jul 31 '20 at 16:09
  • Of course the `å` is in the vocab.txt of the Norwegian model (975 times to be exact), but that doesn't mean it is also a single token (i.e. an entry of the vocabulary). I have also looked closer into it and think that it is not entirely compatible with the huggingface tokenizer, because the provided vocab.txt contains only subword tokens (indicated with ##TOKEN) and no single words (by huggingface's definition, entries that don't start with ##). They are also not referring to huggingface but to the official google bert github, which has its own tokenizer and uses tensorflow. – cronoik Jul 31 '20 at 20:26
  • When I execute `print(tokenizer.tokenize(s))` with version 3.0.2 it produces the correct result following the explanation above which is: `['[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]']` – cronoik Jul 31 '20 at 20:44
  • Yeah, you are correct about the vocab.txt file. I changed the format of that file to make it match huggingface's format, which worked. The vocab.txt file is useless with huggingface's BERT if one doesn't do that. Won't spend too much time arguing on this, but the tokenizer changed every 'å' into 'a' while keeping all 'ø' and 'æ' until I set strip_accents to False. It didn't work until I did that. – Christian Vennerød Aug 05 '20 at 17:44
  • I also confirmed this by running the following code on my own computer: `text = unicodedata.normalize("NFD", text)` and `cat = unicodedata.category("Å")` – taken from the huggingface BertTokenizerFast source code (see the snippet below). – Christian Vennerød Aug 05 '20 at 17:51
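
For reference, here is a minimal sketch (plain standard-library Python, independent of transformers) of why only "å" is affected: NFD normalization decomposes "å" into "a" plus a combining ring above (category "Mn"), which the strip-accents step removes, while "ø" and "æ" have no such decomposition and pass through unchanged.

import unicodedata

for ch in "å ø æ".split():
    nfd = unicodedata.normalize("NFD", ch)
    # Code points with category "Mn" (nonspacing mark) are what the
    # strip-accents step drops; "å" decomposes, "ø" and "æ" do not.
    print(ch, [(c, unicodedata.category(c)) for c in nfd])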

1 Answer


It worked by using BertTokenizerFast and setting strip_accents=False. It appears the error was in unicodedata.normalize in the strip-accents function. A minimal sketch of the working setup follows the note below.

  • Naturally, one has to alter the vocab.txt file to make it match the huggingface BERT tokenizer format.
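
A minimal sketch, assuming a transformers version whose BertTokenizerFast accepts the strip_accents keyword (3.x does) and keeping the model path as a placeholder:

import transformers as t

# Load the fast tokenizer; strip_accents=False prevents "å" from being
# normalized to "a" during basic tokenization.
tokenizer = t.BertTokenizerFast.from_pretrained(
    "/PATH/TO/MODEL/FOLDER",
    strip_accents=False,
)

s = "æ ø å løpe få ærfugl"
print(tokenizer.tokenize(s))  # "å" should now be preserved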