I have downloaded the Norwegian BERT model from https://github.com/botxo/nordic_bert and loaded it using:
import transformers as t
model_class = t.BertModel
tokenizer_class = t.BertTokenizer
tokenizer = tokenizer_class.from_pretrained("/PATH/TO/MODEL/FOLDER")
model = model_class.from_pretrained("/PATH/TO/MODEL")
model.eval()
This works well. However, when I tokenize a sentence, some Nordic characters such as "ø" and "æ" are preserved, whereas every "å" is replaced with "a". For instance:
s = "æ ø å løpe få ærfugl"
print(tokenizer.tokenize(s))
Yields:
['æ', 'ø', 'a', 'løp', '##e', 'fa', 'ær', '##fugl']
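The same pattern can be reproduced without the model. The sketch below assumes the replacement comes from accent stripping (Unicode NFD decomposition followed by dropping combining marks), which is my guess at what the tokenizer does when lowercasing is enabled; I have not confirmed it for this model. The helper `strip_accents` is a hypothetical name of my own, not part of transformers.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose characters (NFD), then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

s = "æ ø å løpe få ærfugl"
# "å" decomposes into "a" plus a combining ring, which is dropped;
# "æ" and "ø" have no canonical decomposition, so they pass through unchanged.
print(strip_accents(s))  # æ ø a løpe fa ærfugl
```

If this is indeed the cause, passing `strip_accents=False` to `from_pretrained` when loading the tokenizer might preserve "å". Whether the model's vocabulary actually contains tokens with "å" is a separate question that I have not checked.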
Thanks