Why are there strange characters in the embedding value?

Question

I am doing a simple text embedding task with the textEmbed function in r-text.

rm(list=ls())
Sys.setenv(LANG = "C.UTF-8", LC_ALL="C.UTF-8")
library(text)

temp <- textEmbed("I'm trying to do so good and I keep messing up my life. I hate it so much.", model="roberta-large", layers=23:24, dim_name = FALSE)

View(temp[["tokens"]][["texts"]][[1]])

In the result, the "tokens" column contains strange entries such as "Ġ", "<s>", "</s>", and "<pad>". Some of the embedding rows also have no values, only NA.
Could anyone kindly help me find out why?

I have tried nothing to solve it yet.

These tokens are inserted by the RoBERTa tokenizer. For a list of special tokens and their meaning see, for example, https://medium.com/analytics-vidhya/create-a-tokenizer-and-train-a-huggingface-roberta-model-from-scratch-f3ed1138180c - Marijn, Jan 17 '23 at 18:18

score 1 · Answer 1 · answered Jan 17 '23 at 19:51

Thanks to the comment below the question: these symbols come from the tokenizer used by RoBERTa.

https://medium.com/analytics-vidhya/create-a-tokenizer-and-train-a-huggingface-roberta-model-from-scratch-f3ed1138180c

"<s>" or BOS, Beginning Of Sentence
"</s>" or EOS, End Of Sentence
"<pad>", the padding token
"Ġ", which marks a token that was preceded by a space in the original text
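
If you only want the rows that correspond to actual pieces of your text, a minimal base R sketch along these lines may help. Everything in it is an assumption for illustration: the data frame `tokens_df` is a hand-made stand-in for the table in the question (with only two embedding columns), the numbers are invented, and the NA values on the `<pad>` row simply mirror what the question describes rather than confirmed `textEmbed` behaviour. "Ġ" is written as `"\u0120"` so the code does not depend on the file encoding.

```r
# Hypothetical token table shaped like the one in the question:
# a "tokens" column plus embedding dimensions (only two shown, values invented).
tokens_df <- data.frame(
  tokens = c("<s>", "I", "'m", "\u0120trying", "</s>", "<pad>"),  # "\u0120" is "Ġ"
  Dim1 = c(0.11, 0.52, -0.30, 0.27, 0.05, NA),
  Dim2 = c(-0.08, 0.14, 0.41, -0.66, 0.02, NA),
  stringsAsFactors = FALSE
)

# Special tokens listed above: BOS, EOS and padding.
special_tokens <- c("<s>", "</s>", "<pad>")

# Keep only the rows that correspond to real pieces of the input text.
word_rows <- tokens_df[!tokens_df$tokens %in% special_tokens, ]

# "Ġ" stands for a leading space, so turn it back into one.
word_rows$tokens <- gsub("\u0120", " ", word_rows$tokens, fixed = TRUE)

# Gluing the pieces together recovers the original text fragment.
reconstructed <- trimws(paste(word_rows$tokens, collapse = ""))
print(reconstructed)
```

For this toy table the last line prints "I'm trying". On real output you would apply the same filter to `temp[["tokens"]][["texts"]][[1]]`, after checking that its token column is actually named `tokens`.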
