
I am doing a simple text embedding task with the textEmbed function in r-text.

rm(list=ls())
Sys.setenv(LANG = "C.UTF-8", LC_ALL="C.UTF-8")
library(text)

temp <- textEmbed("I'm trying to do so good and I keep messing up my life. I hate it so much.", model="roberta-large", layers=23:24, dim_name = FALSE)

View(temp[["tokens"]][["texts"]][[1]])

In the result, the "tokens" column contains strange entries: "Ġ", "<s>", "</s>", and "<pad>". Some of the embedding rows also have no values, only NA.
Could anyone help me understand why?

I have tried nothing to solve it yet.

AlexGu
  • These tokens are inserted by the RoBERTa tokenizer. For a list of the special tokens and their meanings, see for example https://medium.com/analytics-vidhya/create-a-tokenizer-and-train-a-huggingface-roberta-model-from-scratch-f3ed1138180c – Marijn Jan 17 '23 at 18:18
  • Thanks, it is a great help. – AlexGu Jan 17 '23 at 19:48

1 Answer


Thanks to the comment below the question: these symbols come from the tokenizer used by RoBERTa.

https://medium.com/analytics-vidhya/create-a-tokenizer-and-train-a-huggingface-roberta-model-from-scratch-f3ed1138180c

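  • "<s>" or BOS, Beginning Of Sentence
  • "</s>" or EOS, End Of Sentence
  • "<pad>", the padding token
  • "Ġ", a marker that the token was preceded by a space in the original text (RoBERTa's byte-level BPE tokenizer writes a leading space as this character)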
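
To make the list above concrete, here is a minimal base R sketch of cleaning such a token column. The token vector is a hypothetical example written by hand to resemble the output described in the question; it is not real textEmbed output, and the variable names are only illustrative.

```r
# Hypothetical token column, shaped like what a RoBERTa tokenizer produces
# for the text "I hate it" padded to a fixed length (assumed, not real output).
# "\u0120" is the "Ġ" character.
tokens <- c("<s>", "I", "\u0120hate", "\u0120it", "</s>", "<pad>", "<pad>")

# The special tokens named in the list above.
special_tokens <- c("<s>", "</s>", "<pad>")

# Keep only the tokens that come from the input text itself.
word_tokens <- tokens[!tokens %in% special_tokens]

# Turn each "Ġ" back into the space it stands for.
readable <- gsub("\u0120", " ", word_tokens, fixed = TRUE)

# Paste the pieces together to recover the original text.
reconstructed <- paste(readable, collapse = "")
print(reconstructed)
```

The same filter on the special tokens could be applied to the rows of the data frame viewed in the question, assuming its "tokens" column holds values like these.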
AlexGu