I have a dataset with strings (data$text) containing the names of emojis instead of the actual emojis (e.g., FACE_WITH_TEARS_OF_JOY). I'm trying to replace each emoji name with the actual emoji. The names and emojis are stored in a separate dataset that serves as a "dictionary" (emojis$name and emojis$emoji).

So this is my dataset:

data <- structure(list(text = c("blabla HUGGING_FACE PARTY_POPPER", "bla FACE_WITH_TEARS_OF_JOY bla FACE_WITH_TEARS_OF_JOY", "PARTY_POPPER")), class = "data.frame", row.names = c(NA, -3L))

looking like:

                                                   text
1                      blabla HUGGING_FACE PARTY_POPPER
2 bla FACE_WITH_TEARS_OF_JOY bla FACE_WITH_TEARS_OF_JOY
3                                          PARTY_POPPER

Note that the emoji names are just part of the text. The rest of the text should remain unchanged.

And this is my "dictionary":

emojis <- structure(list(name = c("FACE_WITH_TEARS_OF_JOY", "HUGGING_FACE", 
                                  "PARTY_POPPER"), emoji = c("\U0001f602", "\U0001f917", "\U0001f389"
                                  )), class = "data.frame", row.names = c(NA, -3L))

looking like:

                    name      emoji
1 FACE_WITH_TEARS_OF_JOY \U0001f602
2           HUGGING_FACE \U0001f917
3           PARTY_POPPER \U0001f389

For a single emoji this code works:

data$text <- gsub("FACE_WITH_TEARS_OF_JOY", "\U0001f602", data$text)

the result is:

                              text
1 blabla HUGGING_FACE PARTY_POPPER
2    bla \U0001f602 bla \U0001f602
3                     PARTY_POPPER

However, I want to replace the other emoji names as well. The result should be:

                           text
1  blabla \U0001f917 \U0001f389
2 bla \U0001f602 bla \U0001f602
3                    \U0001f389

As there are thousands of emojis, I need something like:

data$text <- gsub(emojis$name, emojis$emoji, data$text)

This doesn't work (warning: "argument 'pattern' has length > 1 and only the first element will be used"), and I couldn't find a solution myself.

Thanks in advance!

Wiktor Stribiżew
  • Please read the info at the top of the [tag:r] tag page and note in particular that minimal but complete reproducible code and input (using `dput`) are needed. No one can run this code as the input is missing. – G. Grothendieck Nov 13 '21 at 16:10
  • Thanks for letting me know – I've edited the question and hope it's better now. – just_asking Nov 14 '21 at 17:04

3 Answers

You can use the mapvalues function from the plyr package. Example:

library(plyr)

# data
data <- data.frame("ID" = 1:5, "text" = c("FACE_WITH_TEARS", "FACE_WITH_JOY",
                      "FACE_WITH_JOY", "FACE_WITH_PLAIN", "FACE_WITH_TEARS"))

# "dictionary" 
emojis <- data.frame("name" = c("FACE_WITH_TEARS", "FACE_WITH_JOY", "FACE_WITH_PLAIN"),
                     "emojis" = c("CRY", "HAPPY", "NEUTRAL"))

data$text <- mapvalues(data$text, emojis$name, emojis$emojis)

data

The result is

  ID    text
1  1     CRY
2  2   HAPPY
3  3   HAPPY
4  4 NEUTRAL
5  5     CRY
bdedu
  • Many thanks! This seems to only work when there is no text around the emoji name. It does, e.g., replace "PARTY_POPPER" with "\U0001f389" but not "blabla PARTY_POPPER" with "blabla \U0001f389". Sorry that I haven't made my question clearer. I've edited it now. – just_asking Nov 14 '21 at 17:08

You can also use stringr::str_replace_all with setNames to create a dictionary out of your emojis dataframe:

data <- structure(list(text = c("blabla HUGGING_FACE PARTY_POPPER", "bla FACE_WITH_TEARS_OF_JOY bla FACE_WITH_TEARS_OF_JOY", "PARTY_POPPER")), class = "data.frame", row.names = c(NA, -3L))
emojis <- structure(list(name = c("FACE_WITH_TEARS_OF_JOY", "HUGGING_FACE", 
                                  "PARTY_POPPER"), emoji = c("\U0001f602", "\U0001f917", "\U0001f389"
                                  )), class = "data.frame", row.names = c(NA, -3L))

library(stringr)
stringr::str_replace_all(data$text, setNames(emojis$emoji, emojis$name))

Output:

[1] "blabla 🤗 🎉"  "bla 😂 bla 😂" "🎉"
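
One caveat with the call above: each name is used as a regular expression, so it would also match inside a longer token. A minimal sketch of one possible guard is to wrap each name in \\b word boundaries before building the named vector. This assumes the emoji names consist only of letters, digits, and underscores, which the question does not state. The variable names text, dict, and result and the PARTY_POPPERS sample string are made up for illustration:

```r
library(stringr)

# Dictionary from the question
emojis <- data.frame(
  name  = c("FACE_WITH_TEARS_OF_JOY", "HUGGING_FACE", "PARTY_POPPER"),
  emoji = c("\U0001f602", "\U0001f917", "\U0001f389"),
  stringsAsFactors = FALSE
)

# First string is from the question; the second is a made-up near miss
text <- c("blabla HUGGING_FACE PARTY_POPPER", "PARTY_POPPERS stay as text")

# Assumption: names contain only word characters, so \\b marks their edges
dict <- setNames(emojis$emoji, paste0("\\b", emojis$name, "\\b"))

# Named vector: names are the patterns, values are the replacements
result <- str_replace_all(text, dict)
print(result)
```

With the boundaries in place, PARTY_POPPERS in the second string should be left alone, because no word boundary falls between PARTY_POPPER and the trailing S.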
Wiktor Stribiżew
  • Thanks as well. Different from your test case, in my dataset str_replace_all only works for cases without text around the emoji name. It does, e.g., replace "PARTY_POPPER" with "\U0001f389" but not "blabla PARTY_POPPER" with "blabla \U0001f389". Sorry that I haven't made my question clearer. I've edited it now and provided a reproducible data set. – just_asking Nov 14 '21 at 17:17
  • @just_asking Use the updated code. – Wiktor Stribiżew Nov 14 '21 at 17:26

1) gsubfn: Create a dictionary, dict, as a list whose names are the strings to replace and whose values are their replacements. Then use gsubfn to replace each run of non-space characters, "\\S+", that has an entry in the dictionary; runs without an entry are left unchanged. gsubfn takes the same arguments as gsub, except that the second argument can be such a list (or certain other objects).

library(gsubfn)

dict <- with(emojis, setNames(as.list(emoji), name))
gsubfn("\\S+", dict, data$text)
## [1] "blabla 🤗 🎉"  "bla 😂 bla 😂" "🎉"

2) Base R: This uses Reduce to iterate over the rows of emojis, applying one replacement per row to the text.

gsub_ <- function(s, i) with(emojis[i, ], gsub(name, emoji, s))
Reduce(gsub_, init = data$text, 1:nrow(emojis))
## [1] "blabla 🤗 🎉"  "bla 😂 bla 😂" "🎉"
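
A variation on the base R approach, offered only as a sketch: gsub treats each name as a regular expression, and with thousands of names a short name could overwrite part of a longer one if it is applied first. The helper replace_names below (a hypothetical name) matches literally with fixed = TRUE and applies longer names first. The extra dictionary entry FACE, its emoji, and the second sample string are made up purely to show the ordering effect:

```r
# Dictionary from the question plus one made-up entry ("FACE") that is a
# substring of two longer names
emojis <- data.frame(
  name  = c("FACE", "FACE_WITH_TEARS_OF_JOY", "HUGGING_FACE", "PARTY_POPPER"),
  emoji = c("\U0001f600", "\U0001f602", "\U0001f917", "\U0001f389"),
  stringsAsFactors = FALSE
)
text <- c("blabla HUGGING_FACE PARTY_POPPER",
          "bla FACE_WITH_TEARS_OF_JOY bla FACE")

replace_names <- function(text, dict) {
  # Apply longer names first so "FACE" cannot eat part of "HUGGING_FACE"
  dict <- dict[order(nchar(dict$name), decreasing = TRUE), ]
  for (i in seq_len(nrow(dict))) {
    # fixed = TRUE matches the name literally, not as a regular expression
    text <- gsub(dict$name[i], dict$emoji[i], text, fixed = TRUE)
  }
  text
}

result <- replace_names(text, emojis)
print(result)
```

The loop makes one pass over the text per dictionary row, the same amount of work as the Reduce version; only the ordering and the literal matching differ.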
G. Grothendieck