Replace Emojis in R with replace_emoji() function does not work due to different encoding - UTF8/Unicode?

Question

I am trying to clean my text data and replace Emojis with words so that I can perform a sentiment analysis later on.

Therefore, I am using the replace_emoji function from the textclean package. This should replace all emojis with their corresponding words.

The dataset I am working with is a text corpus, which is why I use the VCorpus function from the tm package in the sample code below:

library(tm)
library(textclean)

text <- "text goes here bla bla <u+0001f926><u+0001f3fd><u+200d><u+2640><u+fe0f>" # text with emojis

text.corpus <- VCorpus(VectorSource(text)) # transform into a corpus
# This step should change emojis into words
text.corpus <- tm_map(text.corpus, content_transformer(function(x) replace_emoji(x, emoji_dt = lexicon::hash_emojis)))

inspect(text.corpus[[1]]) # inspecting the corpus shows that the Unicode was NOT replaced with words

head(lexicon::hash_emojis) # shows that the encoding in the lexicon differs from the encoding in my text data

Although the function itself runs, it does not replace the emojis in my text, because the encoding in the "hash_emojis" dataset seems to differ from the encoding in my data. I have also tried to convert the "hash_emojis" data with the iconv function, but did not manage to change the encoding.

I would like to replace the Unicode values shown in my dataset with words.

Problems of `Encoding` / storing emojis in objects :-(. Hate these issues as they can be OS dependent. If I test "text goes here bla bla " the code works on my machine. Also works with "text goes here bla bla \U0001f600\U0001f602". But "text goes here bla bla " does not work. The first 2 are UTF8, but the last one will show encoding "unknown" even though this is the unicode representation of the emojis. Testing with <9F><98><81> also works, which tells me the emoji table is in UTF8 format and doesn't handle the unicode representation of your data. — phiver, Jun 07 '20 at 18:22
hey @phiver. Yes, exactly! I have tried it with variations as well, all of which worked - but as you mentioned, for some reason it does not take the unicode representation of the emojis (unfortunately, this is the representation in my dataset). Thus, I also thought about converting the encoding which is stored in the lexicon/hash_emojis dataset into unicode so that it can recognize the representation in my data. However, I was unable to do that as well. — lole_emily, Jun 07 '20 at 19:29
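
The difference described in these comments can be checked from the console. This is a minimal sketch in base R, assuming the `<u+...>` placeholder form from the question; the variable names are illustrative only.

```r
# A string with real emoji characters, written with \U escapes,
# which R stores and marks as UTF-8.
real_emoji <- "text goes here bla bla \U0001f600\U0001f602"

# The form from the question: plain ASCII text that merely spells out
# the code points between angle brackets.
placeholder <- "text goes here bla bla <u+0001f600><u+0001f602>"

Encoding(c(real_emoji, placeholder))
# [1] "UTF-8"   "unknown"

# Every character of the placeholder string is ASCII, so it contains
# no actual emoji character that a lookup table could match.
all(utf8ToInt(placeholder) < 128L)
# [1] TRUE
```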

score 1 · Accepted Answer · answered Jun 08 '20 at 09:47

I found an answer to your question. I will mark this question as a duplicate later today, once you have read my answer.

Using my example:

library(stringi)
library(magrittr)

"text goes here bla bla <u+0001F600><u+0001f602>"  %>% 
  stri_replace_all_regex("<u\\+([[:alnum:]]{4})>", "\\\\u$1") %>% 
  stri_replace_all_regex("<u\\+([[:alnum:]]{5})>", "\\\\U000$1") %>% 
  stri_replace_all_regex("<u\\+([[:alnum:]]{6})>", "\\\\U00$1") %>% 
  stri_replace_all_regex("<u\\+([[:alnum:]]{7})>", "\\\\U0$1") %>% 
  stri_replace_all_regex("<u\\+([[:alnum:]]{8})>", "\\\\U$1") %>% 
  stri_replace_all_regex("<u\\+([[:alnum:]]{1})>", "\\\\u000$1") %>% 
  stri_replace_all_regex("<u\\+([[:alnum:]]{2})>", "\\\\u00$1") %>% 
  stri_replace_all_regex("<u\\+([[:alnum:]]{3})>", "\\\\u0$1") %>% 
  stri_unescape_unicode() %>% 
  stri_enc_toutf8() %>% 
  textclean::replace_emoji()

[1] "text goes here bla bla grinning face face with tears of joy "

Be careful with the Unicode representation. The example answer has the "U" in upper case; I changed it to lower case "u" to match your example.
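
If the data might contain both `<U+...>` and `<u+...>`, the case of the prefix and the number of hex digits can also be handled in a single pass. The following is a sketch of an alternative in base R, not the method from the linked answer; `decode_u_plus` is a hypothetical helper name, and it assumes every placeholder holds 1 to 8 hex digits that form a valid code point.

```r
# Hypothetical helper: turn "<U+XXXX>" / "<u+XXXX>" placeholders
# (1 to 8 hex digits) into the real UTF-8 characters they name.
decode_u_plus <- function(x) {
  pattern <- "<[Uu]\\+([0-9A-Fa-f]{1,8})>"
  vapply(x, function(s) {
    m <- gregexpr(pattern, s, perl = TRUE)
    tokens <- regmatches(s, m)[[1]]
    if (length(tokens) == 0L) return(s)  # nothing to decode
    # Keep only the hex digits, then convert each code point to a character.
    hex <- sub(pattern, "\\1", tokens, perl = TRUE)
    chars <- intToUtf8(strtoi(hex, base = 16L), multiple = TRUE)
    # Put the decoded characters back where the placeholders were.
    regmatches(s, m) <- list(chars)
    s
  }, character(1), USE.NAMES = FALSE)
}

decoded <- decode_u_plus("text goes here bla bla <u+0001F600><U+0001f602>")
utf8ToInt(decoded)  # code points of the result, including the two emojis
```

Under the same assumptions, the result could be passed on to textclean::replace_emoji() in the same way as the output of the stringi pipeline.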

To combine everything:

# create a function to use within tm_map
unicode_replacement <- function(text) {
  text %>% 
    stri_replace_all_regex("<u\\+([[:alnum:]]{4})>", "\\\\u$1") %>% 
    stri_replace_all_regex("<u\\+([[:alnum:]]{5})>", "\\\\U000$1") %>% 
    stri_replace_all_regex("<u\\+([[:alnum:]]{6})>", "\\\\U00$1") %>% 
    stri_replace_all_regex("<u\\+([[:alnum:]]{7})>", "\\\\U0$1") %>% 
    stri_replace_all_regex("<u\\+([[:alnum:]]{8})>", "\\\\U$1") %>% 
    stri_replace_all_regex("<u\\+([[:alnum:]]{1})>", "\\\\u000$1") %>% 
    stri_replace_all_regex("<u\\+([[:alnum:]]{2})>", "\\\\u00$1") %>% 
    stri_replace_all_regex("<u\\+([[:alnum:]]{3})>", "\\\\u0$1") %>% 
    stri_unescape_unicode() %>% 
    stri_enc_toutf8()
}

library(tm)
library(textclean)
text.corpus <- VCorpus(VectorSource(text)) #Transforming into corpus
text.corpus <- tm_map(text.corpus, content_transformer(unicode_replacement))
text.corpus <- tm_map(text.corpus, content_transformer(function(x) replace_emoji(x, emoji_dt = lexicon::hash_emojis)))  

inspect(text.corpus[[1]]) 

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 92

text goes here bla bla <f0><9f><a4><a6><f0><9f><8f><bd><e2><80><8d> female sign <ef><b8><8f>

With your example you get the outcome above. Checking the emoji table, the code points from your example do not appear in it, except for the female sign. But that is another issue. If I use "text goes here bla bla " the outcome is as expected.
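
The `<f0><9f>...` runs left in the output are the UTF-8 bytes of the characters that were not replaced. As a rough aid for checking which code points are missing, the sketch below prints the byte form of each code point from the question, so it can be compared with the entries shown by head(lexicon::hash_emojis). It uses only base R. That the table is keyed on this byte notation is an assumption based on the output above, and the variable names are illustrative.

```r
# The five code points from the question, written as real characters.
emoji_seq <- "\U0001f926\U0001f3fd\u200d\u2640\ufe0f"

# One string per code point.
code_points <- utf8ToInt(emoji_seq)
chars <- intToUtf8(code_points, multiple = TRUE)

# Show the UTF-8 bytes of each character in "<xx>" notation by asking
# iconv() to substitute every byte that cannot be represented in ASCII.
bytes <- iconv(chars, from = "UTF-8", to = "ASCII", sub = "byte")

data.frame(code_point = sprintf("U+%04X", code_points), bytes = bytes)
```

The `bytes` column should line up with the byte runs in the inspect() output, with the entry for U+2640 being the one that was replaced by "female sign".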

Hi @phiver - you really helped and saved the day - once again! :-) Thank you so much! It worked perfectly. — lole_emily, Jun 08 '20 at 14:30
