I am trying to clean my text data and replace emojis with words so that I can perform a sentiment analysis later on.
To do this, I am using the replace_emoji function from the textclean package, which should replace all emojis with their corresponding words.
The dataset I am working with is a text corpus, which is why I use the VCorpus function from the tm package in my sample code below:
library(tm)        # VCorpus, VectorSource, tm_map, content_transformer, inspect
library(textclean) # replace_emoji
library(lexicon)   # hash_emojis
text <- "text goes here bla bla <u+0001f926><u+0001f3fd><u+200d><u+2640><u+fe0f>" # text with emojis
text.corpus <- VCorpus(VectorSource(text)) # transform the text into a corpus
text.corpus <- tm_map(text.corpus, content_transformer(function(x) replace_emoji(x, emoji_dt = lexicon::hash_emojis))) # should change emojis into words
inspect(text.corpus[[1]]) # the Unicode escapes were NOT replaced with words
head(hash_emojis) # the encoding in the lexicon differs from the encoding in my text data
The function itself runs without error, but it does not replace the emojis in my text with words. The reason seems to be that the encoding used in the hash_emojis dataset differs from the one in my data. I have also tried to convert the hash_emojis data with the iconv function, but I did not manage to change the encoding.
I would like to replace the Unicode values that are shown in my dataset with words.
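To make the mismatch concrete, here is a minimal base-R sketch of how the two representations might be related. It assumes that the lexicon keys are stored as byte escapes such as <f0><9f><a4><a6>, which is how head(hash_emojis) appears to print them, and that the <u+XXXX> tokens in the data are literal text rather than real characters. The helper unescape_unicode is hypothetical and not part of textclean or tm.

```r
# Hypothetical helper: turn literal "<u+XXXX>" tokens into the UTF-8
# characters they stand for. Not part of textclean, tm, or lexicon.
unescape_unicode <- function(x) {
  m <- gregexpr("<u\\+[0-9a-f]+>", x, ignore.case = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    # Strip the leading "<u+" and the trailing ">" to keep only the hex digits
    hex <- substr(tokens, 4, nchar(tokens) - 1)
    # Convert each hex code point to a one-character UTF-8 string
    vapply(strtoi(hex, 16L), intToUtf8, character(1))
  })
  enc2utf8(x)
}

text <- "text goes here bla bla <u+0001f926><u+0001f3fd><u+200d><u+2640><u+fe0f>"

# Step 1: literal escapes -> real emoji characters
utf8_text <- unescape_unicode(text)

# Step 2: real characters -> byte escapes, the form the lexicon is assumed to use
byte_text <- iconv(utf8_text, from = "UTF-8", to = "ASCII", sub = "byte")
print(byte_text)
```

If that assumption holds, applying replace_emoji to utf8_text instead of the raw text might find matches in hash_emojis. I have not confirmed this, and the result may depend on the locale, particularly on Windows.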