remove emoticons in R using tm package

Question

I'm using the tm package to clean up a Twitter Corpus. However, the package is unable to clean up emoticons.

Here's code that reproduces the error:

July4th_clean <- tm_map(July4th_clean, content_transformer(tolower))
Error in FUN(content(x), ...) : invalid input 'RT ElleJohnson Love of country is encircling the globes ������������������ july4thweekend July4th FourthOfJuly IndependenceDay NotAvailableOnIn' in 'utf8towcs'

Can someone point me in the right direction to remove the emoticons using the tm package?

Thank you,

Luis

It is not clear from your example what you wish to eliminate. Do you want to eliminate substrings that contain multiple consecutive punctuation marks like :-) and (-_-) or are you trying to eliminate odd Unicode characters like ☺ and ❀ ? — G5W, Jul 03 '17 at 21:43
You are right. I assumed that it was a or something similar. — Luis, Jul 03 '17 at 23:03
I am an R newbie. Do you know how I could check that particular tweet? I imagine you use [], but I'm not sure which function or other part of the code is needed. — Luis, Jul 03 '17 at 23:04

score 6 · Accepted Answer · answered Jul 04 '17 at 12:14

You can use gsub to get rid of all non-ASCII characters.

Texts = c("Let the stormy clouds chase, everyone from the place ☁  ♪ ♬",
    "See you soon brother ☮ ",
    "A boring old-fashioned message" ) 

gsub("[^\x01-\x7F]", "", Texts)
[1] "Let the stormy clouds chase, everyone from the place    "
[2] "See you soon brother  "                                  
[3] "A boring old-fashioned message"

Details: You can specify character classes in regexes with [ ]. When the class description starts with ^, it means everything except these characters. Here, I have specified everything except characters 1-127, i.e. everything except standard ASCII, and I have specified that they should be replaced with the empty string.
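Since the question applies its cleaning steps through tm_map, here is a minimal sketch of how the same gsub call might be wrapped for that workflow. The helper name remove_non_ascii and the sample tweets are made up for illustration, and the tm part only runs if the package is installed. Whether this clears the utf8towcs error depends on how the original tweets were encoded, so treat it as a starting point rather than a guaranteed fix.

```r
# Hypothetical helper: drop every character outside the ASCII range 1-127.
remove_non_ascii <- function(x) gsub("[^\x01-\x7F]", "", x)

# Made-up sample tweets standing in for the real Twitter data.
tweets <- c("Love of country \u2764 July4th",
            "A boring old-fashioned message")

# Base R only: strip the non-ASCII characters, then lower-case the result.
cleaned <- remove_non_ascii(tweets)
print(tolower(cleaned))

# Optional: the same helper passed through content_transformer(),
# run before the tolower step. Skipped when tm is not installed.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(tweets))
  corpus <- tm::tm_map(corpus, tm::content_transformer(remove_non_ascii))
  corpus <- tm::tm_map(corpus, tm::content_transformer(tolower))
  print(as.character(corpus[[1]]))
}
```

Running the removal step first means tolower only ever sees plain ASCII text.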

What if you would like to keep characters with accents? For example, é and È. — Will M, May 21 '23 at 18:11
@WillM `gsub("[^\x01-\xFF]", "", Texts)` would leave the simple accented characters. — G5W, May 28 '23 at 13:04

score 1 · Answer 2 · answered Jul 03 '17 at 22:21

You can try this function:

iconv(July4th_clean, "latin1", "ASCII", sub="")

Duplicate issue, see post
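As a rough sketch of the iconv approach, the conversion can be applied to a plain character vector before any corpus is built. The sample tweets below are invented, and the example assumes the input is UTF-8 (so it passes from = "UTF-8" rather than "latin1"); adjust the source encoding to match the real data.

```r
# Made-up sample tweets; assumed here to be UTF-8 encoded.
tweets <- c("Love of country \u2764 July4th",
            "See you soon brother \u262E")

# Convert to ASCII; sub = "" drops anything that cannot be represented.
ascii_tweets <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")
print(ascii_tweets)

# The cleaned vector can now be lower-cased or handed on to tm.
lowered <- tolower(ascii_tweets)
print(lowered)
```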

zdeeb


Hi Zeyad, I did see that one but hesitated using it because the code was different than the tm code I was using. I was using the <- tm_map function. – Luis Jul 03 '17 at 23:06
You should run this before using the `tm` package. – zdeeb Jul 04 '17 at 15:38

score 0 · Answer 3 · answered Feb 02 '23 at 19:08

I'm using rm_non_words from the qdapRegex package, because it allows me to remove emojis and emoticons while keeping characters in different European languages.

e.g.:

library(qdapRegex)

x1 <- "Ελληνικά :), български, français ✅ "
x2 <- rm_non_words(x1)
print(x2)
[1] "Ελληνικά български français"
