
I'm using the tm package to clean up a Twitter Corpus. However, the package is unable to clean up emoticons.

Here is the code that reproduces the error:

July4th_clean <- tm_map(July4th_clean, content_transformer(tolower))
Error in FUN(content(x), ...) : invalid input 'RT ElleJohnson Love of country is encircling the globes ������������������ july4thweekend July4th FourthOfJuly IndependenceDay NotAvailableOnIn' in 'utf8towcs'

Can someone point me in the right direction to remove the emoticons using the tm package?

Thank you,

Luis

DiegoS
  • It is not clear from your example what you wish to eliminate. Do you want to eliminate substrings that contain multiple consecutive punctuation marks like :-) and (-_-) or are you trying to eliminate odd Unicode characters like ☺ and ❀ ? – G5W Jul 03 '17 at 21:43
  • You are right. I assumed that it was a or something similar. – Luis Jul 03 '17 at 23:03
  • I am an R newbie. Do you know how I could check that particular tweet? I imagine you use [ ], but I'm not sure which function or what other code is needed. – Luis Jul 03 '17 at 23:04
  • Hi G5W, the emoticon is a peach and a USA flag. – Luis Jul 04 '17 at 02:38
  • I am trying to eliminate odd Unicode characters. – Luis Jul 04 '17 at 02:39

3 Answers


You can use gsub to get rid of all non-ASCII characters.

Texts = c("Let the stormy clouds chase, everyone from the place ☁  ♪ ♬",
    "See you soon brother ☮ ",
    "A boring old-fashioned message" ) 

gsub("[^\x01-\x7F]", "", Texts)
[1] "Let the stormy clouds chase, everyone from the place    "
[2] "See you soon brother  "                                  
[3] "A boring old-fashioned message"

Details: In a regex, [ ] defines a character class. When the class description starts with ^, it matches everything except the listed characters. Here the class excludes characters 1-127 (\x01-\x7F), i.e. standard ASCII, so every non-ASCII character is matched and replaced with the empty string.
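Since the question uses tm_map, here is a minimal sketch of how the same gsub call might be wrapped with content_transformer so it runs on each document in the corpus. The sample tweets and the names `tweets`, `corpus`, and `remove_non_ascii` are made up for illustration, and the sketch assumes the text is valid UTF-8 (text containing invalid bytes may need to be converted first).

```r
library(tm)

# Hypothetical sample tweets standing in for the original Twitter corpus
tweets <- c("Love of country \U0001F351 \U0001F1FA\U0001F1F8 July4th",
            "A plain ASCII tweet")
corpus <- VCorpus(VectorSource(tweets))

# Wrap gsub so tm_map can apply it to the content of every document
remove_non_ascii <- content_transformer(function(x) gsub("[^\x01-\x7F]", "", x))

# Strip non-ASCII characters first, then lower-case the remaining text
corpus <- tm_map(corpus, remove_non_ascii)
corpus <- tm_map(corpus, content_transformer(tolower))

# Inspect the cleaned text of each document
print(content(corpus[[1]]))
print(content(corpus[[2]]))
```

Running the non-ASCII removal before tolower matters here, because tolower is the step that raised the error in the question.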

G5W

You can try this function:

iconv(July4th_clean, "latin1", "ASCII", sub="")

Duplicate issue, see post

zdeeb
  • Hi Zeyad, I did see that one but hesitated using it because the code was different than the tm code I was using. I was using the <- tm_map function. – Luis Jul 03 '17 at 23:06
  • You should run this before using the `tm` package. – zdeeb Jul 04 '17 at 15:38
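As a rough sketch of what running iconv "before using the tm package" could look like, the example below converts a plain character vector before any corpus is built. The vector `raw_tweets` is made up for illustration, and the source encoding is assumed to be UTF-8 rather than the "latin1" used in the answer, so adjust `from` to match your data.

```r
# Hypothetical raw tweet text; in practice this would be the character
# vector that the corpus is later built from
raw_tweets <- c("Love of country \U0001F351 July4th",
                "A plain ASCII tweet")

# Convert to ASCII; sub = "" drops anything that cannot be represented
# (assumption: the input is UTF-8, not latin1 as in the answer above)
ascii_tweets <- iconv(raw_tweets, from = "UTF-8", to = "ASCII", sub = "")

print(ascii_tweets)
```

Note that this also drops accented letters, so it may be too aggressive for text that is not in English.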

I'm using rm_non_words from the qdapRegex package, because it allows me to remove emojis and emoticons while keeping characters in different European languages.

e.g.:

library(qdapRegex)

x1 <- "Ελληνικά :), български, français ✅ "
x2 <- rm_non_words(x1)
print(x2)
[1] "Ελληνικά български français"