I am working with text that includes emoticons (emoji). I need to find these and replace them with tags that can be analysed. How can I do this?
> main$text[[4]]
[1] "Spread d wrd\xf0\u009f\u0098\u008e"
> grepl("\xf0", main$text[[4]])
[1] FALSE
I tried the above. Why did it not work? I also tried converting the text to ASCII with iconv, passing "byte" as the substitution argument; the byte escapes I got back could then be searched with grepl.
> abc<-iconv(main$text[[4]], "UTF-8", "ASCII", "byte")
> abc
[1] "Spread d wrd<f0><9f><98><8e>"
> grepl("<f0>", abc)
[1] TRUE
I really do not understand what I did here or what happened. I also do not understand how the above conversion introduced \n characters into the text.
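To make this reproducible without my data, here is a minimal sketch. It assumes the emoji in main$text[[4]] is stored as the four UTF-8 bytes f0 9f 98 8e, which is my reading of the iconv output rather than something I have verified; x and y are stand-in names.

```r
# Stand-in for main$text[[4]]: plain ASCII text followed by the four bytes
# f0 9f 98 8e (assumed from the iconv output above).
x <- rawToChar(c(charToRaw("Spread d wrd"), as.raw(c(0xf0, 0x9f, 0x98, 0x8e))))
Encoding(x) <- "UTF-8"   # declare the bytes as UTF-8

# Look at the raw bytes of the last four positions.
tail(charToRaw(x), 4)

# With sub = "byte", iconv writes each byte it cannot convert to ASCII as <xx>.
y <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")
y

# y is now plain ASCII, so an ordinary fixed-string search works on it.
grepl("<f0>", y, fixed = TRUE)
```

If that assumption holds, the "<f0>" I matched is literal ASCII text produced by iconv, not the original byte.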
I also did not know how to encode these once they were searchable. I found a list here, but it fell short (for example, "U+E00E", which is <ee><80><8e> in bytes, was not in the list). Is there a comprehensive list for such a mapping?
ADDENDUM
After a lot of trial and error, here is what I realised. There are two kinds of encodings for the emojis in the data. One is in the form of bytes, which is searchable with grepl("\x9f", ...., useBytes=T), as in main$text[[4]]. The other, as in main$text[[6]], is searchable as the Unicode character without useBytes=T, i.e. grepl("\ue00e",....). Even the way they are displayed in View() and when printed on the console is different. I am absolutely confused as to what is going on here.
main$text[[4]]
[1] "Spread d wrd\xf0\u009f\u0098\u008e"
main[4,]
            timestamp fromMe        remoteResource remoteResourceDisplayName type
b 2014-08-30 02:58:58  FALSE 112233@s.whatsapp.net                       ABC text
                                      text   date
b Spread d wrd<f0><U+009F><U+0098><U+008E> 307114
main$text[[6]]
[1] ""
main[6,]
            timestamp fromMe       remoteResource remoteResourceDisplayName type     text
b 2014-08-30 02:59:17  FALSE 12345@s.whatsapp.net                       XYZ text <U+E00E>
    date
b 307114
grepl("\ue00e", main$text[[6]])
[1] TRUE
grepl("<U+E00E>", main$text[[6]])
[1] FALSE
grepl("\u009f", main$text[[4]])
[1] FALSE
grepl("\x9f", main$text[[4]])
[1] FALSE
grepl("\x9f", main$text[[4]], fixed=T)
[1] FALSE
grepl("\x9f", main$text[[4]], useBytes=T)
[1] TRUE
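A minimal sketch of how the two cases might be told apart, using stand-in strings rather than my data. It assumes that main$text[[4]] carries raw bytes with no declared encoding while main$text[[6]] is marked as UTF-8, which I have not confirmed; a and b are hypothetical names.

```r
# a: four raw bytes with no declared encoding (assumed to resemble main$text[[4]])
a <- rawToChar(as.raw(c(0xf0, 0x9f, 0x98, 0x8e)))
# b: the single code point U+E00E, which intToUtf8 marks as UTF-8
b <- intToUtf8(0xE00E)

Encoding(a)    # "unknown"
Encoding(b)    # "UTF-8"
charToRaw(b)   # ee 80 8e, the bytes behind <U+E00E>

# Byte-wise search: compares raw bytes and ignores any declared encoding.
grepl("\x9f", a, fixed = TRUE, useBytes = TRUE)

# Character-wise search: works because b is valid, marked UTF-8.
grepl("\ue00e", b, fixed = TRUE)

# Declaring the encoding of a lets it be searched by character as well.
Encoding(a) <- "UTF-8"
grepl(intToUtf8(0x1F60E), a, fixed = TRUE)
```

Checking Encoding() and charToRaw() on the real main$text[[4]] and main$text[[6]] should show whether this is what distinguishes them.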
The maps I have are also different. The one for the bytes case works well, but the other one does not, since I am unable to create the "\ue00e" required for the search from the code stored in the map. Here is a sample of the other map, corresponding to the Softbank <U+E238>.
emmm[11]
[1] "E238"
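One approach I could try for this second map is to turn each hex code into its character with strtoi and intToUtf8. This is only a sketch: codes is a hypothetical stand-in for my emmm vector, txt is made-up text, and the "<U+E00E>" tag format is an arbitrary choice.

```r
# Hypothetical stand-in for the map: hex code points stored as strings.
codes <- c("E238", "E00E")

# Parse each hex string as an integer, then turn it into a
# one-character UTF-8 string.
chars <- intToUtf8(strtoi(codes, base = 16L), multiple = TRUE)

# Made-up text containing the character U+E00E.
txt <- paste0("hello ", chars[2])

# Replace every mapped character with a plain ASCII tag.
for (i in seq_along(codes)) {
  txt <- gsub(chars[i], paste0("<U+", codes[i], ">"), txt, fixed = TRUE)
}
txt
```

Using fixed = TRUE keeps the search a literal match, so none of the characters are treated as regular expression syntax.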