
I am working with text that includes emoticons. I need to be able to find these and replace them with tags that can be analysed. How can I do this?

> main$text[[4]]
[1] "Spread d wrd\xf0\u009f\u0098\u008e"
> grepl("\xf0", main$text[[4]])
[1] FALSE

I tried the above. Why did it not work? I also tried converting to ASCII with iconv; the resulting byte encoding could then be searched with grepl.

> abc<-iconv(main$text[[4]], "UTF-8", "ASCII", "byte")
> abc
[1] "Spread d wrd<f0><9f><98><8e>"
> grepl("<f0>", abc)
[1] TRUE

I really do not understand what I did here and what happened. I also do not understand how the above conversion introduced \n characters into the text.

I also did not know how to encode these once they were searchable. I found a list here, but it fell short (for example, "U+E00E" - <ee><80><8e> was not in the list). Is there a comprehensive list for such a mapping?

ADDENDUM

After a lot of trial and error, here is what I realised. There are two kinds of encodings for the emojis in the data. One is in the form of bytes, which is searchable with grepl("\x9f", ..., useBytes = T), like main$text[[4]]. The other (main$text[[6]]) is searchable as the Unicode character without useBytes = T, i.e. grepl("\ue00e", ...). Even the way they are displayed in View() and when printed on the console differs. I am absolutely confused as to what is going on here.

 main$text[[4]]
[1] "Spread d wrd\xf0\u009f\u0098\u008e"
 main[4,]
            timestamp fromMe              remoteResource remoteResourceDisplayName type
b 2014-08-30 02:58:58  FALSE 112233@s.whatsapp.net                ABC text
                                      text   date
b Spread d wrd<f0><U+009F><U+0098><U+008E> 307114
 main$text[[6]]
[1] ""
 main[6,]
            timestamp fromMe              remoteResource remoteResourceDisplayName type     text
b 2014-08-30 02:59:17  FALSE 12345@s.whatsapp.net           XYZ text <U+E00E>
    date
b 307114
 grepl("\ue00e", main$text[[6]])
[1] TRUE
 grepl("<U+E00E>", main$text[[6]])
[1] FALSE
 grepl("\u009f", main$text[[4]])
[1] FALSE
 grepl("\x9f", main$text[[4]])
[1] FALSE
 grepl("\x9f", main$text[[4]], fixed=T)
[1] FALSE
 grepl("\x9f", main$text[[4]], useBytes=T)
[1] TRUE

The maps I have are also different. The one for the bytes case works well. But the other one does not, since I am unable to create the "\ue00e" required to search. Here is a sample of the other map, corresponding to the Softbank <U+E238>.

 emmm[11]
[1] "E238"
stochastic13

1 Answer

Searching for a single byte of a multi-byte UTF-8 encoded character only works if done with useBytes = TRUE. The fact that "\xf0" here is a part of a multi-byte character is obscured by the less-than-perfect Unicode support of R on Windows (used in the original example, I presume). How to match by bytes:

foo <- "\xf0\x9f\x98\x8e" # U+1F60E SMILING FACE WITH SUNGLASSES
Encoding(foo) <- "UTF-8"
grepl("\xf0", foo, useBytes = TRUE)

I don't see much use for matching one byte, though. Searching for the whole character would then be:

grepl(foo, paste0("Smiley: ", foo, " and more"), useBytes = TRUE)

Valid ASCII codes correspond to integers 0–127. The iconv() conversion to ASCII in the example replaces any invalid byte 0xYZ (corresponding to integers 128–255) with the literal text <yz> where y and z are hexadecimal digits. As far as I can see, it should not introduce any newlines ("\n").
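A minimal sketch of that substitution behaviour (the example string is mine, not from the original data):

```r
# Hypothetical example: "caf\xc3\xa9" contains the UTF-8 byte sequence for "café".
x <- "caf\xc3\xa9"
# Bytes 0xc3 and 0xa9 are not valid ASCII, so sub = "byte" renders each
# offending byte as a literal "<yz>" hex pair:
iconv(x, "UTF-8", "ASCII", sub = "byte")
# [1] "caf<c3><a9>"
```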

Using the character list linked to in the question, here is some example code which performs one kind of "emoji tagging" to input strings, namely replacing the emoji with its (slightly formatted) name.

emoji_table <- read.csv2("https://github.com/today-is-a-good-day/Emoticons/raw/master/emDict.csv",
                         stringsAsFactors = FALSE)

emoji_names <- emoji_table[, 1]
text_bytes_to_raw <- function(x) {
    # Locate each literal "\x" escape in the text and convert the
    # two hexadecimal digits that follow it to a raw byte
    loc <- gregexpr("\\x", x, fixed = TRUE)[[1]] + 2
    as.raw(paste0("0x", substring(x, loc, loc + 1)))
}
emoji_raw <- lapply(emoji_table[, 3], text_bytes_to_raw)
emoji_utf8 <- vapply(emoji_raw, rawToChar, "")
Encoding(emoji_utf8) <- "UTF-8"

gsub_many <- function(x, patterns, replacements) {
    # Sequentially apply each pattern -> replacement substitution to x
    stopifnot(length(patterns) == length(replacements))
    x2 <- x
    for (k in seq_along(patterns)) {
        x2 <- gsub(patterns[k], replacements[k], x2, useBytes = TRUE)
    }
    x2
}

tag_emojis <- function(x, codes, names) {
    gsub_many(x, codes, paste0("<", gsub("[[:space:]]+", "_", names), ">"))
}

each_tagged <- tag_emojis(emoji_utf8, emoji_utf8, emoji_names)

all_in_one <- tag_emojis(paste0(emoji_utf8, collapse = ""),
                         emoji_utf8, emoji_names)

stopifnot(identical(paste0(each_tagged, collapse = ""), all_in_one))

As to why U+E00E is not on that emoji list, I don't think it should be. This code point is in a Private Use Area, where character mappings are not standardized. For comprehensive Unicode character lists, you cannot find a better authority than the Unicode Consortium, e.g. Unicode Emoji. Additionally, see convert utf8 code point strings like <U+0161> to utf8.
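A quick membership check can be done in base R (is_bmp_pua is a helper name I've made up); the BMP Private Use Area spans U+E000 through U+F8FF:

```r
# Test whether a single character's code point lies in the BMP
# Private Use Area (U+E000..U+F8FF)
is_bmp_pua <- function(ch) {
    cp <- utf8ToInt(ch)
    cp >= 0xE000 && cp <= 0xF8FF
}
is_bmp_pua("\ue00e")  # TRUE: private use, so no standard emoji mapping exists
is_bmp_pua("\u0161")  # FALSE: a standardized character (š)
```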

Edit after addendum

When there is a string of exactly four hexadecimal digits representing a Unicode code point (let's say "E238"), the following code will convert the string to the corresponding UTF-8 representation, the occurrence of which can be checked with the grep() family of functions. This answers the question of how to "automatically" generate the character that can be manually created by typing "\uE238".

library(stringi)

hex4_to_utf8 <- function(x) {
    stopifnot(grepl("^[[:xdigit:]]{4}$", x))
    stringi::stri_enc_toutf8(stringi::stri_unescape_unicode(paste0("\\u", x)))
}

foo <- "E238"
foo_utf8 <- hex4_to_utf8(foo)

The value of the useBytes option should not matter in the following grep() call. In the previous code example, I used useBytes = TRUE as a precaution, as I'm not sure how well R on Windows handles Unicode code points U+10000 and larger (five or six digits). Clearly it cannot properly print such code points (as shown by the U+1F60E example), and input with the \U + 8 digits method is not possible.
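To illustrate that claim for a BMP code point such as U+E238, a small sketch (the needle/haystack names are illustrative):

```r
needle <- "\ue238"                              # the character built from the hex string
haystack <- paste0("before ", needle, " after")
# Both calls match: pattern and text consist of the same UTF-8 bytes either way
grepl(needle, haystack, fixed = TRUE)                   # TRUE
grepl(needle, haystack, fixed = TRUE, useBytes = TRUE)  # TRUE
```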

The example in the question shows that R (on Windows) may print Unicode characters with the <U+E238> notation rather than as \ue238. The reason seems to be format(), also used in print.data.frame(). For example (R for Windows running on Wine):

> format("\ue238")
[1] "<U+E238>"

When tested in an 8-bit locale on Linux, the same notation is already used by the default print method. One must note that in this case, this is only a printed representation, which is different from how the character is originally stored.

mvkorpel
  • This solves the issue. I realised that some of the emoticons were encoded as their softbank encoding ("") and the rest as a series of bytes corresponding to the normal UTF-8 encodings ("\xf0\u009f\u0098"). Why would this be? – stochastic13 Jan 20 '17 at 14:24
  • An additional issue: I have a map of all the softbank encodings ("e00e", "e051"), but to make this searchable, I need to use `grepl("\ue00e",...)`. So how do I add the "\" to the string? I tried `paste` and `gsub` with `fixed = T`, but with no effect. – stochastic13 Jan 20 '17 at 14:27
  • If you have a literal string `abc <- "something with "` (upper case letters) and a "map" with the string `foo <- "e00e"` (lower case letters), then you would look for the character corresponding to `foo` by doing something like `grepl(paste0(""), abc, fixed = TRUE)`. But it's difficult to guess the exact problem you have. Maybe you would like to expand your question with a more detailed example. – mvkorpel Jan 20 '17 at 16:07
  • But the character there is not searchable by `grepl("", ...)`. I have added the relevant data to the question. The encodings in the dataset is extremely confusing. – stochastic13 Jan 22 '17 at 15:25
  • Please have a look at the addendum in the question. I'll be really grateful if you could provide any light on the core issue here. Sorry for the clumsy beginner-grade questions, but this is pretty confusing. – stochastic13 Jan 22 '17 at 15:38
  • I read the addendum and edited my answer accordingly. I hope this clears the confusion. – mvkorpel Jan 23 '17 at 10:40
  • Thanks. It clears up the confusion considerably. One last followup question. So the reason these apparently different encodings for the emojis exist, is an artifact of R's way of handling unicode code points rather than an artifact of my original dataset? – stochastic13 Jan 23 '17 at 14:47
  • One aspect is that there are the standardized and non-standard (private use area) emoji code points, but that's probably not what you meant. However, I think the answer to your question is yes, as R on Windows has incomplete support for (printing of) [supplementary code points](http://unicode.org/glossary/#supplementary_code_point). It seems that you have valid UTF-8 in your dataset. – mvkorpel Jan 24 '17 at 08:16