How to remove Unicode representations of Emojis in strings using regexp in R?

Question

I am working with data from the Twitter API and wherever users had included Emojis in their name field, they have been translated to Unicode string representations in my dataframe. The structure of my data is somewhat like this:

user_profiles <- as.data.frame(c("Susanne Bold", "Julian K. Peard <U+0001F41C>", 
"<U+0001F30A> Alexander K Miller <U+0001F30A>", "John Mason"))
colnames(user_profiles) <- "name"

which looks like this:

                                          name
1                                 Susanne Bold
2                 Julian K. Peard <U+0001F41C>
3 <U+0001F30A> Alexander K Miller <U+0001F30A>
4                                   John Mason

I am now trying to isolate the actual name into a new column using regexp:

user_profiles <- user_profiles %>%
  mutate(clean_name = str_remove_all(name, "\\<U\\+[[:alnum:]]\\>[ ]?"))

But this expression (1) seems rather complicated and (2) doesn't match the pattern. I have tried multiple variations of the regexp already. Oddly enough, grepl is able to detect the pattern with this version (which str_remove_all doesn't accept, since it is missing a closing bracket):

grepl("\\<U\\+[[:alnum:]\\>[ ]?", user_profiles$name)
[1] FALSE  TRUE  TRUE FALSE
# note that the second bracket around alnum is left opened

Can somebody explain this or offer an easier solution?

Thanks a lot!

Basically, the first str_remove_all does not work because you missed the `+` quantifier after the alphanumeric pattern. ``mutate(clean_name = str_remove_all(name, "<U\\+[[:alnum:]]+>\\s*"))`` works. However, I think you need `"<U\\+[[:xdigit:]]+>\\s*"`, which matches hex chars only, not any alphanumeric chars. - Wiktor Stribiżew, Apr 09 '22 at 18:24
In the second call, `grepl("\\<U\\+[[:alnum:]\\>[ ]?", user_profiles$name)`, the regex is malformed: it does not match what you wanted it to match. - Wiktor Stribiżew, Apr 09 '22 at 18:36

score 2 · Answer 1 · answered Apr 09 '22 at 18:28

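Here is an alternative way to do it:

library(dplyr)
library(tidyr)
library(stringr)  # needed for str_detect()

user_profiles %>% 
  separate_rows(name, sep = '\\<|\\>') %>%   # split each name at < and >
  filter(!str_detect(name, '^U\\+')) %>%     # drop the U+XXXX pieces (escaped +: a literal plus)
  mutate(name = na_if(name, "")) %>%         # turn empty pieces into NA ...
  na.omit()                                   # ... and drop them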

  name                  
  <chr>                 
1 "Susanne Bold"        
2 "Julian K. Peard "    
3 " Alexander K Miller "
4 "John Mason"

Wiktor Stribiżew · Accepted Answer · 2022-04-09T18:50:24.373

The first str_remove_all does not work because you missed the + quantifier after the alphanumeric pattern. Also, note that after <U+, only hex chars are used, so instead of [:alnum:], you can use a more precise [:xdigit:] POSIX character class.

You can use

user_profiles <- user_profiles %>%
  mutate(clean_name = str_remove_all(name, "<U\\+[[:xdigit:]]+>\\s*"))

Do not escape < and >: unescaped, they are ordinary literal characters, while in TRE regex (used by base R regex functions without perl=TRUE) the escaped \< and \> are word boundaries.

Pattern details

<U - <U string
\+ - a literal +
[[:xdigit:]]+ - one or more hex chars
> - a > char
\s* - zero or more whitespaces.
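
As a quick check, here is a minimal base R sketch of the pattern above. It uses gsub instead of str_remove_all so that no packages are needed, and the vector x is just the question's sample names retyped (an assumption about how the data is stored). Note that \s* only consumes whitespace after each token, so a token at the end of a name leaves a trailing space behind; the trimws call is an addition for illustration, not part of the answer.

```r
# Sample names retyped from the question (assumed to be plain character strings)
x <- c("Susanne Bold",
       "Julian K. Peard <U+0001F41C>",
       "<U+0001F30A> Alexander K Miller <U+0001F30A>",
       "John Mason")

# Without the + quantifier, [[:xdigit:]] must match exactly one hex digit
# between "<U+" and ">", so the 8-digit tokens are never matched
no_quantifier <- grepl("<U\\+[[:xdigit:]]>", x)

# With +, the whole token (and any whitespace after it) is removed
removed <- gsub("<U\\+[[:xdigit:]]+>\\s*", "", x)

# Whitespace in front of a trailing token is left behind, so trim it
clean <- trimws(removed)

print(no_quantifier)
print(clean)
```

If the leftover spaces matter in the dplyr pipeline, wrapping the str_remove_all call in str_trim should have the same effect as trimws here.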

Why does the grepl regex work? This is interesting: because you omitted the closing ] of the [[:alnum:]] bracket expression, the regex is "spoilt" and gets parsed like this:

\<U\+ - a word boundary (in TRE, \< matches a left-hand word boundary) and then U+ string
[[:alnum:]\>[ ]? - this is an optional bracket expression that matches one or zero chars from the set:
- [:alnum:] - any alphanumeric char
- \ - a backslash (yes, because in the TRE regex flavor, regex escape sequences inside bracket expressions are treated literally)
- > - a > char
- [ - a [ char
- (space) - a space.

So, it matches U+0 in <U+0001F41C>, for example (the \< word boundary is zero-width, so the < itself is not part of the match).
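
To see what the malformed pattern actually picks up, a small base R sketch along these lines can help. It assumes the default TRE engine (no perl=TRUE), and x is again the question's sample names retyped. regexpr and regmatches extract the first match per string, which makes it visible that the pattern only grabs the U+ plus at most one following character rather than the whole token.

```r
# Sample names retyped from the question
x <- c("Susanne Bold",
       "Julian K. Peard <U+0001F41C>",
       "<U+0001F30A> Alexander K Miller <U+0001F30A>",
       "John Mason")

# The pattern from the question, with the closing ] of [[:alnum:]] missing
malformed <- "\\<U\\+[[:alnum:]\\>[ ]?"

# Which strings contain a match (same call as in the question)
hits <- grepl(malformed, x)

# Extract the first match from each string that has one
first_match <- regmatches(x, regexpr(malformed, x))

print(hits)         # FALSE TRUE TRUE FALSE, as reported in the question
print(first_match)  # "U+0" "U+0": U+ plus one char from the bracket set
```

So grepl returns TRUE only because a fragment of each token matches, which is why the same pattern would be of little use for removing the tokens.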

score 2 · Answer 3 · answered Apr 09 '22 at 18:31

We can add a one-or-more quantifier (+) after [[:alnum:]]:

library(dplyr)
library(stringr)
user_profiles <- user_profiles %>%
  mutate(clean_name = str_remove_all(name, "\\s*\\<U\\+[[:alnum:]]+\\>\\s*"))

-output

user_profiles
                                          name         clean_name
1                                 Susanne Bold       Susanne Bold
2                 Julian K. Peard <U+0001F41C>    Julian K. Peard
3 <U+0001F30A> Alexander K Miller <U+0001F30A> Alexander K Miller
4                                   John Mason         John Mason
