I am cleaning data that contains both Cantonese text and emoji. In UTF-8 format, some of the emoji are converted into Unicode tokens that I have to clean away. These tokens look like this:

攬炒巴我愛你<U+2764><U+FE0F><U+0001F618><U+0001F618><U+0001F618><U+0001F618>">

As you can see, these tokens have different lengths, which breaks my usual way of cleaning them:

basedf$post_cleaned = str_replace(basedf$post_cleaned,"smile_.*:","")
basedf$post_cleaned = str_replace(basedf$post_cleaned,"\\[quote\\].*\\[\\/quote\\]","")
basedf$post_cleaned = str_replace(basedf$post_cleaned,"\\[.+\\].*\\[\\/.+\\]","")
basedf$post_cleaned = gsub("[^[:alnum:][:blank:]?&/\\-]", "", basedf$post_cleaned)
basedf$post_cleaned = gsub("/", "", basedf$post_cleaned)
basedf$post_cleaned = gsub("\\\\", "", basedf$post_cleaned)
basedf$post_cleaned = gsub("U....", "", basedf$post_cleaned)
basedf$post_cleaned = gsub("[1234567890]", "", basedf$post_cleaned)
basedf$post_cleaned = gsub("[.]", "", basedf$post_cleaned)
basedf$post_cleaned = rm_url(basedf$post_cleaned)
basedf$post_cleaned = str_trim(basedf$post_cleaned)

As you can see, I clean the Unicode tokens at the line `basedf$post_cleaned = gsub("U....", "", basedf$post_cleaned)`, after all the `<`, `>` and `+` characters have already been stripped.

However, it leaves two to four letters behind when it encounters an 8-digit or 10-digit token, and that corrupts my data. For example, a token like <U+0001F6B8> (made up as an example) can leave FB behind, which I would then interpret as Facebook in my data.

Do you have any suggestions for cleaning Unicode tokens of different lengths? This problem has bugged me for two days and has badly slowed my progress.

Sorry for my bad English and coding. I am a newbie in coding.

Thank you.

  • Dirty way: add a `basedf$post_cleaned = gsub("U........", "", basedf$post_cleaned)` before the substitution that causes you problems. But you could just use `"U[0123456789]*"`. You have a lot of options. [Note: your problem is not about Unicode, but about regular expressions] – Giacomo Catenazzi Sep 16 '20 at 09:03
  • But the Unicode token is made up of both letters and numbers, so should I use a combination of the two? Furthermore, I fear that if I just use `gsub("U.........") `, it will also delete the data after a shorter token when it substitutes one. – Po Sang Yu Sep 16 '20 at 10:47
  • Right, you can add `ABCDEFabcdef`. The version with 8 dots should be used before the version with 4 dots, but you need both. But you are right, something still seems wrong: if you have the word `Unicode` in your text, it will be replaced. – Giacomo Catenazzi Sep 16 '20 at 12:08
  • [Take a look here](https://stackoverflow.com/q/62126926/3439404) – JosefZ Sep 16 '20 at 17:17
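
The regex idea from the comments can be sketched as follows. This is only a minimal illustration, assuming every token has the form `<U+` + hex digits + `>` as in the sample in the question, and that it runs before the line that strips `<`, `>` and `+`. The vector `x` and the name `cleaned` are made up for the example.

```r
# Made-up sample rows in the same shape as the data in the question.
x <- c("攬炒巴我愛你<U+2764><U+FE0F><U+0001F618>",
       "Unicode FB <U+0001F6B8> ok")

# Match a whole token: literal "<U+", one or more hex digits, then ">".
# Anchoring on the angle brackets means tokens of any length are removed
# in full, and ordinary words such as "Unicode" or "FB" are left alone.
cleaned <- gsub("<U\\+[0-9A-Fa-f]+>", "", x)

print(cleaned)
```

Because the pattern consumes everything up to the closing `>`, no stray hex letters are left behind, whatever the token length.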

1 Answer

Thanks to JosefZ, this is my answer after a few trials:

basedf$title_cleaned = basedf$title %>% stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% stri_unescape_unicode()
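
One caveat with the line above, as far as I can tell: the `\u` escape takes exactly four hex digits, so an 8-digit token such as `<U+0001F618>` becomes `\u0001F618`, which decodes to the character U+0001 followed by the literal text `F618`. A minimal sketch that handles both lengths is below. It assumes tokens always have either 4 or 8 hex digits, as in the sample in the question; the string `x` and the intermediate names are made up for illustration.

```r
library(stringi)

# Made-up sample with 4-digit and 8-digit tokens, as in the question.
x <- "love<U+2764><U+FE0F><U+0001F618>"

# Pass 1: 8-digit tokens become \UXXXXXXXX (capital U takes eight hex digits).
step1 <- stri_replace_all_regex(x, "<U\\+([0-9A-Fa-f]{8})>", "\\\\U$1")

# Pass 2: the remaining 4-digit tokens become \uXXXX (lowercase u takes four).
step2 <- stri_replace_all_regex(step1, "<U\\+([0-9A-Fa-f]{4})>", "\\\\u$1")

# Turn the escape sequences into the actual characters.
decoded <- stri_unescape_unicode(step2)

print(decoded)
```

If the goal is to drop the emoji rather than restore them, replacing each token with `""` is enough and the unescape step is not needed. Note also that `stri_unescape_unicode()` interprets any other backslash sequences already present in the text, so this may not suit data that contains literal backslashes.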