I am working on cleaning data that contains both Cantonese and emoji. When saved in UTF-8, some characters get converted into Unicode escape sequences that have to be cleaned away. These escapes look like:
攬炒巴我愛你<U+2764><U+FE0F><U+0001F618><U+0001F618><U+0001F618><U+0001F618>">
As you can see, the escapes have different lengths, which breaks my usual way of cleaning them:
library(stringr)    # str_replace, str_trim
library(qdapRegex)  # rm_url

basedf$post_cleaned = str_replace(basedf$post_cleaned,"smile_.*:","")
basedf$post_cleaned = str_replace(basedf$post_cleaned,"\\[quote\\].*\\[\\/quote\\]","")
basedf$post_cleaned = str_replace(basedf$post_cleaned,"\\[.+\\].*\\[\\/.+\\]","")
basedf$post_cleaned = gsub("[^[:alnum:][:blank:]?&/\\-]", "", basedf$post_cleaned)
basedf$post_cleaned = gsub("/", "", basedf$post_cleaned)
basedf$post_cleaned = gsub("\\\\", "", basedf$post_cleaned)
basedf$post_cleaned = gsub("U....", "", basedf$post_cleaned)  # meant to remove the Unicode escapes
basedf$post_cleaned = gsub("[1234567890]", "", basedf$post_cleaned)
basedf$post_cleaned = gsub("[.]", "", basedf$post_cleaned)
basedf$post_cleaned = rm_url(basedf$post_cleaned)
basedf$post_cleaned = str_trim(basedf$post_cleaned)
As you can see, I remove the Unicode escapes with the line

basedf$post_cleaned = gsub("U....", "", basedf$post_cleaned)

after the <, > and + characters have already been stripped.
However, when it encounters a longer escape (8 hex digits instead of 4), it leaves two to four characters behind, and that messes up my data. For example, an escape like <U+0001F6B8> (made up as an example) becomes U0001F6B8 after the brackets are stripped; "U...." removes only U0001, and once the digit-removal line runs, the leftover F6B8 becomes FB, which I would then wrongly interpret as Facebook in my data.
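To make the leftover concrete, here is a minimal reproduction using my made-up example from above:

```r
x <- "U0001F6B8"                  # what remains after <, > and + are stripped
x <- gsub("U....", "", x)         # removes only "U0001", leaving "F6B8"
x <- gsub("[1234567890]", "", x)  # the digit-removal line then leaves "FB"
x
# "FB"
```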
Do you have any suggestions for cleaning Unicode escapes of different lengths? This problem has bugged me for two days and heavily dragged down my progress.
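In case it clarifies what I am looking for: I suspect the fix is to delete the whole <U+...> token in one pass, before the angle brackets and plus signs are stripped, using a quantifier for the variable number of hex digits, but I am not sure this is the right approach (the sample string is made up):

```r
x <- "攬炒巴我愛你<U+2764><U+FE0F><U+0001F618>"
# "<U+" followed by 4 to 8 hex digits and ">" covers both short and long escapes
x <- gsub("<U\\+[0-9A-Fa-f]{4,8}>", "", x)
x
# "攬炒巴我愛你"
```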
Sorry for my bad English and coding. I am a newbie at coding.
Thank you.