
I have a large data set. After cleaning it up, I found that one of the fields contains values like

"My son is turning into a monster \xf0\u009f\u0098\u0092"

I cannot even create this value as a string literal, because R raises the error shown below

a <- c('My son is turning into a monster \xf0\u009f\u0098\u0092')

Error: mixing Unicode and octal/hex escapes in a string is not allowed

Now suppose I have this value in a variable and want to remove all non-ASCII characters, like this

library(stringi)
b <- stri_trans_general(a, "latin-ascii")

and then want to convert the text to lower case

tolower(b)

I get the following error

Error in tolower(b) : invalid input 'My son is turning into a monster 😒' in 'utf8towcs'

Can someone please help me understand the issue?

Dhrutika Rathod
Vineet

2 Answers


You can use iconv to remove non-ASCII characters:

a <- c('My son is turning into a monster \xf0\x9f\x98\x92')
a
[1] "My son is turning into a monster 😒"
iconv(a,to="ASCII",sub="")
[1] "My son is turning into a monster "
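The four escapes in the question (f0, 9f, 98, 92) are the UTF-8 bytes of a single code point, U+1F612, which is the 😒 shown in the error message. A minimal sketch of how the whole pipeline from the question could look, assuming the data really is UTF-8: write the literal with one kind of escape only, state the source encoding explicitly in iconv, and lower-case afterwards. The variable names are illustrative only.

```r
# Two consistent spellings of the same text. Mixing \x and \u escapes in
# one literal is what the parser rejects.
a_bytes <- "My son is turning into a monster \xf0\x9f\x98\x92"  # UTF-8 bytes as hex escapes
Encoding(a_bytes) <- "UTF-8"                                     # declare how to read those bytes
a_unicode <- "My son is turning into a monster \U0001F612"       # a single Unicode escape

# Drop every byte that cannot be converted to ASCII, then lower-case.
# Assumption: the input is valid UTF-8, hence from = "UTF-8".
clean <- iconv(a_bytes, from = "UTF-8", to = "ASCII", sub = "")
lowered <- tolower(clean)
print(lowered)
```

Because the result contains only ASCII, tolower no longer has to decode any multibyte characters, which is where the 'utf8towcs' error came from.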
James
  • Thanks, it's working for me. I'm not sure how my text became a mixture of octal escapes and normal text. – Vineet Dec 19 '17 at 14:14

To remove all non-ASCII characters you can use a regex. [\x00-\x7F] is the set of all ASCII characters, so its negation, [^\x00-\x7F], matches every non-ASCII character, and we can replace every match with nothing. However, R doesn't accept \x00 in a string because it's the null character, so I had to change the range to [\x01-\x7F]

a <- c('My son is turning into a monster \u009f\u0098\u0092')
#> [1] "My son is turning into a monster \u009f\u0098\u0092"
tolower(gsub('[^\x01-\x7F]+','',a))
#> [1] "my son is turning into a monster "

or, with a hex escape

a <- c('My son is turning into a monster \xf0')
#> [1] "My son is turning into a monster ð"
tolower(gsub('[^\x01-\x7F]+','',a))
#> [1] "my son is turning into a monster "
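Depending on the locale, regex functions may refuse a string whose bytes are not valid in the current encoding (a lone \xf0 is not valid UTF-8, for example). As a fallback, here is a sketch of a different technique, a byte-level filter rather than a regex: it keeps only bytes below 0x80 and never decodes the string. The helper name strip_non_ascii is hypothetical, not a function from base R or any package.

```r
# Hypothetical helper: keep only ASCII bytes (0x01-0x7F) of each element.
# It works on raw bytes, so it does not depend on the string's encoding.
strip_non_ascii <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)   # pass missing values through
    bytes <- charToRaw(s)                 # raw bytes of the string
    keep <- as.integer(bytes) < 128L      # TRUE for ASCII bytes only
    rawToChar(bytes[keep])
  }, character(1), USE.NAMES = FALSE)
}

a <- c("My son is turning into a monster \xf0\x9f\x98\x92",
       "My son is turning into a monster \xf0")
result <- tolower(strip_non_ascii(a))
print(result)
```

Every byte of a multibyte UTF-8 character is 0x80 or above, so dropping those bytes removes whole characters and cannot leave a partial sequence behind.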
Mark
  • Thanks, it's working for me. I'm not sure how my text became a mix of octal escapes and other characters. – Vineet Dec 19 '17 at 14:13
  • It looks like you have comments from the web - it's probably a mix of emoji and emoji-like characters that are poorly converted by R. – Mark Dec 19 '17 at 14:14
  • True, it's from Twitter, so it's kind of a mixture. – Vineet Dec 19 '17 at 14:17