I'm having trouble handling escaped unicode characters in R, specifically those encountered when grabbing information from the MediaWiki API. I would find a JSON string like
{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}
This should be perfectly valid, but when I read it in through fromJSON() I get:
snip...
[1] "Banach\023Tarski paradox"
Initially I thought this was just a problem with RJSONIO, but I encounter similar problems with scan() and readLines(). My guess is that I am missing something very basic.
I can't actually give a completely reproducible example using only R, because if I send "em\u2013dash" to a file through write() (or some equivalent function), R automatically converts the escape to the actual dash character (U+2013, which is strictly an en dash rather than an em dash). So here goes. Create a text file named test1 with the following:
"em\u2013dash" "em–dash" " em \u2013 dash"
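That said, it may be possible to create an equivalent file from within R by doubling the backslash in the string literal, so that R writes a literal backslash instead of interpreting the escape. A minimal sketch, assuming a temporary file in place of ~/R/test1 and only the first of the three items (the variable names are just for illustration):

```r
# Assumption: a temp file stands in for ~/R/test1.
path <- tempfile()

# "\\u2013" in R source is a backslash followed by u2013, not a dash.
writeLines('"em\\u2013dash"', path)

# Read the line back and look at the raw characters.
line <- readLines(path, warn = FALSE)
cat(line, "\n")          # cat() writes the characters without print()'s escaping
n_chars <- nchar(line)   # counts the two quotes and the single backslash
```

Here cat() is used rather than print() because print() escapes backslashes and quotes when it displays a string.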
Then load up R (for whatever the file path is):
> scan( file = "~/R/test1", what = "character", encoding = "UTF-8")
Read 3 items
[1] "em\\u2013dash" "em–dash" " em \\u2013 dash"
> readLines("~/R/test1", warn = FALSE, encoding = "UTF-8")
[1] "\"em\\u2013dash\" \"em–dash\" \" em \\u2013 dash\""
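One thing I can check is whether the doubled backslash is really in the data or only in how R prints it. A small sketch, assuming x holds the first item that scan() returned (typed here as an R literal, so the backslash is doubled in the source):

```r
# Assumption: x is the first item from scan(), written as an R literal.
x <- "em\\u2013dash"

print(x)        # print() escapes the backslash for display
cat(x, "\n")    # cat() writes the characters as stored

# Split into single characters to count the backslashes that are really there.
chars <- strsplit(x, "")[[1]]
n_backslash <- sum(chars == "\\")
n_total <- length(chars)
```

n_backslash then says how many backslashes the string actually contains, independent of how print() shows it.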
The added escape character seems to be what causes my problems with fromJSON(). I could just strip the extra backslashes out, but I'd probably break something else in the process, and I imagine there is an easier solution. Thanks.
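As a stopgap, the \uXXXX escapes could be decoded by hand after reading. A rough sketch in base R: decode_u_escapes is a hypothetical helper of my own, not part of RJSONIO, and it assumes plain four-digit escapes only (no surrogate pairs, no escaped backslashes in front of the u):

```r
# Hypothetical helper: replace each literal \uXXXX with the character it names.
decode_u_escapes <- function(x) {
  # Regex for a literal backslash, "u", then exactly four hex digits.
  m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    vapply(esc, function(e) {
      # Drop the leading "\u", parse the hex digits, convert to a character.
      intToUtf8(strtoi(substring(e, 3), 16L))
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

# The title from the MediaWiki response, with the escape still literal.
decoded <- decode_u_escapes("Banach\\u2013Tarski paradox")
cat(decoded, "\n")
```

This only touches sequences that match the pattern, so strings without escapes pass through unchanged. It is not a substitute for a proper JSON parser.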
Here's the session info:
R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C/en_US.UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RJSONIO_0.98-0
loaded via a namespace (and not attached):
[1] tools_2.14.1