I am using R's RJSONIO to read JSON from a file. The JSON contains Unicode escape sequences, which are read incorrectly.

The code works when the JSON is passed as a string, as shown by the author of the R package in the Stack Overflow question "How to correctly deal with escaped Unicode Characters in R e.g. the em dash (—)".

However, when the JSON is read from a file, it does not produce the correct Unicode representation, as seen below:

fromJSON(content="~/MTS/temp")
$query
$query$categorymembers
$query$categorymembers[[1]]
$query$categorymembers[[1]]$ns
[1] 0
$query$categorymembers[[1]]$title
[1] "Banach\023Tarski paradox"

Where ~/MTS/temp contains:

{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}
  • What OS and R version are you running? I tried on Windows with R 3.1.1 with `fromJSON(content='{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}')` and it returned `[1] "Banach–Tarski paradox"` just fine. Are you saying the file literally has an `\u` in it? How did you create such a JSON file? – MrFlick Jun 01 '15 at 18:33
  • When you copy-paste this JSON ({"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}) with the \u2013 into a file and read it from that file using fromJSON(content=), do you get the \023 or do you get the em-dash? I am using Mac OS X 10.9 and R version 3.2. – Lakshmi Ramachandran Jun 01 '15 at 20:27
  • Well, having `\u` in a JSON file isn't valid (if you want the subsequent number to be considered as a Unicode character). How are you creating an invalid JSON file in the first place? – MrFlick Jun 01 '15 at 20:29
  • Well according to http://json.org/ the json can contain \u. And the library 'rjson' does the right thing reading the same file containing json with \u in it. – Lakshmi Ramachandran Jun 01 '15 at 20:32
  • If you feel the `RJSONIO` library has a parsing bug, you can contact the maintainer (`maintainer("RJSONIO")`) and report it. If the `rjson` library works the way you want, then use that one. – MrFlick Jun 01 '15 at 20:38
  • Thanks! I did write to the author, Duncan Temple Lang, but haven't heard from him. :) I could use rjson, but the code I am writing is part of a larger application, and RJSONIO rather than rjson was chosen for a specific reason. However, it does have this bug, which I was hoping someone here might have encountered and figured out how to solve. – Lakshmi Ramachandran Jun 01 '15 at 20:44

1 Answer

An alternative package called jsonlite works the way you would expect on my system (OS X), and I verified that RJSONIO does not. This is after I saved your JSON snippet to a file called utext.txt:

file.show("utext.txt")
## {"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}
jsonlite::fromJSON("~/temp/utext.txt")
## $query
## $query$categorymembers
##   ns                 title
## 1  0 Banach–Tarski paradox

Here is another solution that is a bit more platform-dependent: convert your Unicode-escaped files to real UTF-8 characters before reading them. (I do not know whether your platform has the native2ascii utility used below, but even for Windows you can probably find it.)

My system locale encoding is UTF-8 (the OS X standard), so when I run the command-line utility native2ascii with the -reverse option, the \u escapes are converted to UTF-8 characters, and I can then read the result into R, where my locale is set to en_GB.UTF-8.

From a Terminal/shell:

native2ascii -reverse ~/temp/utext.txt ~/temp/utextUTF8.txt

Then in R:

RJSONIO::fromJSON("~/temp/utextUTF8.txt")
## $query
## $query$categorymembers
## $query$categorymembers[[1]]
## $query$categorymembers[[1]]$ns
## [1] 0
## 
## $query$categorymembers[[1]]$title
## [1] "Banach–Tarski paradox"

Voilà, problem solved.
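If native2ascii is not available, the same pre-decoding step could probably be done inside R before the text is handed to RJSONIO. The following is only a minimal base-R sketch under stated assumptions: `decode_u_escapes` is a hypothetical helper name (it is not part of RJSONIO), the temporary file stands in for the real file, and the regular expression only handles four-digit escapes in the Basic Multilingual Plane. It deliberately leaves ASCII escapes (such as `\u0022`) and surrogate halves untouched so the JSON stays valid, and it does not recognise an escaped backslash followed by `u2013`.

```r
# Hypothetical helper (not part of RJSONIO): replace \uXXXX escapes in raw
# JSON text with the UTF-8 characters they stand for.
decode_u_escapes <- function(txt) {
  decode_one <- function(esc) {
    cp <- strtoi(substring(esc, 3), 16L)   # hex digits after "\u"
    # Leave ASCII escapes (e.g. \u0022, a quote) and surrogate halves alone,
    # so the result is still valid JSON.
    if (cp < 0x80 || (cp >= 0xD800 && cp <= 0xDFFF)) esc else intToUtf8(cp)
  }
  m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", txt)
  regmatches(txt, m) <- lapply(
    regmatches(txt, m),
    function(found) vapply(found, decode_one, character(1), USE.NAMES = FALSE)
  )
  txt
}

# Stand-in for the real file: a temp file holding the literal escape.
path <- tempfile(fileext = ".json")
writeLines(
  '{"query":{"categorymembers":[{"ns":0,"title":"Banach\\u2013Tarski paradox"}]}}',
  path
)

# Read the file as text and decode the escapes before parsing.
json_text <- decode_u_escapes(
  paste(readLines(path, encoding = "UTF-8"), collapse = "\n")
)

# Parse the decoded text as a string rather than as a file (needs RJSONIO).
if (requireNamespace("RJSONIO", quietly = TRUE)) {
  parsed <- RJSONIO::fromJSON(json_text, asText = TRUE)
  print(parsed$query$categorymembers[[1]]$title)
}
```

This keeps RJSONIO as the parser and only changes what it is given, which is the same idea as the native2ascii step above.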

Ken Benoit