I'm querying an API over HTTP and getting back JSON data containing the following:

... Dv\\u016fr Kr\\u00e1lov\\u00e9 nad Labem a okol\\u00ed 5\\u00a0km ...". 

This is what I see when I open the same request in Firefox and show the raw data, and also when I `println!` the output in Rust.

I would like Rust to interpret these escapes as the proper characters. I tried the following function, which I found by searching. It works partially but fails for some characters:

    pub fn normalize(json: &str) -> core::result::Result<String, Box<dyn Error>> {
        let replaced : Cow<'_, str> = regex_replace_all!(r#"\\u(.{4})"#, json, |_, num: &str| {
            let num: u32 = u32::from_str_radix(num, 16).unwrap();
            let c: char = std::char::from_u32(num).unwrap();
            c.to_string()
        });
        Ok(replaced.to_string())
    }

Result:

    Dvůr Králové nad Labem a okolí 5\u{a0}km

What's the proper way to handle such JSON data?

kangcz

  • Are you using a library such as serde-json to parse the JSON? If so, I would expect it to take care of decoding the JSON escaped string into a proper UTF-8 Rust string. Note that if you're seeing `\\` in Firefox, it could mean that the JSON has been badly encoded with duplicate escape characters. – SirDarius Jan 23 '22 at 15:42
  • The double backslashes could be an artifact of the debugger. If they are really doubled, fix the source of the JSON data. – Codo Jan 23 '22 at 15:49
  • I suspect this is an issue with how you're examining the data - not the data itself. In other words, you're looking at a `Debug` representation of the string, which escapes certain characters. [See the difference between `{}` and `{:?}` formatting of the same string](https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=86c9d0c252a2899df16522af9d458769). – trent Jan 23 '22 at 15:59
  • I suspect regexes are not enough for that. Take a look at [`rustc_lexer::unescape::unescape_str_or_byte_str()`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/unescape/fn.unescape_str_or_byte_str.html). – Chayim Friedman Jan 23 '22 at 20:59
  • Ah, you're right - serde parsing does solve it. I was trying to print the response as-is and was puzzled why it does not get interpreted correctly. – kangcz Jan 23 '22 at 21:36
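
The `Debug` versus `Display` difference raised in the comments can be reproduced with the standard library alone. The sketch below is only an illustration: the helper names `shown_with_display` and `shown_with_debug` are made up for this example, and the string is the decoded text from the question with the non-breaking space written as `\u{a0}`.

```rust
/// Format a string the way `println!("{}", s)` would.
/// (Hypothetical helper, named for this example only.)
fn shown_with_display(s: &str) -> String {
    format!("{}", s)
}

/// Format a string the way `println!("{:?}", s)` would.
/// (Hypothetical helper, named for this example only.)
fn shown_with_debug(s: &str) -> String {
    format!("{:?}", s)
}

fn main() {
    // Already-decoded text; U+00A0 is a non-breaking space.
    let s = "Dvůr Králové nad Labem a okolí 5\u{a0}km";

    // Display writes the characters themselves.
    println!("{}", shown_with_display(s));

    // Debug adds surrounding quotes and escapes some characters,
    // including the non-breaking space, so the data only looks undecoded.
    println!("{}", shown_with_debug(s));
}
```

If the `\u{a0}` only shows up with `{:?}` (or when printing a containing struct or `Result` with `{:?}`), the string itself is fine and nothing needs to be normalized.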

1 Answer

It appears you have a JSON-encoded string. A Rust string literal for the same data would look like this:

    let s = "Dv\u{016f}r Kr\u{00e1}lov\u{00e9} nad Labem a okol\u{00ed} 5\u{00a0}km";

To convert a JSON-encoded string, you can use serde_json, like this:

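    fn main() {
        // JSON-escaped text; in this Rust literal, `\\u` is one backslash followed by `u`.
        let json_encoded = "Dv\\u016fr Kr\\u00e1lov\\u00e9 nad Labem a okol\\u00ed 5\\u00a0km";

        // Wrap the text in double quotes so it forms a JSON string, then let serde_json decode it.
        let result: Result<String, serde_json::Error> =
            serde_json::from_str(&format!("\"{}\"", json_encoded));

        match result {
            Err(e) => println!("oops: {}", e),
            Ok(s) => println!("{}", s),
        }
    }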

Output:

Dvůr Králové nad Labem a okolí 5 km

see playground

Also, this related question might be useful: How to correctly parse JSON with Unicode escape sequences?
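
One reason a per-escape replacement like the `normalize` function in the question can fall short: JSON writes characters outside the Basic Multilingual Plane as two consecutive `\uXXXX` escapes (a UTF-16 surrogate pair), and `char::from_u32` returns `None` for a lone surrogate, so the `unwrap()` would panic. If a JSON parser is not available, a std-only decoder could look roughly like the sketch below. `decode_unicode_escapes` and `flush` are hypothetical names, and the function is deliberately limited: it only handles `\uXXXX`, ignores other JSON escapes such as `\n` or `\"`, and would misread an escaped backslash followed by `u`.

```rust
use std::char::decode_utf16;

/// Decode the pending UTF-16 code units into `out`, failing on an unpaired surrogate.
fn flush(units: &mut Vec<u16>, out: &mut String) -> Result<(), String> {
    for decoded in decode_utf16(units.drain(..)) {
        match decoded {
            Ok(c) => out.push(c),
            Err(e) => return Err(format!("unpaired surrogate: {:04x}", e.unpaired_surrogate())),
        }
    }
    Ok(())
}

/// Replace `\uXXXX` escapes with the characters they encode; other text is copied as-is.
fn decode_unicode_escapes(input: &str) -> Result<String, String> {
    let mut out = String::with_capacity(input.len());
    let mut units: Vec<u16> = Vec::new(); // code units from consecutive escapes
    let mut rest = input;

    while let Some(c) = rest.chars().next() {
        // Accept the escape only if exactly four hex digits follow `\u`.
        let hex = rest
            .strip_prefix("\\u")
            .and_then(|r| r.get(..4))
            .filter(|h| h.bytes().all(|b| b.is_ascii_hexdigit()));

        if let Some(hex) = hex {
            units.push(u16::from_str_radix(hex, 16).map_err(|e| e.to_string())?);
            rest = &rest[6..]; // skip `\u` plus four ASCII hex digits
        } else {
            // Ordinary character: decode any pending escapes first, then copy it.
            flush(&mut units, &mut out)?;
            out.push(c);
            rest = &rest[c.len_utf8()..];
        }
    }
    flush(&mut units, &mut out)?;
    Ok(out)
}

fn main() {
    let raw = r"Dv\u016fr Kr\u00e1lov\u00e9 nad Labem a okol\u00ed 5\u00a0km";
    match decode_unicode_escapes(raw) {
        Ok(s) => println!("{}", s),
        Err(e) => println!("oops: {}", e),
    }
}
```

Collecting consecutive escapes and passing them through `decode_utf16` is what lets surrogate pairs combine into one `char`. For real API responses, a JSON parser such as serde_json remains the safer choice.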

Ultrasaurus