I need help understanding how to handle JSON \u escapes where surrogate pairs are involved

Question

I've thrown myself into the deep end of the pool, so forgive me if I'm struggling a bit.

For background:

https://mathiasbynens.be/notes/javascript-encoding

I've been looking at the following two links and it leaves me with some questions:

It appears (based on my understanding) that you can represent a 32-bit code point using a surrogate pair of JSON escapes, something like "\uD834\uDF06".

First question: Is that accurate? Is this how you represent a 32-bit Unicode code point in JSON? (I heard JavaScript engines are a bit weird because the spec predates UTF-16, so they might not handle surrogates as one character, but I hope I don't have to care about that.)
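
For reference, here is a minimal sketch of the arithmetic behind such a pair, under the usual UTF-16 rules: a high surrogate (U+D800..U+DBFF) carries the top 10 bits and a low surrogate (U+DC00..U+DFFF) the bottom 10 bits of the code point minus 0x10000. Code points top out at U+10FFFF, so "32-bit" describes the storage type rather than the range. The function names are hypothetical and not taken from any library.

```cpp
#include <stdint.h>

// True if u is a UTF-16 high (leading) surrogate, U+D800..U+DBFF.
static bool is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if u is a UTF-16 low (trailing) surrogate, U+DC00..U+DFFF.
static bool is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a high/low pair into a code point in U+10000..U+10FFFF.
// Assumes the caller has already checked both halves with the helpers above.
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo) {
    uint32_t top = ((uint32_t)hi - 0xD800UL) << 10;  // upper 10 bits
    uint32_t bottom = (uint32_t)lo - 0xDC00UL;       // lower 10 bits
    return 0x10000UL + (top | bottom);
}
```

With the example from the question, `combine_surrogates(0xD834, 0xDF06)` yields 0x1D306.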

Second question: Assuming that's accurate, is it somehow possible to create a valid surrogate pair using one JSON escape and a couple of extended characters in the same string? Should I be able to handle that in my code? What I mean is: if I encounter something like "\uD834��", where � is an arbitrary value, possibly in the extended character range, should I fail due to an invalid surrogate pair, or should I treat the � characters as the second half of the pair? (Characters are one byte in my code because I use UTF-8 internally, so the two extended characters above would be 16 bits total.)

Does that even make sense? I'm not even sure I'm asking the right questions here so forgive me. I am very new at this.

I have to know this myself, by the way, instead of using existing libraries, because I'm targeting platforms including the Arduino with my JSON library, and on that platform everything is roll-your-own.

Have you read the [JSON spec](http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf) for yourself? Section 9 covers string encoding and surrogates in detail. — Remy Lebeau, Dec 24 '20 at 20:03
I looked at it, and I think it's the wrong spec because there are RFCs that supersede the ECMA spec. Furthermore, it says nothing about my question. It mentions surrogate pairs but it doesn't say how to handle the situation I described. Adding: I'm not sure, but it could be that such a character sequence is impossible to represent with UTF-8. If that's the case, I can see why I can't find anything about this scenario. — honey the codewitch, Dec 25 '20 at 21:13
The ECMA spec says: “*Any code point may be represented as a hexadecimal escape sequence... If the code point is in the [BMP] (...), then it may be represented as a **six-character sequence**... To escape a code point that is not in the [BMP], the character may be represented as a **twelve-character sequence**, encoding the UTF-16 surrogate pair corresponding to the code point...*” That is the only defined way to encode a codepoint in JSON using escape sequences. The example you provided does not conform to that description, so by definition it is malformed. — Remy Lebeau, Dec 25 '20 at 22:53
This is also mirrored in [RFC 8259 Section 7](https://tools.ietf.org/html/rfc8259#section-7). And [Section 8.2](https://tools.ietf.org/html/rfc8259#section-8.2) even mentions the possibility of unpaired surrogates, like in your example, which causes unpredictable behavior. — Remy Lebeau, Dec 25 '20 at 22:59
Sorry, I must have missed that in the spec. I had other things going on today. (it's xmas here) — honey the codewitch, Dec 26 '20 at 03:54
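
The rule from the comments above (a code point outside the BMP is a twelve-character sequence, and anything else after a high surrogate is malformed) could be sketched roughly as follows. This is only an illustration under stated assumptions: the input is NUL-terminated, the output buffer holds at least 4 bytes, and every function name here is hypothetical. It treats raw bytes after `\uD834` as an error, since well-formed UTF-8 never encodes U+D800..U+DFFF and so raw bytes cannot stand in for a low surrogate.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: parse "\uXXXX" at s into *unit; false if malformed.
static bool parse_u_escape(const char* s, uint16_t* unit) {
    if (s[0] != '\\' || s[1] != 'u') return false;
    uint16_t v = 0;
    for (int i = 2; i < 6; ++i) {
        int d = hex_value(s[i]);
        if (d < 0) return false;  // also stops safely at the terminating NUL
        v = (uint16_t)((v << 4) | (uint16_t)d);
    }
    *unit = v;
    return true;
}

// Hypothetical helper: write cp as UTF-8 into out; returns the byte count.
static size_t encode_utf8(uint32_t cp, uint8_t* out) {
    if (cp < 0x80UL) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

// Decode the escape(s) starting at s. On success, stores the UTF-8 bytes in
// out, sets *consumed to 6 or 12, and returns the number of bytes written.
// Returns 0 for a malformed escape or an unpaired surrogate.
static size_t decode_json_u_escape(const char* s, uint8_t* out, size_t* consumed) {
    uint16_t hi = 0, lo = 0;
    if (!parse_u_escape(s, &hi)) return 0;
    if (hi >= 0xDC00 && hi <= 0xDFFF) return 0;  // lone low surrogate
    if (hi < 0xD800 || hi > 0xDBFF) {            // ordinary BMP code point
        *consumed = 6;
        return encode_utf8(hi, out);
    }
    // High surrogate: the next six characters must be a \u escape
    // that holds a low surrogate; raw bytes do not qualify.
    if (!parse_u_escape(s + 6, &lo) || lo < 0xDC00 || lo > 0xDFFF) return 0;
    *consumed = 12;
    uint32_t cp = 0x10000UL +
        ((((uint32_t)hi - 0xD800UL) << 10) | ((uint32_t)lo - 0xDC00UL));
    return encode_utf8(cp, out);
}
```

Rejecting the unpaired surrogate is one design choice among several; RFC 8259 Section 8.2 only notes that behavior for such strings is unpredictable, so a parser could instead substitute U+FFFD. Failing early is simply the easiest policy to reason about on a small target.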
