I'm working on an RTF parser and having some difficulty handling Unicode.
The RTF spec states that "Unicode values greater than 32767 must be expressed as negative numbers" (http://www.biblioscape.com/rtf15_spec.htm#Heading9), so to get the Unicode code point we add 65536 to those negative numbers.
I tested that scenario by setting up a document containing Unicode characters 32767 and 32768. Word (v2011 on Mac) produces the following RTF for those two characters:
\u32767\'5f\loch\af556\hich\af31506\dbch\f556 \uc2\u-32768\'97\'73
For the second one, -32768+65536 is 32768 as expected. So the \uNNNN commands make sense.
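For reference, this is roughly how I handle the \uNNNN argument. It is a minimal Python sketch; the function name is my own, and it assumes the argument has already been parsed as a signed 16-bit integer:

```python
def rtf_u_to_codepoint(value: int) -> int:
    # Hypothetical helper (not from the spec): map the signed 16-bit
    # argument of an RTF \u control word to an unsigned code point.
    if not -32768 <= value <= 32767:
        raise ValueError(f"\\u argument out of 16-bit range: {value}")
    # Negative arguments stand for values above 32767, so add 65536.
    return value + 65536 if value < 0 else value


print(rtf_u_to_codepoint(32767))   # 32767
print(rtf_u_to_codepoint(-32768))  # 32768
```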
My problem is with the text escape sequences, like the \'97\'73 at the end. I don't understand why they are there. I could code my parser to ignore escapes that are chained onto the end of a \uNNNN command like that. But when I compared with the RTF output of TextEdit, I found that for the second character it outputs only the text escape sequence, with no \uNNNN command:
\uc0\u32767 \'97\'73
That looks like it is meant to be a double-byte escape sequence for the character, and \' escapes are written in hexadecimal. But 0x9773 is 38771, not 32768, so I don't understand how to extract the desired Unicode value from that data. Any ideas?
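This is how I checked that arithmetic. It is only a sketch: the helper name and the regular expression are my own, and reading the two bytes as one big-endian number is my assumption about what the escapes mean:

```python
import re

# Matches one \'hh escape and captures its two hex digits.
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def hex_escape_bytes(rtf: str) -> bytes:
    # Hypothetical helper: collect the raw bytes named by \'hh escapes.
    return bytes(int(digits, 16) for digits in HEX_ESCAPE.findall(rtf))


raw = hex_escape_bytes(r"\'97\'73")
print(raw.hex())                   # 9773
print(int.from_bytes(raw, "big"))  # 38771, not 32768
```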
Update: I ran some further tests to see how TextEdit handles character codes 32767 through 32777. In RTF they look like this, one character per line, in order:
\u32767
\'97\'73
\'98\'56
\u32770
\'8d\'6c
\'e3\'cc
\'8e\'d2
\'e3\'cb
\u32775
\u32776
\'c2\'56
This RTF loads properly in both TextEdit and Word, so it is clearly valid. I just don't see a pattern that maps these byte pairs to the code points.