Parsing unicode with values > 32767 from RTF files

Question

I'm working on an RTF parser and having some difficulty handling unicode.

The RTF spec states that "Unicode values greater than 32767 must be expressed as negative numbers" (http://www.biblioscape.com/rtf15_spec.htm#Heading9), and to get the unicode numerical value we add 65536 to those negative numbers.

I was testing that scenario by setting up a document with unicode character 32767 and 32768. Word (v2011 on Mac) produces the following RTF syntax for those 2 characters:

\u32767\'5f\loch\af556\hich\af31506\dbch\f556 \uc2\u-32768\'97\'73

For the second one, -32768+65536 is 32768 as expected. So the \uNNNN commands make sense.
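That arithmetic can be sketched in Python as follows. The function name `rtf_u_to_codepoint` is made up for illustration, and the sketch assumes the parameter has already been parsed into an integer:

```python
def rtf_u_to_codepoint(n: int) -> int:
    """Map the signed 16-bit parameter of \\uN to a value in 0..65535."""
    # Values above 32767 are written as negative numbers,
    # so adding 65536 recovers the intended value.
    return n + 65536 if n < 0 else n


print(rtf_u_to_codepoint(32767))   # parameter from \u32767
print(rtf_u_to_codepoint(-32768))  # parameter from \u-32768
```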

My problem is with the text escape sequences, like the \'97\'73 at the end. I don't understand why that's there. I could code my parser to ignore commands that are chained onto the end of a \uNNNN command like that. But I compared with the RTF output of TextEdit, and it only outputs the text escape sequences:

\uc0\u32767 \'97\'73

It seems like that's trying to be a double byte unicode escape sequence. And that kind of \' text escape is in hexadecimal. But 0x9773 is 38771, not 32768, so I don't understand how I can extract the desired unicode value from that data. Any ideas?

Update: I ran some further tests to look at how TextEdit handles character codes 32767 - 32777. They look like this in RTF:

\u32767 
\'97\'73
\'98\'56
\u32770 
\'8d\'6c
\'e3\'cc
\'8e\'d2
\'e3\'cb
\u32775 
\u32776 
\'c2\'56

This RTF will load properly in both TextEdit and Word, so clearly it's valid. I just don't see a pattern here.

score 0 · Answer 1 · answered Nov 09 '17 at 00:06

In RTF, a \u tag is followed by fallback characters that represent the Unicode character in an ASCII or multibyte character set. They exist for backward compatibility with older RTF readers that don't support the \u tag. The number of fallback characters that follow each \u tag is specified by the \uc tag. A character from a multibyte character set needs more than one fallback character, hence a \uc value of 2 or more. Most modern RTF readers simply skip the fallback characters, so their values are no longer significant. I hope this answers your question regarding the pattern.
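As a rough illustration of that skipping rule, here is a minimal Python sketch rather than a full RTF parser. The name `extract_unicode` and the tokenizing regex are assumptions made for this example. It ignores group scoping with { and }, treats only \'xx escapes and plain characters as fallback characters, and drops \'xx escapes that are not fallbacks instead of decoding them with a code page.

```python
import re

# Tokens: a \'xx hex escape, a control word with an optional signed
# parameter (plus one optional delimiting space), or any other character.
TOKEN = re.compile(r"\\'([0-9a-fA-F]{2})|\\([a-zA-Z]+)(-?\d+)? ?|(.)", re.DOTALL)


def extract_unicode(rtf: str) -> str:
    out = []
    uc = 1    # default when no \uc has been seen: one fallback character
    skip = 0  # fallback characters still to be discarded
    for m in TOKEN.finditer(rtf):
        hex_byte, word, param, char = m.groups()
        if word == "uc":
            uc = int(param or 1)
        elif word == "u" and param is not None:
            n = int(param)
            out.append(chr(n + 65536 if n < 0 else n))
            skip = uc
        elif word is not None:
            continue   # other control words are ignored in this sketch
        elif skip > 0:
            skip -= 1  # a \'xx escape or plain character used as fallback
        elif char is not None:
            out.append(char)
        # A \'xx escape that is not a fallback is dropped here; decoding it
        # would require knowing the document's code page.
    return "".join(out)


word_sample = r"\u32767\'5f\loch\af556\hich\af31506\dbch\f556 \uc2\u-32768\'97\'73"
print([hex(ord(c)) for c in extract_unicode(word_sample)])
```

For the Word output quoted in the question, this yields the two characters 0x7fff and 0x8000: the \'5f after \u32767 and the \'97\'73 after \u-32768 are consumed as fallbacks.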
