
I am working on an RTF file created by someone else on an unknown platform. Everything is interpreted correctly except a few characters, whichever character set I choose when opening the file in OpenOffice. Here is the plain text after interpretation:

"Même taille que la Terre, même masse, même âgec Vénus a souvent été qualifiée de sœur de la Terre. "

and here is the original ANSI paragraph:

"M\u234\'3fme taille que la Terre, m\u234\'3fme masse, m\u234\'3fme \u226\'3fge\uc2 \u61825\'ff\'81\uc1 c V\u233\'3fnus a souvent \u233\'3ft\u233\'3f qualifi\u233\'3fe de s\u339\'3fur de la Terre."

To zoom in:

"âgec Vénus" becomes "\u226\'3fge\uc2 \u61825\'ff\'81\uc1 c V\u233\'3fnus"

and finally, what we come up with:

"\uc2 \u61825\'ff\'81\uc1 c"

As I understand it, \uc2 and \uc1 here say that we are switching back and forth between 4-byte and 2-byte Unicode encoding.

\u61825 is an unknown Unicode character. According to the RTF specification, any Unicode value greater than 2^15 should be written in negative form, and since the file is plain ANSI text, the "-" (minus) sign should then be visible in Notepad, am I right? So here already there is something I don't understand: how could the RTF writer used by the person who made the file have produced this? Maybe I missed something in the specification (specific versions, character sets), I don't know. Taken as is, 61825 corresponds to 0xF181, which lies in a private area of the Unicode table.
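The arithmetic in the paragraph above can be checked with a minimal Python sketch. The helper name `to_rtf_signed` is hypothetical, and the Private Use Area bounds U+E000 to U+F8FF are those of the Basic Multilingual Plane:

```python
def to_rtf_signed(code_unit: int) -> int:
    """Return the signed 16-bit form of a 16-bit code unit."""
    return code_unit - 0x10000 if code_unit > 0x7FFF else code_unit

value = 61825
print(hex(value))                 # 0xf181
print(0xE000 <= value <= 0xF8FF)  # True: inside the BMP Private Use Area
print(to_rtf_signed(value))       # -3711
```

So a writer following the negative-number convention would have emitted \u-3711 for the same code unit; the file at hand uses the unsigned form instead.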

And then the \'ff\'81 would be a use of the ANSI-equivalent field of the whole "special character" group (whose structure is usually \uN\'XX) to encode something 4 bytes long. Here again, I could not find:

  • which code page (Windows-1252, ISO-8859-1, other?) is being referred to (everywhere else in the file where a \uN\'XX sequence appears, XX is always 3F, the Windows-1252 code for "?", so that did not give me much information)

  • what the \'FF (which looks like some control character inside an escape sequence!) stands for, and then why \'81... Actually, \u61825 translated to hex is F181, not FF81... I am lost here!

Finally, what the translated text (in French) would make us expect is the ":" (colon): "Same size as Earth, same mass, same age: Venus has often been qualified as Earth's sister". It would make sense. But what RTF writer could come up with such a complicated code for the colon?

So, after an hour of searching, I open the question to you: does someone recognize this, and can you tell me which control word encoding is used? Is there a big-endian/little-endian/two's-complement mess with the 61825, and likewise with the \'ff\'81, which would assemble as FF81 instead of F181 (which itself doesn't mean anything as is)? My only real question is whether there is a way to recover the complete original text from this bizarre RTF encoding.

Remy Lebeau
MrBrody
  • I would advise a small edit to this post: give us a byte-level dump of that section of the file, rather than trying to interpret it as Unicode characters, i.e., something like "2C 81 FF". – Simon Callan Apr 12 '12 at 12:10
  • You're right, here is the hex dump of the "\uc2 \u61825\'ff\'81\uc1 c": 5C 75 63 32 20 5C 75 36 31 38 32 35 5C 27 66 66 5C 27 38 31 5C 75 63 31 20 63 -- exactly what it should be! – MrBrody Apr 12 '12 at 16:49
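The dump in the comment above can be turned back into text with a few lines of Python, as a quick sanity check that the file really contains plain ASCII escape sequences and no raw high bytes (the variable names are illustrative only):

```python
dump = ("5C 75 63 32 20 5C 75 36 31 38 32 35 5C 27 66 66 "
        "5C 27 38 31 5C 75 63 31 20 63")

raw = bytes.fromhex(dump)        # fromhex ignores the spaces between byte pairs
text = raw.decode("ascii")       # would raise if any byte were >= 0x80
print(text)                      # \uc2 \u61825\'ff\'81\uc1 c
```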

1 Answer


what the translated text (in French) would make us expect is the ":" (colon)

Nearly: it should be the ellipsis. You can see the source text, e.g., here.

The ellipsis should normally be written simply as three periods, but there has traditionally been a separate character representing ellipsis in order better to control their spacing, back before complex text layout algorithms existed that could do automatic glyph replacement. Consequently there exists a Unicode compatibility character U+2026 HORIZONTAL ELLIPSIS to allow round-tripping to legacy encodings such as Windows code page 1252, where it is byte 133.
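The code page claim is easy to verify with Python's built-in `cp1252` codec; this is only a sketch of the check, not anything taken from the RTF file itself:

```python
ellipsis = "\u2026"                  # HORIZONTAL ELLIPSIS
encoded = ellipsis.encode("cp1252")

print(encoded)        # b'\x85'
print(encoded[0])     # 133
print(ord(ellipsis))  # 8230, the decimal value a \uN control word would carry
```

A conventional writer would therefore plausibly have emitted something like \u8230\'85 for this character, which is quite different from the \u61825\'ff\'81 found in the file.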

That, however, is not what has been encoded in your RTF document. That would be too easy.

61825 is an unknown Unicode character.

It's a Private Use Area character, which means it could represent absolutely anything. Word has exported certain common symbol fonts as PUA characters - see this post for the background.

So someone at some point may have used a symbol font where code unit 129 (the 0x81 in U+F181, 61825) maps to something that looks like an ellipsis. Quite what that font is, I have no idea! It doesn't seem to be one of the usual suspects (Symbol, Wingdings, Webdings). You might just have to manually replace U+F181 with U+2026 for now unless you can find out more about the source.
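The manual replacement can be sketched in Python. This is a minimal decoder for short fragments only, not a full RTF parser: it handles \ucN (the number of fallback bytes that follow each \uN), \uN and \'XX, and ignores every other control word, groups and surrogate pairs. The mapping of U+F181 to U+2026 is the guess made above, and `PUA_FIXES`, `TOKEN` and `decode_fragment` are hypothetical names:

```python
import re

# Assumption: in this document U+F181 stands for the ellipsis.
PUA_FIXES = {0xF181: "\u2026"}

TOKEN = re.compile(r"\\uc(\d+) ?|\\u(-?\d+) ?|\\'([0-9a-fA-F]{2})|(.)", re.DOTALL)

def decode_fragment(rtf: str, fallback_codec: str = "cp1252") -> str:
    out = []
    uc = 1    # RTF default: one fallback byte after each \uN
    skip = 0  # fallback characters still to be ignored
    for m in TOKEN.finditer(rtf):
        uc_n, u_n, hex_byte, literal = m.groups()
        if uc_n is not None:
            uc = int(uc_n)
        elif u_n is not None:
            code = int(u_n) % 0x10000  # negative values wrap to 16 bits
            out.append(PUA_FIXES.get(code, chr(code)))
            skip = uc
        elif skip > 0:
            skip -= 1                  # drop one fallback character
        elif hex_byte is not None:
            byte = bytes([int(hex_byte, 16)])
            out.append(byte.decode(fallback_codec, errors="replace"))
        else:
            out.append(literal)
    return "".join(out)

fragment = r"\u226\'3fge\uc2 \u61825\'ff\'81\uc1 c V\u233\'3fnus"
print(decode_fragment(fragment))  # âge…c Vénus
```

Under these assumptions the \'ff\'81 pair is simply the two-byte fallback announced by \uc2 and is discarded, while the stray "c" is literal text in the file and survives the decoding.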

bobince
  • Whoa, I didn't think anyone would find anything about that... I didn't know any of what you said here; you went a lot deeper than I did! Thank you, I think you're right: it must be some very weird way to represent the ellipsis, through some font mapped into the PUA... – MrBrody May 04 '12 at 03:56