I'm trying to extract data from a dump of a Paradox database. It contains two fields with rich text stored as a binary blob that I'm having trouble decoding. The plain text sits in the middle of the blob, surrounded by two blocks of binary data that contain formatting information for the text and whose length varies. So far I have worked out some of the structure, but not enough to reliably decode the whole blob, or at least to figure out how long it is so I can skip to the next one.

What I have so far:

  • integers inside the blob are stored in little-endian format
  • the blob starts with a sequence of 44 bytes
    • the first 4 bytes seem to be always 07 00 00 00
    • the following 4 bytes contain the length of the text in bytes
    • the purpose of the remaining 36 bytes isn't clear yet
  • then follows the unformatted text, whose length is given above
  • the remainder of the blob contains formatting information and has a variable length
    • no idea what the first 25 (or 26) bytes are for
    • they are followed by a series of formatting markers that look like this: A0 03 00 00 03 80. Their meaning is: starting at character 0x03A0, apply style number 03
    • then there are 3 (or 4) bytes specifying the number of styles
    • after that follow style descriptions. Each is 54 bytes long, the name of the font is visible there in plain text.
    • the block ends with 26 bytes of unknown purpose
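
To make the layout above concrete, here is a minimal Python sketch of how the known parts could be split off. It only encodes the observations listed above and is not a decoder for the real format. The names `split_blob` and `decode_marker` are hypothetical, and the split of a marker into a 4-byte offset, a 1-byte style number and a 1-byte trailing value (the `80`) is my assumption; the text encoding is unknown, so the text is returned as raw bytes.

```python
import struct

HEADER_SIZE = 44  # observed: the blob starts with a 44-byte sequence


def split_blob(blob: bytes):
    """Split a blob into (header, text, trailer) using the observed layout.

    Bytes 0-3 are expected to be 07 00 00 00, bytes 4-7 hold the text
    length as a little-endian integer, and the text follows the header.
    """
    if len(blob) < HEADER_SIZE:
        raise ValueError("blob is shorter than the 44-byte header")
    _first, text_len = struct.unpack_from("<II", blob, 0)
    text_end = HEADER_SIZE + text_len
    if text_end > len(blob):
        raise ValueError("text length field points past the end of the blob")
    header = blob[:HEADER_SIZE]
    text = blob[HEADER_SIZE:text_end]   # raw bytes; encoding not known
    trailer = blob[text_end:]           # formatting information, not decoded
    return header, text, trailer


def decode_marker(marker: bytes):
    """Decode one 6-byte formatting marker such as A0 03 00 00 03 80.

    Assumption: 4-byte little-endian character offset, 1-byte style
    number, 1-byte value of unknown meaning.
    """
    offset, style, unknown = struct.unpack("<IBB", marker)
    return offset, style, unknown


if __name__ == "__main__":
    # Synthetic blob built to match the observed layout, not real data.
    sample = struct.pack("<II", 7, 5) + bytes(36) + b"hello" + b"\x01\x02"
    print(split_blob(sample)[1])
    print(decode_marker(bytes.fromhex("A00300000380")))
```

This does not solve the real problem: the trailer still has to be parsed to find where the blob ends.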

A person who has experience with the Paradox file format told me that this rich text blob probably isn't Paradox-specific. Could it be a format that Windows is using to store data in Richedit fields? Does anybody else recognize something about the format?

elpres
  • did you try this? https://sourceforge.net/projects/pxlib/files/pxview/0.2.5/ – Ferenc Deak Apr 15 '16 at 12:48
  • Yes, I used pxview to dump the file to CSV, but then the blobs are just stored in the CSV the same way they are in the original database. – elpres Apr 15 '16 at 12:56
  • The binary blobs could be the result of streaming the contents out of a Rich Edit Control (see [How to Use Streams](https://msdn.microsoft.com/en-us/library/windows/desktop/hh270405.aspx)). You could try sending an [EM_STREAMOUT](https://msdn.microsoft.com/en-us/library/windows/desktop/bb774304.aspx) to a Rich Edit Control in a test application, to see if it's similar to what you have. – IInspectable Apr 15 '16 at 12:58
  • Why are you trying to reverse engineer this binary format? Aren't there tools that can read it? – David Heffernan Apr 15 '16 at 13:39
  • @IInspectable You are probably right. Do you know if the format of the output is documented somewhere? The articles at MSDN show how to stream the content into a buffer, but that seems to be just to e.g. serialize it and then stream back into another Rich Edit later, no further info on the format, as far as I could see. – elpres Apr 15 '16 at 13:39
  • @DavidHeffernan If there are, I didn't find one yet. Not exactly sure what to look for either. – elpres Apr 15 '16 at 13:41
  • I doubt that the binary format is officially documented. Other than the fact that it was introduced with RichEdit 5.0, there is very little information available, with the exception of a high-level explanation on [Paragraphs and Paragraph Formatting](https://blogs.msdn.microsoft.com/murrays/2008/11/21/paragraphs-and-paragraph-formatting/). Instead of reverse-engineering the format you could use another approach: Have the RichEdit Control stream in the binary data and stream it out using a human-readable representation. – IInspectable Apr 15 '16 at 13:52
  • @IInspectable I thought `EM_STREAMOUT` emits RTF. – David Heffernan Apr 15 '16 at 14:00
  • @elpres Do you have any knowledge of how the data was created? – David Heffernan Apr 15 '16 at 14:00
  • @IInspectable Thank you. That approach makes sense and is probably the only workable option that's left. – elpres Apr 15 '16 at 14:06
  • @DavidHeffernan: You have to use the undocumented `SF_BINARY` (0x0008) value. This is explained at [Using RichEdit 6.0 for Math](https://blogs.msdn.microsoft.com/murrays/2007/10/28/using-richedit-6-0-for-math/). – IInspectable Apr 15 '16 at 14:11
  • @DavidHeffernan The data was entered into the database using a GUI written in Paradox itself, which is one of the most rudimentary UIs I've ever seen. The rich text blobs were written in Word and pasted into a text field. Whether Paradox somehow processed the pasted content before storing it or not, I don't know. In general, I've never worked with WinAPI and can only guess what happened to the text between copying it out of Word and its stored form in the database. – elpres Apr 15 '16 at 14:12
  • What @IInspectable is saying sounds quite plausible, especially now that you mention pasting. So I would have a go at what he says, `EM_STREAMOUT` with `SF_BINARY`. Or even paste out of Word, and inspect the contents of the clipboard's RTF format object. – David Heffernan Apr 15 '16 at 14:15
