
I'm currently putting together some code to extract a variety of files that are embedded in a Word document using Python, but I'm having particular trouble figuring out how to restore an embedded Outlook .msg file back to its original (usable) .msg form after extracting it as an oleObject.bin file. Does anyone have an idea how to do this?

It's pretty straightforward to restore PDF files, and the zipfile library has built-in tools to deal with zip files in .bin form, but I'm really scratching my head on these .msg files. I can't find a way to carve the original file out of all the added binary data. Any help or thoughts on this would be appreciated!

I essentially want to do the same thing as this question but for .msg files instead of PDFs: How can I decode a .bin into a .pdf

Edit: This is the error I get when I try to just rename the file extension of the .bin to .msg

Adhoc74
  • Please provide enough code so others can better understand or reproduce the problem. – Community Jun 09 '23 at 07:12
  • If you’re talking about just renaming the extension from .bin to .msg (or whatever), that doesn’t work in this case. The file is still openable in Outlook. The .bin file is about 10 KB larger, so the file needs processing first. – Adhoc74 Jun 09 '23 at 20:46

2 Answers


OLE objects, if correctly embedded (not linked), are simply the same as their source. So you can open them in their application and save them from there. A text file will save from Notepad. A zip does not need saving, since it opens as a folder, so it simply needs moving from its temporary location. A .msg can be saved from Outlook, if you trust it enough to open.


If you don't have Outlook, it can be opened in Notepad too (but will only be salvageable as plain text, and as RTF if included). Here we see the Fax Sample entry from Me to You with the complimentary message Hello World!


If we save the RTF, we can see the RTF body content in WordPad (and thus auto-print to PDF using Write /PT ....)

If you want to pull out all the .bin files, use tar -xf to unpack the .docx

hello - docx.zip\word\embeddings
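
Since the question is about Python, the same unpacking step can be sketched with the standard zipfile module instead of tar. This is only a minimal sketch: the function name extract_embeddings and the output directory are made up for illustration, and it assumes the embedded objects sit under word/embeddings/ as in the path above.

```python
import zipfile
from pathlib import Path


def extract_embeddings(docx_path, out_dir):
    """Copy every file under word/embeddings/ out of a .docx (a zip archive).

    Hypothetical helper for illustration; returns the list of written paths.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            # Zip member names always use forward slashes, even on Windows.
            if name.startswith("word/embeddings/") and not name.endswith("/"):
                target = out / Path(name).name
                target.write_bytes(zf.read(name))
                saved.append(target)
    return saved
```

Calling extract_embeddings("hello.docx", "embedded") would then leave files such as oleObject1.bin in the embedded folder, still wrapped in their header and trailer.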

These will include (as you observed from another question) headers and trailers. Of course you will not know which is which without looking inside and removing the header/trailer, but a zip will start with PK.

A .msg will start with the DocFile signature.

The start of a .msg file is marked with ÐÏ à, which in hex is D0 CF 11 E0, i.e. it's a "DocFile".
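
To see where these signatures sit inside an extracted .bin, one option is to list every offset at which they occur. The sketch below is an illustration only; the helper name signature_offsets is hypothetical, and the two constants are just the PK and D0 CF 11 E0 markers mentioned above.

```python
ZIP_MAGIC = b"PK"
DOCFILE_MAGIC = b"\xd0\xcf\x11\xe0"


def signature_offsets(data: bytes, signature: bytes) -> list:
    """Return every byte offset at which `signature` occurs in `data`."""
    offsets = []
    pos = data.find(signature)
    while pos != -1:
        offsets.append(pos)
        # Resume the search one byte further on so later hits are found too.
        pos = data.find(signature, pos + 1)
    return offsets
```

For example, signature_offsets(open("oleObject1.bin", "rb").read(), DOCFILE_MAGIC) shows how many DocFile markers the wrapper contains and where each one starts. A hit for the two-byte PK marker is weaker evidence, since those bytes can also occur by chance.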

The end of a .msg has FEFF FFFF ... padding (shown here in 16-bit groups), so it ends with something like
þÿÿÿýÿÿÿÿÿÿÿÿ ...lots more ÿÿ ... ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
The .bin has more data, so the end of that block is dirty with the filename and path in 16-bit characters
ÿÿÿÿÿÿÿÿT C : \ U s e r s \ n a m e \ A p p D a t a \ L o c a l \ T e m p \ { A 0 9 5 A 1 6 4 - 2 B 3 6 - 4 9 0 5 - A 2 9 4 - E 5 B C C B 9 5 B 9 B 5 } \ H e l l o ( 2 ) . m s g H e l l o . m s g C : \ U s e r s \ n a m e \ D o c u m e n t s \ H e l l o . m s g

I'm unsure whether the T is significant in some cases or just buffer debris, so you need to check.

K J

To close this out, as KJ stated, the actual .msg file content in the .bin file will start with the bytes \xd0\xcf\x11\xe0 (specifically the second instance of that sequence of bytes).

I did some testing, and it looks like the footer padding added by the .bin file at the end begins with [SomeRandomByte]\x00\x00\x00C\x00:. The first byte of that sequence appears to be variable, so I just delete it after removing everything else.

I was able to find the contents by starting with the second \xd0\xcf\x11\xe0 sequence and ending by chopping off everything after and including the [SomeRandomByte]\x00\x00\x00C\x00: sequence.
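
A minimal sketch of that carving step in Python follows. The function name carve_msg is made up for illustration, and the code rests on two assumptions taken from the observations above rather than from any specification: the payload starts at the second \xd0\xcf\x11\xe0, and the first [SomeRandomByte]\x00\x00\x00C\x00: match after that point is the start of the footer. If the message itself happened to contain that byte pattern, the result would be cut short, so check the output in Outlook.

```python
import re

DOCFILE_MAGIC = b"\xd0\xcf\x11\xe0"
# Any single byte, then 00 00 00, then "C" 00 ":" (start of a 16-bit "C:" path).
FOOTER_START = re.compile(rb".\x00\x00\x00C\x00:", re.DOTALL)


def carve_msg(blob: bytes) -> bytes:
    """Cut the embedded .msg out of an oleObject .bin (heuristic sketch)."""
    first = blob.find(DOCFILE_MAGIC)
    if first == -1:
        raise ValueError("no DocFile signature found")
    second = blob.find(DOCFILE_MAGIC, first + 1)
    if second == -1:
        raise ValueError("no second DocFile signature found")
    # Assumption: the first footer-like match after the payload start is the footer.
    footer = FOOTER_START.search(blob, second)
    end = footer.start() if footer else len(blob)
    return blob[second:end]
```

Typical use would be to read oleObject.bin in binary mode, pass the bytes to carve_msg, and write the result to a file with a .msg extension.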

Adhoc74