Extracting parts with different content types from an MHT file into separate MHT files

Question

I am writing a Python script to parse an MHT file, extract each message part from the parent, and write it to a separate MHT file.

I wrote the function below, which opens the MHT file at file_location, searches for a specific content_id, and writes the matching part to a new MHT file:

import mimetypes
import os
from email import message_from_file

def extract_content(self, file_location, content_id, extension):
    first_part = file_location.split(extension)[0]
    # Build the output file name from the source name and the content id
    new_file = first_part + "-" + content_id.split('.')[0] + extension

    # Remove a stale output file if one already exists
    while os.path.exists(new_file):
        os.remove(new_file)

    with open(file_location, 'rb') as mime_file, open(new_file, 'w') as output:
        # Extracting the message from the mht file
        message = message_from_file(mime_file)
        t = mimetypes.guess_type(file_location)[0]

        # Walking through the message
        for i, part in enumerate(message.walk()):

            # Check whether the Content-ID is the one we are looking for
            if part['Content-ID'] == '<' + content_id + '>':
                # Writing the contents
                output.write(part.as_string(unixfrom=False))

However, I cannot open the output files in IE when the part is application/pdf or application/octet-stream.

How do I write parts with a Content-Type such as application/pdf or application/octet-stream into MHT files so that I can view the image or PDF in IE?

Thanks

@m170897017 Thanks for your comment. It does not display any error, but displays a blank page — karthikbharadwaj, Aug 29 '14 at 07:01

Stephen Lin · Answer 1 · 2014-08-29T07:21:22.973

Try this:

...
if m['Content-type'].startswith('text/'):
    m["Content-Transfer-Encoding"] = "quoted-printable"
else:
    m["Content-Transfer-Encoding"] = "base64"

m.set_payload(part.get_payload())
# Writing to output
info = part.as_string(unixfrom=False)
info = info.replace('application/octet-stream', 'text/plain')
output.write(info)
...

Tell me if it works.
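
For binary parts such as application/pdf, here is a minimal Python 3 sketch of the same idea. It assumes the MHT file has already been read as bytes and that the wanted part is a leaf part carrying a Content-ID header. The helper name extract_part and the choice to wrap the part in a multipart/related container are assumptions for illustration, not something taken from the question. It decodes the payload to raw bytes, re-encodes it as base64, and returns a standalone MIME document. Whether IE then renders the result has not been verified.

```python
from email import encoders, message_from_bytes
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart


def extract_part(raw_mht, content_id):
    """Return a standalone MIME document (bytes) for one part, or None.

    extract_part is a hypothetical helper; raw_mht is the whole MHT file
    as bytes and content_id is given without the angle brackets.
    """
    message = message_from_bytes(raw_mht)
    for part in message.walk():
        # Containers have no payload of their own; only leaf parts are copied
        if part.is_multipart():
            continue
        if part['Content-ID'] != '<' + content_id + '>':
            continue

        # Undo the original transfer encoding to get the raw bytes
        data = part.get_payload(decode=True)

        # Rebuild the part with its original type and base64-encode it
        new_part = MIMEBase(part.get_content_maintype(),
                            part.get_content_subtype())
        new_part.set_payload(data)
        encoders.encode_base64(new_part)  # also sets Content-Transfer-Encoding
        new_part['Content-ID'] = part['Content-ID']

        # Assumption: wrap the single part in a multipart/related container
        container = MIMEMultipart('related')
        container.attach(new_part)
        return container.as_bytes()
    return None
```

The returned bytes should be written to a file opened in 'wb' mode, so that line endings and the base64 body are not altered by text-mode translation.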

edited Aug 29 '14 at 07:21

answered Aug 29 '14 at 07:05

Thanks for your answer. I am able to print the file as well as open it in a text editor. But I want IE to recognize it as a proper .mht file, which it is not doing. I guess it is some format issue, but I am not able to pinpoint it exactly. – karthikbharadwaj Aug 29 '14 at 07:11
@karthikbharadwaj I think changing the header might help. I've updated my answer. Give it a shot and let me know. – Stephen Lin Aug 29 '14 at 07:22
Thanks a lot. Works fine. I think I have trouble with the content types application/octet-stream and application/pdf, which are basically JPG and PDF attachments. Is there a way to encode them appropriately so that they appear in the new MHT file? I am also experimenting with the code. If you have some knowledge of this, it would be helpful. Thanks. – karthikbharadwaj Aug 29 '14 at 08:28
@karthikbharadwaj Aha, that's a big question. I'm afraid I cannot help you with this since I don't have that kind of experience. I suggest you ask another question about this; there must be someone who has an answer. BTW, don't forget to accept this answer. Thanks! – Stephen Lin Aug 29 '14 at 08:35
@karthikbharadwaj A new question will get you more attention than continuing here. But if you insist, so be it. – Stephen Lin Sep 02 '14 at 04:07
