doc or docx: Is there a safe way to identify the type from 'requests' in Python 3?

Question

1) How can I differentiate doc and docx files from requests?

a) For instance, if I have

url='https://www.iadb.org/Document.cfm?id=36943997'
r = requests.get(url,timeout=15)
print(r.headers['content-type'])

I get this:

application/vnd.openxmlformats-officedocument.wordprocessingml.document

This file is a docx.

b) If I have

url='https://www.iadb.org/Document.cfm?id=36943972'
r = requests.get(url,timeout=15)
print(r.headers['content-type'])

I get this

application/msword

This file is a doc.

2) Are there other options?

3) If I save a docx file as doc, or vice versa, will I run into recognition problems (for instance, when converting to PDF)? Is there a best practice for dealing with this?

Are you going to save the file anyway? In that case, look at the first two bytes. Note that it will not tell you if the document is indeed a valid DOC *or* a DOCX; it will only differentiate between these two. This will only work if you are absolutely sure that it will be *one of these two*. — Jongware, Nov 18 '18 at 14:37
@usr2564301 Yes, I will download it anyway if it is a doc or docx, but not if it is, for instance, an Excel file. How do I do that? I mean, I have to do this automatically! There are lots of files. — DanielTheRocketMan, Nov 18 '18 at 14:39

score 2 · Accepted Answer · answered Nov 18 '18 at 15:12

The mime headers you get appear to be the correct ones: What is a correct mime type for docx, pptx etc?
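If you want to filter on those headers before looking at the file itself, a minimal sketch could look like this. The function name `word_type_from_content_type` is made up for illustration, and the table only contains the two MIME types shown in the question; it simply trusts whatever the server sends.

```python
# MIME types taken from the two responses shown in the question.
WORD_MIME_TYPES = {
    "application/msword": "doc",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
}


def word_type_from_content_type(content_type):
    """Return 'doc', 'docx', or None for a Content-Type header value.

    Hypothetical helper: it takes the header at face value.
    """
    if not content_type:
        return None
    # Drop parameters such as '; charset=...' and normalise the case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return WORD_MIME_TYPES.get(mime)
```

With the code from the question, you would call it as `word_type_from_content_type(r.headers.get('content-type'))` and skip the download when it returns `None`.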

However, the sending software can only go on what file its user selected – and there still are a lot of people sending files with the wrong extension. Some software can handle this, others cannot. To see this in action, change the name of a PNG image to end with JPEG instead. I just did so on my Mac, and Preview is still able to open it. When I press ⌘+I in the Finder it says it is a JPEG file, but when opened in Preview it gets correctly identified as a "Portable Network Graphics" file. (Your OS may or may not be able to do this.)

But once the file is downloaded, you can unambiguously distinguish a DOC file from a DOCX file, even if the author got the extension wrong.

A DOC file starts with a Microsoft OLE header, which is quite a complicated structure. A DOCX file, on the other hand, is a container format holding lots of smaller XML files, compressed together using standard ZIP compression. Therefore, this file type will always start with the two characters PK.
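Since you are fetching the files with `requests`, the same idea can be applied to the bytes of the response before anything is written to disk. The sketch below is Python 3 only and checks both signatures instead of only `PK`: the 8-byte OLE compound file signature and the 4-byte signature that starts a non-empty ZIP archive. The name `sniff_word_container` is hypothetical, and a match only tells you which container it is, not that the content really is a Word document.

```python
# First 8 bytes of an OLE2 compound file (used by legacy .doc, .xls, .ppt).
OLE_MAGIC = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
# Local file header signature at the start of a non-empty ZIP archive.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_word_container(data):
    """Guess 'doc', 'docx', or None from the leading bytes of a file.

    Hypothetical helper: 'docx' really means "some ZIP archive" and
    'doc' really means "some OLE compound file".
    """
    if data.startswith(ZIP_MAGIC):
        return "docx"
    if data.startswith(OLE_MAGIC):
        return "doc"
    return None
```

For a response `r` from the question's code, `sniff_word_container(r.content[:8])` would give the guess, and `None` means the download is neither of the two.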

The following check works in both Python 2.7 and 3.x (only Python 3 actually needs the decode):

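import sys

# Usage: python <this script> <path to file>
if len(sys.argv) == 2:
    print('testing file: ' + sys.argv[1])
    # Open in binary mode so the raw bytes are read unchanged.
    with open(sys.argv[1], 'rb') as testMe:
        # latin1 maps every byte to one character, so the decode cannot fail.
        startBytes = testMe.read(2).decode('latin1')
        print(startBytes)
        if startBytes == 'PK':
            # ZIP signature found: treat the file as DOCX.
            print('This is a DOCX document')
        else:
            # Anything else is assumed to be a DOC.
            print('This is a DOC document')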

Technically it will confidently state "This is a DOC document" for anything that does not start with PK, and, conversely, it will say "This is a DOCX document" for any zipped file (or even a plain text file that happens to start with those two characters). So if you further process the file based on this decision, you may find out it's not a Microsoft Word document after all. But at least you will have tried with the proper decoder.
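If the "any zipped file" case worries you, one way to tighten the DOCX side is to open the archive and look for the main document part. This is only a sketch (Python 3): it assumes the main part sits at the usual `word/document.xml` path, which is what Word writes but is not something the check can guarantee for every producer, and `looks_like_docx` is a made-up name.

```python
import io
import zipfile


def looks_like_docx(data):
    """Return True if data is a ZIP archive containing word/document.xml.

    Hypothetical helper; assumes the conventional part name used by Word.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return "word/document.xml" in archive.namelist()
    except zipfile.BadZipFile:
        # Not a readable ZIP archive at all (e.g. text that starts with 'PK').
        return False
```

For a downloaded response you would call `looks_like_docx(r.content)`. An Excel `.xlsx` file is also a ZIP archive, but it has no `word/document.xml`, so it would be rejected here.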
