
1) How can I tell doc and docx files apart when downloading them with requests?

a) For instance, if I have

url='https://www.iadb.org/Document.cfm?id=36943997'
r = requests.get(url,timeout=15)
print(r.headers['content-type'])

I get this:

application/vnd.openxmlformats-officedocument.wordprocessingml.document

This file is a docx.

b) If I have

url='https://www.iadb.org/Document.cfm?id=36943972'
r = requests.get(url,timeout=15)
print(r.headers['content-type'])

I get this

application/msword

This file is a doc.

2) Are there other options?

3) If I save a docx file as doc or vice versa, might I run into recognition problems (for instance, when converting to PDF)? Is there a best practice for dealing with this?

DanielTheRocketMan
  • Are you going to save the file anyway? In that case, look at the first two bytes. Note that this will not tell you whether the document is indeed a valid DOC *or* DOCX; it will only differentiate between the two. It only works if you are absolutely sure that the file is *one of these two*. – Jongware Nov 18 '18 at 14:37
  • @usr2564301 Yes, I will download it anyway if it is a doc or docx, but not if it is an Excel file, for instance. How do I do that? I have to do this automatically, because there are lots of files. – DanielTheRocketMan Nov 18 '18 at 14:39

1 Answer


The MIME types you get appear to be the correct ones; see "What is a correct mime type for docx, pptx etc?"
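As a rough sketch of how those two header values could be used to decide automatically whether to keep a download, assuming Python 3: the function name `classify_content_type` and the mapping table are hypothetical and only cover the two types seen in the question.

```python
# MIME types as reported in the question; the mapping itself is illustrative.
WORD_MIME_TYPES = {
    "application/msword": "doc",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
}


def classify_content_type(header_value):
    """Return 'doc', 'docx', or None for a raw Content-Type header value."""
    if not header_value:
        return None
    # Drop parameters such as '; charset=utf-8' and normalise case.
    mime = header_value.split(";", 1)[0].strip().lower()
    return WORD_MIME_TYPES.get(mime)


print(classify_content_type("application/msword"))  # prints: doc
```

With requests this would be called as `classify_content_type(r.headers.get('content-type'))`; anything that returns None (an Excel file, for instance) can be skipped.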

However, the sending software can only go by the file its user selected, and there are still a lot of people sending files with the wrong extension. Some software can handle this, some cannot. To see this in action, change the name of a PNG image to end with JPEG instead. I just did so on my Mac, and Preview is still able to open it. When I press ⌘+I in the Finder it says it is a JPEG file, but when opened in Preview it is correctly identified as a "Portable Network Graphics" file. (Your OS may or may not be able to do this.)

But after the file is downloaded, you can unambiguously distinguish between a DOC and a DOCX file, even if the author got the extension wrong.

A DOC file starts with a Microsoft OLE header, which is a fairly complicated structure. A DOCX file, on the other hand, is a container format holding lots of smaller XML files, compressed together using standard ZIP compression. Therefore, this file type will always start with the two characters PK.

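This check runs under both Python 2.7 and 3.x (only Python 3 actually needs the decode, because there read() returns bytes):

import sys

if len(sys.argv) == 2:
    print('testing file: ' + sys.argv[1])
    # Open in binary mode so the raw signature bytes are read unchanged.
    with open(sys.argv[1], 'rb') as testMe:
        # latin1 maps every byte to one character, so decoding cannot fail.
        startBytes = testMe.read(2).decode('latin1')
        print(startBytes)
        # A ZIP container (and therefore a DOCX) starts with 'PK'.
        if startBytes == 'PK':
            print('This is a DOCX document')
        else:
            print('This is a DOC document')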

Technically, it will confidently state "This is a DOC document" for anything that does not start with PK and, conversely, "This is a DOCX document" for any zipped file (or even a plain text file that happens to start with those two characters). So if you process the file further based on this decision, you may find out it is not a Microsoft Word document after all. But at least you will have tried with the proper decoder.
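If those false positives matter, the check can be tightened somewhat. The following is only a sketch for Python 3, not part of the original check: the function name `sniff_word_format` and its return labels are made up for illustration, and it assumes that a DOCX keeps its main part at `word/document.xml`, which is the usual layout but is not verified against the package relationships here. The OLE signature cannot tell a legacy DOC from an XLS or PPT, since they share the same header.

```python
import io
import zipfile

OLE_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE compound file header
ZIP_SIGNATURE = b"PK\x03\x04"  # local file header of a non-empty ZIP archive


def sniff_word_format(data):
    """Classify raw bytes as 'docx', 'zip-other', 'ole', or 'unknown'."""
    if data.startswith(OLE_SIGNATURE):
        # Legacy DOC, XLS and PPT all share this header, so this only
        # narrows things down to "some OLE compound file".
        return "ole"
    if data.startswith(ZIP_SIGNATURE):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = archive.namelist()
        except zipfile.BadZipFile:
            # Starts like a ZIP but cannot be read as one.
            return "unknown"
        # Assumption: Word stores the main document part at this path.
        return "docx" if "word/document.xml" in names else "zip-other"
    return "unknown"
```

With requests, the downloaded body could be passed in directly as `sniff_word_format(r.content)`. A result of "ole" still means "probably DOC" only if the source is known to serve Word files.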

Jongware