I am trying to identify the file type of uploaded files. After some searching, I plan to use python-magic to check each file's MIME type.

My models use a FileField, and a ModelForm handles saving the files.

After all the files had been uploaded, I checked their MIME types in a Python shell.

I find that using

magic.from_file("path_to_the_file", mime=True)

gives the expected MIME type for the image, txt, and pdf files that have been saved.

However, it identifies all the docx, ppt, and excel files as 'application/zip'.

Can anyone explain why this is happening? (Does Django automatically save the MS Office files as zip?) And is there a good way to make magic identify the docx, ppt, and excel files as what they are?

Thank you very much.

Mona

1 Answer

I too came across this issue recently. Python-magic wraps libmagic, the library behind the Unix file command, which uses a database file to identify documents (see man file). By default this database does not include instructions on how to identify .docx, .pptx, and .xlsx file types, so they are reported as the ZIP container they are packaged in.

You can give the file command additional information to identify these types by adding instructions to /etc/magic (see https://serverfault.com/a/377792).

This should then work:

magic.from_file("path_to_the_file.docx", mime=True)

Returns 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'

One thing to note: the from_buffer example in the python-magic usage instructions on GitHub reads only the first 1024 bytes, and that does not seem to work for .docx, .pptx, and .xlsx file types (even with the additional information in /etc/magic):

magic.from_buffer(open("path_to_the_file.docx", "rb").read(1024), mime=True)

Returns 'application/zip'

It seems you need to give it more data to correctly identify these file types:

magic.from_buffer(open("path_to_the_file.docx", "rb").read(2000), mime=True)

Returns 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'

I'm not sure of the exact amount needed.
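
If you would rather not depend on how many bytes are passed in, one possible fallback is to open the file as a ZIP archive and look at its member names. This is only a rough sketch using the standard library zipfile module, not something python-magic provides: the function name ooxml_mime and the directory-to-type table are my own illustration, and it assumes the usual layout where the archive contains [Content_Types].xml plus a word/, ppt/, or xl/ directory.

```python
import zipfile

# Assumed mapping: top-level directory of the main document part -> MIME type.
OOXML_MIME_BY_DIR = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def ooxml_mime(file, default="application/zip"):
    """Guess an Office Open XML MIME type from ZIP member names.

    `file` may be a path or a seekable binary file object.
    Returns None when the input is not a ZIP archive at all.
    """
    if not zipfile.is_zipfile(file):
        return None
    with zipfile.ZipFile(file) as archive:
        names = archive.namelist()
    # Every OOXML package is expected to carry this manifest.
    if "[Content_Types].xml" not in names:
        return default
    for prefix, mime in OOXML_MIME_BY_DIR.items():
        if any(name.startswith(prefix) for name in names):
            return mime
    return default
```

You could call this only when magic reports 'application/zip', for example ooxml_mime("path_to_the_file.docx"), and keep the magic result for everything else. It checks names only, so it does not validate that the document is well formed.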

Sion
    Your solution works for me as well. I've had a look in a hex editor and I can see that the bit that libmagic looks for is just before 1800 bytes in my test file, but there's no guarantee that's where it'll be - it's just a string in the zip file format. You can see how libmagic detects OO XML here: https://github.com/threatstack/libmagic/blob/master/magic/Magdir/msooxml I've decided to pass libmagic 2048 bytes for now. – Michael Mulqueen Jul 19 '17 at 13:49