
When I give it a file name with its extension, my code successfully detects the file's type from its "magic number".

magic_numbers = {'png': bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]),
                 'jpg': bytes([0xFF, 0xD8, 0xFF, 0xE0]),
                 #*********************#
                 'doc': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
                 'xls': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
                 'ppt': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
                 #*********************#
                 'docx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
                 'xlsx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
                 'pptx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
                 #*********************#
                 'pdf': bytes([0x25, 0x50, 0x44, 0x46]),
                 #*********************#
                 'dll': bytes([0x4D, 0x5A, 0x90, 0x00]),
                 'exe': bytes([0x4D, 0x5A]),

                 }

max_read_size = max(len(m) for m in magic_numbers.values()) 
 
with open('file.pdf', 'rb') as fd:
    file_head = fd.read(max_read_size)
 
if file_head.startswith(magic_numbers['pdf']):
    print("It's a PDF File")
else:
    print("It's not a PDF file")

I want to know how to modify it so that I don't have to hard-code the part shown below; that is, once I supply a file, the code should directly tell me the file's type.

if file_head.startswith(magic_numbers['pdf']):
    print("It's a PDF File")
else:
    print("It's not a PDF file")

I hope you understand me.

Alya Mad
  • so, you want to examine the first few bytes without reading in the first few bytes? – Garr Godfrey Oct 13 '21 at 19:47
  • Thanks for your answer, I want from the list of "magic numbers" I entered, it reads the contents and it detects from this list the type of file. Without specifying "if it's PDF, you show me PDF" – Alya Mad Oct 13 '21 at 20:41
  • yes, so you need to loop through all the elements and test each. A for loop gives you each key – Garr Godfrey Oct 13 '21 at 20:59
  • Honestly, I don't know how to do it. – Alya Mad Oct 13 '21 at 21:03
  • Be careful, `0xFF, 0xD8, 0xFF, 0xE0` is not the only JPG magic number. There is also - `0xFF 0xD8 0xFF 0xDB`, `0xFF 0xD8 0xFF 0xEE`, `0xFF 0xD8 0xFF 0xE1` and possibly more. Some only check the first 3 bytes. – Coder12345 Apr 02 '23 at 22:40
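The point raised in the last comment, that one type can have several signatures, can be handled by letting each type map to a tuple of accepted prefixes, since `bytes.startswith` accepts a tuple. The following is only a minimal sketch: the names `SIGNATURES` and `detect` are hypothetical, and matching JPG on the first three bytes alone is an assumption taken from that comment, not a complete rule.

```python
# Hypothetical table layout: each type maps to a tuple of accepted prefixes.
SIGNATURES = {
    # Assumption: only the first three bytes are checked for JPG,
    # which covers the E0, E1, DB and EE variants mentioned above.
    'jpg': (b'\xFF\xD8\xFF',),
    'png': (b'\x89PNG\r\n\x1a\n',),
    'pdf': (b'%PDF',),
}


def detect(file_head, signatures=SIGNATURES):
    """Return the first type with a prefix that matches file_head, else None."""
    for name, prefixes in signatures.items():
        # startswith() accepts a tuple and succeeds if any prefix matches.
        if file_head.startswith(prefixes):
            return name
    return None
```

A type with several unrelated signatures would simply list them all in its tuple.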

1 Answer


You most likely just want to iterate over the dictionary and test every entry.

You may be able to optimize, or add some error checking, by using the extension as well. If you take the extension from the file name and check that entry first, you'll be successful most of the time. If it doesn't match, you may not want to accept "baby.png" as an xlsx file. That would be suspicious and worthy of an error.

But, if you ignore extension, just loop over the entries:

for ext in magic_numbers:
    if file_head.startswith(magic_numbers[ext]):
        print("It's a {} File".format(ext))

You probably want to put this in a function that returns the type, so you could just return the type instead of printing it out.
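As a rough sketch of that idea, the loop can be wrapped in a function that returns every matching key instead of printing. The names `MAGIC_NUMBERS`, `matching_types` and `read_head` are made up for this example, and the table is only a subset of the one in the question, repeated so the snippet runs on its own.

```python
# Subset of the question's table, repeated so the sketch is self-contained.
MAGIC_NUMBERS = {
    'png': bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]),
    'pdf': bytes([0x25, 0x50, 0x44, 0x46]),
    'docx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
    'xlsx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
}


def read_head(path, magic_numbers=MAGIC_NUMBERS):
    """Read just enough bytes to compare against the longest magic number."""
    size = max(len(magic) for magic in magic_numbers.values())
    with open(path, 'rb') as fd:
        return fd.read(size)


def matching_types(file_head, magic_numbers=MAGIC_NUMBERS):
    """Return every key whose magic number is a prefix of file_head."""
    return [ext for ext, magic in magic_numbers.items()
            if file_head.startswith(magic)]
```

Calling `matching_types(read_head('file.pdf'))` would then give a list to work with. Returning a list rather than a single value makes it visible when several types share the same magic number.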

EDIT: Since some types share a magic number, we need to assume the extension is correct until we know that it isn't. I would extract the extension from the filename. This could be done with `pathlib` or just a string split:

ext = filename.rsplit('.', 1)[-1]
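For comparison, a `pathlib` version might look like the sketch below (the helper name `extension_of` is made up for illustration). One difference worth knowing: for a name without a dot, `rsplit` gives back the whole name, while `Path.suffix` gives an empty string.

```python
from pathlib import Path


def extension_of(filename):
    """Return the lower-cased extension without the dot, or '' if there is none."""
    # Path.suffix keeps the leading dot (e.g. '.png'), so strip it off.
    return Path(filename).suffix.lstrip('.').lower()
```

Lower-casing is an extra step not in the one-liner, added here so that "BABY.PNG" would still find the 'png' key.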

Then test that entry specifically:

if ext in magic_numbers:
    if file_head.startswith(magic_numbers[ext]):
        return ext

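Put the extension test first. Putting it all together in a function (assuming magic_numbers is the dictionary from the question):

def detect_type(filename, file_head):
    # Trust the extension first, since several types share a magic number.
    ext = filename.rsplit('.', 1)[-1].lower()
    if ext in magic_numbers and file_head.startswith(magic_numbers[ext]):
        return ext

    # Otherwise fall back to the first entry whose magic number matches.
    for ext, magic in magic_numbers.items():
        if file_head.startswith(magic):
            return ext

    # Nothing matched.
    return None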
Garr Godfrey
  • Thank you very much for your answer, that's exactly what I want to do. About removing the extension: it returns exactly what I want, i.e. when I take a file that I know is a PDF and remove its extension, it still shows me that it is a PDF file (you can test it). One small problem is that .docx, .xlsx and .pptx have the same magic number. How can I differentiate between them? When I run it with a ".docx" file it shows me: "It's a docx File", "It's a xlsx File", "It's a pptx File". – Alya Mad Oct 13 '21 at 21:24
  • The same thing happens for the other extensions that have the same magic numbers. – Alya Mad Oct 13 '21 at 21:26
  • if they have the same magic numbers, you would need to fall back to the extension. Probably following my suggestion to check the file extension first. – Garr Godfrey Oct 13 '21 at 21:29
  • If you really want to be able to differentiate various types of office documents without relying on the extension (or how to tell them apart from `.zip` files), you would need to implement further checks when you detect that magic number based on the [Office Open XML standard](https://www.ecma-international.org/publications-and-standards/standards/ecma-376/). – Da Chucky Oct 13 '21 at 21:57
  • @Garr Godfrey, thank you so much for the suggestion, but I didn't understand :( I tried it with my code but it didn't work. – Alya Mad Oct 13 '21 at 22:13
  • @Da Chucky, thanks for your suggestion. If I understand you, I must look at the content of the XML file of each type (docx, xlsx and pptx) and see what I can extract? But I don't know how to code it. – Alya Mad Oct 13 '21 at 22:16
  • @K J, you are right. But I want to see this way by modifying this code if it will work of course. – Alya Mad Oct 13 '21 at 22:19
  • @KJ I know if you change the extension of .docx, .pptx, .xlsx to .ZIP you can see the content of the [Content_type].xml of each file wanted, - if it is a .docx you will find "". - if it's a .PPTX you will find "" So far so good. But how to code all that with python ? – Alya Mad Oct 13 '21 at 22:42
  • @K J Thank you for your answer. But it's not easy to do that in a python program – Alya Mad Oct 14 '21 at 14:28
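The check discussed in these comments, opening the file as a ZIP archive and looking inside `[Content_Types].xml`, can be sketched with the standard `zipfile` module. This is an illustration under stated assumptions rather than a full implementation: the function name `ooxml_type` and the `OOXML_MARKERS` table are hypothetical, the markers are substrings of the main-part content types used by plain .docx, .xlsx and .pptx files, and templates or macro-enabled variants use different content types that this sketch would report as 'zip'.

```python
import zipfile

# Substrings of the main-part content types declared in [Content_Types].xml.
# Assumption: only plain .docx/.xlsx/.pptx are covered, not templates
# or macro-enabled variants.
OOXML_MARKERS = {
    'docx': b'wordprocessingml.document.main+xml',
    'xlsx': b'spreadsheetml.sheet.main+xml',
    'pptx': b'presentationml.presentation.main+xml',
}


def ooxml_type(path_or_file):
    """Return 'docx', 'xlsx', 'pptx', 'zip' for other archives, or None."""
    if not zipfile.is_zipfile(path_or_file):
        return None
    with zipfile.ZipFile(path_or_file) as archive:
        try:
            content_types = archive.read('[Content_Types].xml')
        except KeyError:
            # A ZIP archive without that member is not an Office document.
            return 'zip'
    for ext, marker in OOXML_MARKERS.items():
        if marker in content_types:
            return ext
    return 'zip'
```

One way to use it would be to call `ooxml_type` only after the `50 4B 03 04` magic number has matched, so that other file types never pay the cost of opening an archive.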