I use the requests library in Python to download a large number of image files over HTTP. I wrap the received content in a BytesIO object, open it with Pillow, and save it as a JPEG file.

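import os
import requests
from io import BytesIO
from PIL import Image

rsp = requests.get(imageurl)
content_type_received = rsp.headers.get('Content-Type', '')  # MIME type claimed by the server
binarycontent = BytesIO(rsp.content)
if content_type_received.startswith('image'):  # image/jpeg, image/png, etc.
    i = Image.open(binarycontent)  # raises an exception if Pillow cannot identify the data
    outfilename = os.path.join(outfolder, 'myimg' + '.jpg')
    with open(outfilename, 'wb') as f:
        f.write(rsp.content)  # the bytes are written unchanged; nothing is converted to JPEG
rsp.close()  # close the response whether or not the file was saved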

What are the potential security risks of this code? (I am not sure how much I can trust that the MIME type in the response header really matches what the server sent.) Is there a better way to write a secure download routine?

hAcKnRoCk

1 Answer

The potential security risk of your code depends on how much you trust the server you're contacting. If you're sure that the server will never try to fool you with malicious content, then that piece of code is relatively safe. Otherwise, check the content type yourself. The biggest potential risk is unknowingly saving an executable rather than an image. A smaller one is storing some other kind of content that may crash PIL or another component of your application.

Keep in mind that the server is free to choose whatever value it wants for any response header, including Content-Type. If you have any reason to believe the server you're contacting might not be honest about it, you shouldn't trust the response headers.

If you want a more reliable way to determine the type of the content you received, I suggest you take a look at python-magic, a wrapper for libmagic. This library lets you determine the content type yourself, so you don't have to "trust" the server you're downloading from.

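import magic  # python-magic, a wrapper around libmagic
from io import BytesIO

# ...
content = BytesIO(rsp.content)
mime = magic.from_buffer(content.read(1024), mime=True)  # detect the type from the first 1024 bytes
if mime.startswith('image'):
    content.seek(0)  # Reset the stream position because you read from it
    # ...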

python-magic is well documented, so I recommend having a look at its README if you consider using it.
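
To illustrate what determining the content type yourself means, here is a minimal sketch that checks a few well-known leading signatures by hand. It is not how python-magic works internally and it only covers JPEG, PNG, and GIF; the names SIGNATURES and sniff_image are made up for this example.

```python
# Hand-rolled signature check (an illustration, not python-magic's algorithm).
# Maps leading "magic bytes" to a MIME type and a matching file extension.
SIGNATURES = {
    b"\xff\xd8\xff": ("image/jpeg", ".jpg"),
    b"\x89PNG\r\n\x1a\n": ("image/png", ".png"),
    b"GIF87a": ("image/gif", ".gif"),
    b"GIF89a": ("image/gif", ".gif"),
}


def sniff_image(data: bytes):
    """Return (mime, extension) if data starts with a known signature, else None."""
    for magic_bytes, info in SIGNATURES.items():
        if data.startswith(magic_bytes):
            return info
    return None


# Data that starts like a PNG is recognised from its bytes alone.
print(sniff_image(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))
# A Windows executable starts with "MZ" and is rejected,
# whatever Content-Type the server claimed.
print(sniff_image(b"MZ\x90\x00" + b"\x00" * 16))
```

A matching signature only shows how the data begins. It says nothing about whether the rest of the file is well formed, so the code that later decodes the image still has to handle errors. Picking the file extension from the detected type, rather than hard-coding '.jpg', also keeps the saved name consistent with the actual content.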

Alvae
  • Nice answer. Before I accept it: why are only 1024 bytes read from the response content? Is that sufficient to infer the MIME type of an image? Just out of curiosity, how is this determined if the requested media is of another type, say MP4? – hAcKnRoCk Mar 27 '17 at 15:37
  • The number of bytes you need to read to identify the MIME type accurately is hard to pin down, as it depends heavily on the type of file you are reading. Some file formats even place their signature at an offset. 1024 bytes should be plenty for any image type, but I have to admit that value is more Internet folklore than a documented rule. – Alvae Mar 28 '17 at 11:41
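
The point about signatures stored at an offset can be sketched as follows. In MP4 files the marker "ftyp" sits at byte 4, after a 4-byte box size, and in WebP files the marker "WEBP" sits at byte 8, after "RIFF" and a 4-byte length. The function name sniff_with_offset and the short list of MP4 brands are assumptions made for this example; real detectors such as libmagic know far more formats and brands.

```python
# A few common MP4 "major brand" codes; this list is deliberately not exhaustive.
MP4_BRANDS = {b"isom", b"mp41", b"mp42"}


def sniff_with_offset(head: bytes):
    """Guess a MIME type from signatures that do not start at byte 0."""
    # MP4: 4-byte box size, then "ftyp" at offset 4, then the brand at offset 8.
    if head[4:8] == b"ftyp" and head[8:12] in MP4_BRANDS:
        return "video/mp4"
    # WebP: "RIFF" at offset 0, a 4-byte length, then "WEBP" at offset 8.
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "image/webp"
    return None


# Both checks need only the first 12 bytes of the file.
print(sniff_with_offset(b"\x00\x00\x00\x18ftypmp42" + b"\x00" * 8))
print(sniff_with_offset(b"RIFF\x24\x00\x00\x00WEBPVP8 "))
```

For these two formats a dozen bytes are enough, which is consistent with 1024 bytes being a comfortable margin for images.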