Does PyPDF2 take any safety measures when opening an unsafe file?

Question

I'd like to use PyPDF2 (source, docs), but first I want to make sure that it is safe to use. I can't find anything about this in its docs. I want to use it to check that uploaded files are valid PDFs. Users are validated, but I'm concerned that they could still unknowingly upload something unsafe. Is there any way that PyPDF2 would be able to tell, even if it is a PDF, that it is unsafe?

It might be helpful if you identified what security risks you're concerned about. It seems that most PDF security risks come from executing code during rendering. (http://security.stackexchange.com/a/31551/46979 and http://security.stackexchange.com/a/31552/46979 are relevant. The properties of JavaScript mentioned also apply to Python.) PyPDF2 seems to simply be a PDF *parser* and generator. I doubt it actually renders the content (and therefore wouldn't execute code). — jpmc26, Sep 24 '14 at 19:16
Could PyPDF2 evaluate a portion of a file as python or execute the contents of it in some other way? — northben, Sep 24 '14 at 21:08

score 1 · Answer 1 · answered Oct 15 '14 at 19:50

Is there any way that PyPDF2 would be able to tell, even if it is a PDF, that it is unsafe?

No, because PyPDF2 does not contain any security scanning functionality. Harmful content may or may not pass through PyPDF2 unchanged, and whether it remains a danger to your system depends on what other precautions you take.

As jpmc26 said, PyPDF2 is simply a parser/generator, so it is highly unlikely that the contents of a PDF could pose a security threat to PyPDF2 itself.

M_M

I'd like to emphasize the last part: "it is highly unlikely that the contents of a PDF could pose a security threat to [the system via] PyPDF2 itself". The worst thing that can happen is an infinite loop. The second worst thing is a very long parsing time (e.g. quadratic complexity). We work on removing those issues whenever we see them. – Martin Thoma Dec 20 '22 at 22:25

score 0 · Answer 2 · answered Jan 24 '15 at 00:20

If you're concerned about the validity of PDFs: if you try to manipulate a file with PyPDF2 that isn't a valid PDF, it will likely raise an error. As for checking the contents of the PDF, the library itself doesn't do that, but you can write your own functions that check the contents for certain patterns, analyze the streams, or inspect the file in other ways. A good way to start is to create an invalid PDF yourself and work out what you would want to look for. PyPDF2 also has some support for password-protected PDFs, though I've honestly not dealt with that part of the library. PyPDF2 is a pretty powerful tool if you can learn how to use it effectively!
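
As a rough sketch of that idea, the code below first looks at the raw bytes for a few patterns and then tries to parse the upload. The function names `scan_raw_pdf` and `parses_as_pdf` and the list of suspicious names are illustrative assumptions, not part of PyPDF2. `PdfReader` is the reader class in recent PyPDF2 releases (older releases call it `PdfFileReader`).

```python
import io

# PDF names that often accompany active content. This list is an
# illustrative assumption, not an exhaustive or authoritative one.
SUSPICIOUS_NAMES = (b"/JavaScript", b"/JS", b"/OpenAction", b"/Launch", b"/EmbeddedFile")


def scan_raw_pdf(data: bytes) -> dict:
    """Crude byte-level checks; this does not parse the file."""
    return {
        # The header normally sits at the very start of the file.
        "has_header": data[:1024].find(b"%PDF-") != -1,
        # The end-of-file marker normally sits near the end.
        "has_eof_marker": b"%%EOF" in data[-1024:],
        # Plain substring search, so it can produce false positives.
        "suspicious_names": [n.decode() for n in SUSPICIOUS_NAMES if n in data],
    }


def parses_as_pdf(data: bytes) -> bool:
    """Return True if PyPDF2 can open the bytes and walk the page tree."""
    from PyPDF2 import PdfReader  # lazy import; requires PyPDF2 to be installed

    try:
        reader = PdfReader(io.BytesIO(data))
        len(reader.pages)  # forces the page tree to be read
    except Exception:
        # Malformed input can fail in several ways; treat any failure as "not valid".
        return False
    return True
```

A byte scan like this is easy to evade: names can be written with hex escapes or sit inside compressed object streams, so an empty `suspicious_names` list proves nothing. Likewise, `parses_as_pdf` only says that the file parses, not that it is harmless.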

score 0 · Answer 3 · answered Dec 20 '22 at 22:30

PyPDF2 doesn't execute parts of the PDF. It just parses it.

Bad things that can happen:

Infinite loops
Slow parsing e.g. due to quadratic runtime

We work hard on fixing those issues whenever we notice them.
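
Until such an issue is fixed, a caller can limit the damage by parsing in a separate process with a time limit. The following is a minimal sketch of that idea, not something PyPDF2 provides; `run_with_timeout`, `PARSE_SNIPPET` and the time limit are illustrative assumptions.

```python
import subprocess
import sys

# Child script: parse the file named in argv[1] and print its page count.
# It needs PyPDF2 installed in the same interpreter.
PARSE_SNIPPET = (
    "import sys\n"
    "from PyPDF2 import PdfReader\n"
    "print(len(PdfReader(sys.argv[1]).pages))\n"
)


def run_with_timeout(snippet: str, arg: str, timeout_s: float):
    """Run `snippet` in a fresh interpreter.

    Returns the child's stdout (stripped), or None if the child
    timed out or exited with a non-zero status.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", snippet, arg],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the timeout expires.
        return None
    return proc.stdout.strip() if proc.returncode == 0 else None
```

Usage would look like `pages = run_with_timeout(PARSE_SNIPPET, "upload.pdf", timeout_s=5)`, with `None` meaning the file could not be parsed in time or at all. A separate process is used because a parser stuck in a loop cannot be reliably interrupted from within the same Python thread.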

Supply chain vulnerabilities are certainly another topic. PyPDF2 is among the top 1% of packages on PyPI, and thus the maintainers are required to use security keys. I review all PRs, and I would not accept anything that allows code execution from the PDF itself, opens network connections, or looks suspicious.

FYI: I'm the current maintainer of PyPDF2.
