I'm having a hard time reading a PDF from the internet into a Python PdfFileReader object.
My code works for the first URL but not for the second, and I don't know how to fix it.
I can see that the first URL refers to a .pdf file, while the second URL returns the PDF as 'application data' in the response body.
So I think this might be the issue. Does anybody know how to fix it so the code also works for the second URL?
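One way to test the content-type theory is to look at the downloaded bytes themselves, since a PDF body starts with the `%PDF-` marker regardless of how the server labels it. A minimal sketch, where the helper name `looks_like_pdf` is made up for illustration and `data` is assumed to be `response.content`:

```python
def looks_like_pdf(data):
    """Return True if the byte string appears to carry a PDF header."""
    # A PDF normally begins with "%PDF-". Some files have a few stray bytes
    # in front of it, so this looks in the first 1024 bytes instead of
    # insisting on offset 0. That window size is an assumption.
    return b"%PDF-" in data[:1024]
```

If this returns True for both downloads, the way the server labels the response is probably not what makes the second call fail.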
from pyPdf import PdfFileWriter, PdfFileReader
from io import BytesIO
import requests
def test(url, filename):
    # Download the PDF and wrap the raw bytes in a file-like object
    response = requests.get(url)
    pdf_file = BytesIO(response.content)
    existing_pdf = PdfFileReader(pdf_file)
    # Copy the first page into a new document and write it to disk
    page = existing_pdf.getPage(0)
    output = PdfFileWriter()
    output.addPage(page)
    outputStream = open(filename, "wb")
    output.write(outputStream)
    outputStream.close()
test('https://s21.q4cdn.com/374334112/files/doc_downloads/test.pdf','works.pdf')
test('https://eservices.minfin.fgov.be/mym-api-rest/finform/pdf/2057','crashes.pdf')
This is the stack trace I get from the second call to the test function:
D:\scripts>test.py
Traceback (most recent call last):
  File "D:\scripts\test.py", line 21, in <module>
    test('https://eservices.minfin.fgov.be/mym-api-rest/finform/pdf/2057','crashes.pdf')
  File "D:\scripts\test.py", line 10, in test
    page = existing_pdf.getPage(0)
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 450, in getPage
    self._flatten()
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 596, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "C:\Python27\lib\site-packages\pyPdf\generic.py", line 480, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\Python27\lib\site-packages\pyPdf\generic.py", line 165, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 655, in getObject
    raise Exception, "file has not been decrypted"
Exception: file has not been decrypted
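
The exception message points at encryption rather than at the content type. A rough way to check that using only the standard library is to search the raw bytes for an `/Encrypt` entry, which encrypted PDFs carry in their trailer. This is only a sketch: the helper name `looks_encrypted` is made up for illustration, `data` is assumed to be `response.content`, and a plain byte search can misjudge unusual files.

```python
import re

# Match the /Encrypt key but not longer names such as /EncryptMetadata.
_ENCRYPT_KEY = re.compile(br"/Encrypt\b")

def looks_encrypted(data):
    """Heuristic: True if the raw PDF bytes contain an /Encrypt entry."""
    # This does not parse the PDF, so it is a hint, not proof.
    return _ENCRYPT_KEY.search(data) is not None
```

If this returns True for the second download only, the file would presumably have to be decrypted before `getPage` can be called. That is an inference from the exception message and has not been checked against this URL.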