I'm having a hard time reading a PDF from the internet into a Python PdfFileReader object.

My code works for the first URL but not for the second, and I don't know how to fix it.

I can see that in the first example the URL refers to a .pdf file, while for the second URL the PDF is returned as 'application data' in the HTML body.

So I think this might be the issue. Does anybody know how to fix it so the code also works for the second URL?

from pyPdf import PdfFileWriter, PdfFileReader
from io import BytesIO
import requests

def test(url,filename):
  response=requests.get(url)
  pdf_file = BytesIO(response.content)
  existing_pdf = PdfFileReader(pdf_file)

  page = existing_pdf.getPage(0)

  output = PdfFileWriter()
  output.addPage(page)

  outputStream = file(filename, "wb")
  output.write(outputStream)
  outputStream.close()


test('https://s21.q4cdn.com/374334112/files/doc_downloads/test.pdf','works.pdf')
test('https://eservices.minfin.fgov.be/mym-api-rest/finform/pdf/2057','crashes.pdf')

This is the stack trace I get on the second call to the test function:

D:\scripts>test.py
Traceback (most recent call last):
  File "D:\scripts\test.py", line 21, in <module>
    test('https://eservices.minfin.fgov.be/mym-api-rest/finform/pdf/2057','crashes.pdf')
  File "D:\scripts\test.py", line 10, in test
    page = existing_pdf.getPage(0)
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 450, in getPage
    self._flatten()
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 596, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "C:\Python27\lib\site-packages\pyPdf\generic.py", line 480, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\Python27\lib\site-packages\pyPdf\generic.py", line 165, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 655, in getObject
    raise Exception, "file has not been decrypted"
Exception: file has not been decrypted
Bosiwow
It appears the reply from the second link is a base64-encoded string that I somehow have to read into the BytesIO object. – Bosiwow Feb 08 '18 at 16:12
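
If the body really is base64 text rather than raw PDF bytes, as the comment above suspects, it can be decoded before it is wrapped in BytesIO. The following is only a sketch under that assumption: the helper name pdf_stream_from_body is hypothetical, it is written for Python 3, and it has not been checked against what the second URL actually returns.

```python
import base64
import binascii
from io import BytesIO


def pdf_stream_from_body(body):
    """Return a BytesIO of PDF bytes, base64-decoding the body if needed.

    `body` is the raw response body as bytes (e.g. response.content).
    Hypothetical helper, not part of pyPdf, PyPDF2, or requests.
    """
    # A raw PDF starts with the "%PDF-" header; pass it through untouched.
    if body.lstrip().startswith(b"%PDF-"):
        return BytesIO(body)

    # Otherwise assume base64 text. Drop whitespace and line breaks first,
    # because validate=True rejects any character outside the alphabet.
    compact = b"".join(body.split())
    try:
        decoded = base64.b64decode(compact, validate=True)
    except (binascii.Error, ValueError):
        raise ValueError("body is neither a raw PDF nor valid base64")

    # Make sure the decoded payload really is a PDF before handing it on.
    if not decoded.startswith(b"%PDF-"):
        raise ValueError("decoded body does not look like a PDF")
    return BytesIO(decoded)
```

In the question's test function this would replace the line that builds pdf_file, for example pdf_file = pdf_stream_from_body(response.content).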

1 Answer

I found a solution: I imported PyPDF2 instead of pyPdf, so the failure was probably a bug in pyPdf.
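
A minimal sketch of what that swap could look like, assuming PyPDF2 1.x or 2.x, where the pyPdf-style names PdfFileReader, PdfFileWriter, getPage, and addPage still exist (PyPDF2 3.0 removed them). The function name save_first_page is hypothetical, and the download step is left to the caller.

```python
from io import BytesIO


def save_first_page(pdf_bytes, filename):
    """Write page 0 of the PDF held in pdf_bytes to filename."""
    # Imported here so the sketch loads even where PyPDF2 is missing.
    # Assumption: PyPDF2 1.x/2.x, which keeps the pyPdf class names.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    reader = PdfFileReader(BytesIO(pdf_bytes))

    writer = PdfFileWriter()
    writer.addPage(reader.getPage(0))

    # A with-block closes the file even if writing fails.
    with open(filename, "wb") as output_stream:
        writer.write(output_stream)
```

It would be called with the downloaded body, for example save_first_page(requests.get(url).content, 'out.pdf').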

Bosiwow