I am using Python 3.5.1 (32-bit). I wrote the following program to add a page number to every page of my PDF files using PyPDF2 and reportlab:

#import modules
from os import listdir
from PyPDF2 import PdfFileWriter, PdfFileReader
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4
#initial values of variable declarations
PDFlist=[]
X_value=460
Y_value=820
#Make a list of all files in the directory
filelist = listdir()
#Make a list of all pdf files in the directory (names ending in ".pdf")
PDFlist = [filename for filename in filelist if filename.endswith('.pdf')]
# Give the horizontal position for the page number (Enter = use default value of 460)
User = input('Give horizontal position page number (ENTER = default 460): ')
if User != "":
    X_value=int(User)
# Give the vertical position for the page number (Enter = use default value of 820)
User = input('Give vertical position page number (ENTER = default 820): ')
if User != "":
    Y_value=int(User)

for i in range(0,len(PDFlist)):
    filename=PDFlist[i]

    # read the PDF
    existing_pdf = PdfFileReader(open(filename, "rb"))
    print("File: "+filename)
    # count the number of pages
    number_of_pages = existing_pdf.getNumPages()
    print("Number of pages detected:"+str(number_of_pages))
    output = PdfFileWriter()

    for k in range(0,number_of_pages):
        packet = io.BytesIO()

        # create a new PDF with Reportlab
        can = canvas.Canvas(packet, pagesize=A4)
        Pagenumber=" Page "+str(k+1)+"/"+str(number_of_pages)
        # we first make a white rectangle to cover any existing text in the pdf
        can.setFillColorRGB(1,1,1)
        can.setStrokeColorRGB(1,1,1)
        can.rect(X_value-10,Y_value-5,120,20,fill=1)
        # set the font and size
        can.setFont("Helvetica",14)
        # choose color of page numbers (red)
        can.setFillColorRGB(1,0,0)
        can.drawString(X_value, Y_value, Pagenumber)
        can.save()
        print(Pagenumber)

        #move to the beginning of the BytesIO buffer
        packet.seek(0)
        new_pdf = PdfFileReader(packet)
        # add the "watermark" (which is the new pdf) on the existing page
        page = existing_pdf.getPage(k)
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
        k=k+1
    # finally, write "output" to a real file

    ResultPDF="Output/"+filename
    outputStream = open(ResultPDF, "wb")
    output.write(outputStream)
    outputStream.close()
    i=i+1

This program works fine for quite a number of PDF files (warnings are sometimes generated, such as 'PdfReadWarning: Superfluous whitespace found in object header b'16' b'0' [pdf.py:1666]', but the resulting output file is fine for me). However, the program fails on some PDF files, even though these files are perfectly readable and editable in Adobe Acrobat. The error seems to occur mostly with scanned PDF files, but not with all of them (I have also numbered scanned PDF files without any error). I get the following error message (the first 8 lines are the output of my own print commands):

File: Scanned file.pdf
Number of pages detected:6
 Page 1/6
 Page 2/6
 Page 3/6
 Page 4/6
 Page 5/6
 Page 6/6
PdfReadWarning: Object 25 1 not defined. [pdf.py:1629]
Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\Sourcecode\PDFPager.py", line 83, in <module>
    output.write(outputStream)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 482, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 556, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 556, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 577, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 1631, in getObject
    raise utils.PdfReadError("Could not find object.")
PyPDF2.utils.PdfReadError: Could not find object.

Apparently the pages are merged with the PDF created by reportlab (see the lines up to Page 6/6), but in the end PyPDF2 cannot generate an output PDF file (I get an unreadable output file of 0 bytes). Can somebody shed some light on how to resolve this? I searched the internet but couldn't find an answer.
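
A 0-byte file is what one would expect if open(ResultPDF, "wb") creates the file and output.write() then raises before writing anything, which is consistent with the traceback above. A minimal sketch of one way to avoid leaving such a file behind is to serialize into memory first and only open the real file once that succeeds. The helper name write_via_buffer is hypothetical; it assumes only that the serializer accepts a binary stream, as PdfFileWriter.write does.

```python
import io


def write_via_buffer(serialize, path):
    """Serialize into memory first; create the file on disk only on success.

    `serialize` is any callable that takes a binary stream, for example the
    bound method `output.write` of a PyPDF2 PdfFileWriter (an assumption
    about how the caller uses this helper).
    """
    buffer = io.BytesIO()
    # If serialization fails, the exception propagates from here and the
    # file at `path` is never created or truncated.
    serialize(buffer)
    with open(path, "wb") as handle:
        handle.write(buffer.getvalue())
```

In the program above this would be called as write_via_buffer(output.write, ResultPDF). It does not fix the underlying read error, it only keeps a failed run from producing an empty output file.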

Max Eisert
  • I had the same error message when calling the same function. Is your PDF fillable? The problem was resolved when I converted the PDF to "regular" read-only PDF. – SYK Aug 01 '18 at 12:48
  • In the meantime I also found a workaround: printing the PDF file via a PDF printer solves the problem. – Max Eisert Aug 02 '18 at 16:40
  • haha, yes, that is indeed equivalent. – SYK Aug 03 '18 at 16:02
  • I think that before merging files, you should first check whether they are broken, and only then merge them. If files are broken or not fully downloaded, merging will not succeed. – GoingMyWay Jun 20 '19 at 06:25
  • The files were not broken. I could read them with pdf reader without a problem. I could however not merge them using the python code. – Max Eisert Jun 21 '19 at 09:46

4 Answers

Make the following changes in pdf.py:

On line 1633 of pdf.py, uncomment the if self.strict check:

    if self.strict:
        raise utils.PdfReadError("Could not find object.")

On line 501 of pdf.py, wrap the existing write calls in a try/except block:

    try:
        obj.writeToStream(stream, key)
        stream.write(b_("\nendobj\n"))
    except:
        pass

Cheers.

bmg
  • Cool. This fix should definitely be pushed into master. However seems pypdf2 is unmaintained now :( – Shaohua Li Sep 16 '19 at 14:16
  • This same fix fixes the same problem on pypdf4; I posted a link to this topic on the thread for the relevant bug there. pypdf4 seems less inactive than pypdf2. – Watusimoto Jan 02 '20 at 23:40
  • @Watusimoto thanks for letting me know! I added a comment below it. Let's hope the repo owner notices it. – bmg Jan 05 '20 at 13:05
  • @bmg - I also posted this question on the associated [Github issue](https://github.com/mstamy2/PyPDF3/issues/7), feel free to respond here or there and I'll X-post. We're looking to incorporate your workaround to get around this issue but are not sure about the consequences. It looks like an error is simply being ignored and an intentionally (one would assume) commented-out conditional is being uncommented. Do you have an understanding of why this fixes the issue and whether it will result in content being removed from a document? – mwakerman Jun 17 '20 at 03:53
  • @mwakerman looks like someone deleted the PyPDF3 repo... If you have the content of the issue, can you open that issue in PyPDF4 and put it here too? – bmg Dec 31 '21 at 02:30
  • Does this cause any limitation when the PDF file is later read with PdfFileReader? I filled in a form automatically, and then I have to download it and fill in other fields manually. When I then read it with PdfFileReader I have problems, because it no longer seems to recognize the fields. – SahFra98 Mar 22 '22 at 10:17
  • @SahFra98 I have stopped using PyPDF versions altogether because of this problem and the codebase is completely abandoned by its developer. I have switched over to PikePDF and would recommend doing so. If you just need to merge a few things you can check out my repository here for reference: https://github.com/gonultasbu/pdf_merge. – bmg Mar 22 '22 at 14:23

Using strict=False got things working for me.

from PyPDF2 import PdfFileMerger

pdfs = [r'file 1.pdf', r'file 2.pdf']

merger = PdfFileMerger(strict=False)

for pdf in pdfs:
    merger.append(pdf)

merger.write(r"thanks mate.pdf")
Ninga
  • Hey, yes, I just reran with it set to True and the doc was still created, just with a bunch of warnings. I thought it fixed an issue with the new doc not being created; however, my issue must have been different. – Ninga Jun 19 '19 at 22:39
  • I think that before merging files, you should first check whether they are broken, and only then merge them. If files are broken or not fully downloaded, merging will not succeed. – GoingMyWay Jun 20 '19 at 06:25
  • I'm trying to merge only some pages, not whole PDF files. I still get the error with strict=False. Modifying pdf.py with the changes described above works. So pdf.py never got corrected? – Sudhik Jan 07 '21 at 18:48

Here is my solution. Try to write the file into a dummy BytesIO stream to check whether it is broken.

    try:
        reader = PdfFileReader(input_file)
        print("Opening '{}', pages={}".format(file_path, reader.getNumPages()))
        # Try to write it into a dummy BytesIO stream to check whether the pdf is broken
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(0))
        writer.write(io.BytesIO())
    except PdfReadError:
        print("Error reading '{}'".format(file_path))
        continue
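
The check above could be wrapped in helpers so that a batch job sorts files into those PyPDF2 can re-serialize and those it cannot. This is only a sketch under stated assumptions: the names can_rewrite and split_by_check are hypothetical, it uses the legacy PyPDF2 1.x API that appears throughout this thread, and, unlike the snippet above, it tries every page rather than only the first.

```python
import io


def can_rewrite(path):
    """Return True if PyPDF2 can re-serialize every page of `path`."""
    # Imported lazily so split_by_check stays usable without PyPDF2.
    # These are the legacy PyPDF2 1.x names used in this thread.
    from PyPDF2 import PdfFileReader, PdfFileWriter
    from PyPDF2.utils import PdfReadError

    try:
        with open(path, "rb") as handle:
            reader = PdfFileReader(handle)
            writer = PdfFileWriter()
            for index in range(reader.getNumPages()):
                writer.addPage(reader.getPage(index))
            # Writing to a throwaway in-memory stream triggers the same
            # object lookups as writing the real output file.
            writer.write(io.BytesIO())
    except PdfReadError:
        return False
    return True


def split_by_check(paths, check):
    """Partition `paths` into (passed, failed) according to `check`."""
    passed, failed = [], []
    for path in paths:
        (passed if check(path) else failed).append(path)
    return passed, failed
```

A call such as split_by_check(PDFlist, can_rewrite) would then give one list to process and one list to report or repair by other means, such as reprinting through a PDF printer as mentioned in the comments on the question.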

    
Yue Zhang

I just encountered the same error with PyPDF2. It's a problem related to the PDF's version.

I switched to the pikepdf package and the issue went away.

You can find documentation here

Zuko