Merging PDF files using Python and PyPDF2 throws a TypeError

Question

I am using Python 3.6.5 to merge PDFs, but the code below throws "TypeError: 'NumberObject' object is not subscriptable". What am I doing wrong? When I comment out the merger.append line, the file paths print correctly.

import webbrowser
import os
from PyPDF2 import PdfFileMerger, PdfFileReader

path = 'C:/test/pdfs'
merger = PdfFileMerger()
for pdf in os.listdir(path):
    merger.append(PdfFileReader(open(os.path.join(path,pdf), 'rb')))
    print(os.path.join(path,pdf))
merger.write(path+'/merged.pdf')
merger.close()
webbrowser.open_new(path+'/merged.pdf')

  File "C:\test\pdftest.py", line 9, in <module>
    merger.append(PdfFileReader(open(os.path.join(path,pdf), 'rb')))
  File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\pdf.py", line 1084, in __init__
    self.read(stream)
  File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\pdf.py", line 1805, in read
    assert xrefstream["/Type"] == "/XRef"
TypeError: 'NumberObject' object is not subscriptable

When I change the merger.append to take a file path, I get:

  File "C:\test\pdftest.py", line 9, in <module>
    merger.append(os.path.join(path,pdf))
  File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\merger.py", line 203, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\merger.py", line 133, in merge
    pdfr = PdfFileReader(fileobj, strict=self.strict)
  File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\pdf.py", line 1084, in __init__
    self.read(stream)
  File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\pdf.py", line 1805, in read
    assert xrefstream["/Type"] == "/XRef"
TypeError: 'NumberObject' object is not subscriptable

UPDATE: It looks like one of the PDFs in the folder was causing this. The only thing different with that PDF is that it uses Type 1 font whereas the other PDFs use TrueType font. Does anyone know a workaround or fix for this?

File "C:\test\pdftest.py", line 9, in merger.append(PdfFileReader(open(os.path.join(path,pdf), 'rb'))) File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\pdf.py", line 1084, in __init__ self.read(stream) File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\pdf.py", line 1805, in read assert xrefstream["/Type"] == "/XRef" TypeError: 'NumberObject' object is not subscriptable — krazyboi, Apr 06 '18 at 22:03
[The documentation](https://pythonhosted.org/PyPDF2/PdfFileMerger.html) says that PdfFileMerger.append takes a file object or a pathname, not a PdfFileReader. — Dan D., Apr 06 '18 at 22:04
Some of the files in `path` are not files and are not PDF files. You need to filter those out from the result of `os.listdir(path)`. — Dan D., Apr 06 '18 at 22:11
@DanD. I've updated the post to show the traceback when I change PdfFileMerger.append to take a pathname. Also, the files in path are all PDF files. I created a new folder and placed the PDFs in there manually. — krazyboi, Apr 06 '18 at 22:16
@DanD. I tried moving PDFs into the folder one by one and running the script and it is one of the PDFs that cause this error. I wonder why a particular PDF file is causing this error. I see that the only difference in properties of the PDF file is that the one that causes the error uses a Type 1 font, whereas the others use a TrueType font. Can this be the cause? — krazyboi, Apr 06 '18 at 22:21
I tried using a couple of pdfs that I had, it worked for me. I used the same code. If you want I can try using the pdfs that you have. — Afsan Abdulali Gujarati, Apr 06 '18 at 23:38
@AfsanAbdulaliGujarati Thank you for helping, but the PDFs I am using are private. Thank you though! — krazyboi, Apr 09 '18 at 17:22
I found that there was garbage before the header on the top of the PDF file, looked like Javascript, removed that and it started working. — GuySoft, Jun 05 '19 at 10:32
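Two checks suggested in the comments above can be sketched with the standard library alone: keeping only regular *.pdf files from the folder, and looking for bytes in front of the %PDF- header. This is a rough sketch, not a confirmed fix for the error in the question. The names list_pdf_candidates and header_offset, the 1024-byte probe size, and the decision to skip merged.pdf (which the script in the question writes into the same folder) are all assumptions made for illustration.

```python
import os


def list_pdf_candidates(folder, output_name="merged.pdf"):
    """Return sorted paths of regular *.pdf files in folder.

    Hypothetical helper: skips sub-directories, non-PDF names, and the
    merge output itself so a second run does not merge its own result.
    """
    paths = []
    for name in sorted(os.listdir(folder)):
        full = os.path.join(folder, name)
        if not os.path.isfile(full):
            continue  # directories and other non-files
        lowered = name.lower()
        if not lowered.endswith(".pdf") or lowered == output_name.lower():
            continue
        paths.append(full)
    return paths


def header_offset(pdf_path, probe_size=1024):
    """Return the offset of b'%PDF-' in the first probe_size bytes, or -1.

    0 means the file starts with the header; a positive value means
    something precedes it (the probe size is an arbitrary assumption).
    """
    with open(pdf_path, "rb") as handle:
        return handle.read(probe_size).find(b"%PDF-")
```

Printing header_offset(p) for each path from list_pdf_candidates('C:/test/pdfs') would flag files where the header is missing or preceded by other bytes, which are worth inspecting before they are passed to merger.append.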

score -1 · Answer 1 · answered Dec 31 '20 at 17:09

This seems to be caused by either unrecognised or bad PDF formatting. I'm no PDF expert but it seems PyPDF2 is complaining about a record in the XRef table. I've found the easiest way to get around this is to reformat the PDF.

What I do is wrap merger.append(PdfFileReader(file)) in a try block, and if the exception message contains 'NumberObject' object is not subscriptable, I "convert" the PDF with LibreOffice in headless mode via subprocess:

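import subprocess
# The quoted parts are joined into one command string for Windows.
command = [r'"C:\Program Files\LibreOffice\program\soffice.bin"',
           '--convert-to', 'pdf', '--outdir', f'"{dest_file_path}"', f'"{file_name}"']
pdf_convert = subprocess.Popen(' '.join(command))
pdf_convert.wait()  # wait for the conversion to finish before merging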

A note on using LibreOffice and subprocess: For whatever reason, I've found passing as a list causes an access denied error for me in Windows so that's why I do the join instead.
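The try-then-convert flow described above might look roughly like the sketch below. It is an illustration under assumptions rather than the answerer's actual code: append_with_fallback and convert_with_libreoffice are hypothetical names, the soffice.bin path is the one quoted in the answer, and the converted file is assumed to land in out_dir under the same base name. The command is passed as a single string, as in the answer, which is Windows-specific behaviour.

```python
import os
import subprocess

# Path taken from the answer above; adjust for the local installation.
SOFFICE = r"C:\Program Files\LibreOffice\program\soffice.bin"

ERROR_TEXT = "'NumberObject' object is not subscriptable"


def convert_with_libreoffice(pdf_path, out_dir):
    """Re-save pdf_path through LibreOffice and return the new path.

    Assumes the output keeps the input's base name inside out_dir.
    """
    command = f'"{SOFFICE}" --convert-to pdf --outdir "{out_dir}" "{pdf_path}"'
    subprocess.run(command, check=True)  # single string: Windows only
    return os.path.join(out_dir, os.path.basename(pdf_path))


def append_with_fallback(merger, pdf_path, convert):
    """Append pdf_path to merger; on the known TypeError, convert and retry.

    merger is any object with an append(path) method, such as a
    PdfFileMerger. convert is a callable returning the converted path.
    """
    try:
        merger.append(pdf_path)
    except TypeError as exc:
        if ERROR_TEXT not in str(exc):
            raise  # a different TypeError: do not hide it
        merger.append(convert(pdf_path))
```

With PyPDF2 this could be called as append_with_fallback(merger, pdf, lambda p: convert_with_libreoffice(p, 'C:/test/converted')), where the output folder is again only a placeholder. Passing the converter in as a callable keeps the retry logic independent of LibreOffice, so another repair tool can be substituted.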
