When I try to extract numbers from a PDF using PyPDF2, I get KeyError: '/Contents'. Here is the code:

import PyPDF2 as pdf
fhand = open('filepdf.pdf', 'rb')
reader = pdf.PdfFileReader(fhand)
if reader.isEncrypted:
    pass
else:
    for i in range(reader.getNumPages()):
        for word in reader.getPage(i).extractText().split():
            if word.isdigit():
                print(word)

The code works fine with other PDF files. Here is the traceback:

Traceback (most recent call last):
  File "C:\Users\Root\AppData\Local\Programs\Python\Python38-32\lib\runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Root\AppData\Local\Programs\Python\Python38-32\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "c:\Users\Root\.vscode\extensions\ms-python.python-2020.4.76186\pythonFiles\lib\python\debugpy\no_wheels\debugpy\__main__.py", line 45, in <module>
    cli.main()
  File "c:\Users\Root\.vscode\extensions\ms-python.python-2020.4.76186\pythonFiles\lib\python\debugpy\no_wheels\debugpy/..\debugpy\server\cli.py", line 430, in main
    run()
  File "c:\Users\Root\.vscode\extensions\ms-python.python-2020.4.76186\pythonFiles\lib\python\debugpy\no_wheels\debugpy/..\debugpy\server\cli.py", line 267, in run_file
    runpy.run_path(options.target, run_name=compat.force_str("__main__"))
  File "C:\Users\Root\AppData\Local\Programs\Python\Python38-32\lib\runpy.py", line 263, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\Root\AppData\Local\Programs\Python\Python38-32\lib\runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\Root\AppData\Local\Programs\Python\Python38-32\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "c:\Users\Root\Desktop\test\test.py", line 9, in <module>
    for word in reader.getPage(i).extractText().split():
  File "C:\Users\Root\AppData\Local\Programs\Python\Python38-32\lib\site-packages\PyPDF2\pdf.py", line 2593, in extractText
    content = self["/Contents"].getObject()
  File "C:\Users\Root\AppData\Local\Programs\Python\Python38-32\lib\site-packages\PyPDF2\generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
KeyError: '/Contents'
John Smith
Endre
Where is the full traceback? It contains more useful details than "KeyError", and it will help us help you. That said, the problem could well be specific to the PDF file you are trying to read: not *all* text can *always* be extracted from *any* PDF, and yours might be one of the more problematic ones. – Jongware May 10 '20 at 19:56

1 Answer

pdfminer worked for me; PyPDF2 raised an error at first.

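import re
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

file = 'filepdf.pdf'  # path to the PDF (file name taken from the question)
output_string = StringIO()
with open(file, 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Render every page into output_string
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

# Clean up the text once, after all pages have been processed
string = output_string.getvalue()
string = re.sub('\n', '', string)   # drop line breaks
string = re.sub('  +', ' ', string) # collapse runs of spaces
print(string)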
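
If switching libraries is not an option, a minimal sketch that stays with PyPDF2 might skip pages that have no '/Contents' entry, since that is the key the traceback shows extractText() looking up. This assumes the old PyPDF2 1.x API used in the question (PdfFileReader, getPage, extractText) and that page objects support dict-style membership tests; the helper names digits_on_page and print_numbers are hypothetical, not part of PyPDF2.

```python
def digits_on_page(page):
    """Return the digit-only tokens on a page, or [] if it has no content stream."""
    # PyPDF2 page objects behave like dicts; extractText() indexes "/Contents"
    # directly, so a page without that key raises KeyError.
    if '/Contents' not in page:
        return []
    return [word for word in page.extractText().split() if word.isdigit()]


def print_numbers(path):
    """Print every digit-only token in the PDF at `path` (hypothetical helper)."""
    import PyPDF2 as pdf  # imported here so digits_on_page works without it

    with open(path, 'rb') as fhand:
        reader = pdf.PdfFileReader(fhand)
        if reader.isEncrypted:
            return
        for i in range(reader.getNumPages()):
            for word in digits_on_page(reader.getPage(i)):
                print(word)
```

Calling print_numbers('filepdf.pdf') would then print the numbers from every page that has content and silently skip the ones that do not. Whether skipped pages are truly empty or just stored in a way PyPDF2 cannot read is something to check against the file itself.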