I've been struggling to read text from a PDF in Python.

I need PyPDF2 to find a given string and return the reference number placed next to it.

This is the code I'm trying:

import os

import PyPDF2

jobpath = r"C:\Scrpts\scr\testPDF"

for name in os.listdir(jobpath):
    if name.endswith('.pdf'):
        filename = os.path.join(jobpath, name)
        with open(filename, 'rb') as pdf_file:
            pdfReader1 = PyPDF2.PdfFileReader(pdf_file)
            pdfReader1._override_encryption = True
            pageObj1 = pdfReader1.getPage(0)

            # Text of the first page
            text1 = pageObj1.extractText()

            # Keep at most 30 characters after the first "Reference",
            # then cut at the first line break
            text1 = text1.partition("Reference")[2][0:30]
            refNum = text1.split('Reference')[-1].split('\n')[0]
            print(filename + ' ' + refNum)

But this prints a series of "Superfluous whitespace" warnings, and the reference number comes back empty:

PdfReadWarning: Superfluous whitespace found in object header b'1' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'2' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'3' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'48' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'95' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'113' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'126' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'129' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'140' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'143' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'146' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'149' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'152' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'155' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'158' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'161' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'164' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'167' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'170' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'173' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'184' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'187' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'190' b'0' [pdf.py:1668]
PdfReadWarning: Superfluous whitespace found in object header b'46' b'0' [pdf.py:1668]
C:\Scrpts\scr\testPDF\testPDF.pdf 

I have used a similar script in the past without any problems.

I searched for similar issues but could not find a solution.
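
One thing that may be worth ruling out, independent of the warnings: the slicing in the script keeps only what follows "Reference" on the same line, so it yields an empty string whenever the extracted text has a line break right after the label. The sketch below is a hypothetical alternative, not part of the original script. The name `find_reference` and the pattern are assumptions about what a reference number looks like (letters, digits, `-` or `/`), and would need adjusting to the real documents.

```python
import re

# Assumption: the reference is a run of letters, digits, "-" or "/" that
# follows the label, optionally separated by ":" or "#", spaces, or line
# breaks. The lookahead skips words that merely start with "Reference",
# such as "References".
_REFERENCE_RE = re.compile(
    r"Reference(?![A-Za-z])\s*[:#]?\s*([A-Za-z0-9][A-Za-z0-9/-]*)"
)


def find_reference(text):
    """Return the token that follows "Reference" in text, or None."""
    match = _REFERENCE_RE.search(text)
    return match.group(1) if match else None
```

Called as `find_reference(text1)`, it returns `None` when no label is found, which makes a missing match easier to tell apart from an empty value.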

darkspeed
  • I have the same issue (and no solution!) if I execute `getPage(0)` on a single PDF. You are iterating over files; have you tried extracting the text of a single PDF to see whether you still get the warning? – cards Sep 19 '21 at 17:19
  • I think the issue comes from the PdfFileReader() call: set its strict argument to False. These are warnings, not fatal errors. I answered the same problem in another question; check it if you need more explanation: https://stackoverflow.com/questions/70334338/errors-while-using-pyttsx3-pypdf2-for-making-an-audio-book – Wuzardor Feb 15 '22 at 23:59
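
A minimal sketch of the `strict=False` suggestion from the comment above, assuming the PyPDF2 1.x API used in the question (`PdfFileReader`, `getPage`, `extractText`). The function name `first_page_text` and its `reader_factory` parameter are hypothetical and introduced only for illustration; the parameter exists so the function can be exercised without a real PDF.

```python
def first_page_text(path, reader_factory=None):
    """Return the text of page 0, opening the PDF in non-strict mode."""
    if reader_factory is None:
        # PyPDF2 1.x class, as in the question; imported lazily so this
        # module still loads when PyPDF2 is not installed.
        from PyPDF2 import PdfFileReader
        reader_factory = PdfFileReader

    with open(path, "rb") as pdf_file:
        # strict=False asks the reader to tolerate minor deviations from
        # the PDF specification instead of warning about each one.
        reader = reader_factory(pdf_file, strict=False)
        # Extract while the file is still open: pages are read lazily.
        return reader.getPage(0).extractText()
```

In the loop from the question this would replace the `PdfFileReader(...)`, `getPage(0)` and `extractText()` lines, for example `text1 = first_page_text(filename)`. Whether the reference number then appears still depends on what `extractText()` returns for that page.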

0 Answers