I copy-pasted some Lorem Ipsum into a Word .docx file, saved it as a PDF, and tried to run the following script to test extracting text from a PDF.
    from pyPdf import PdfFileReader

    if (fileExtension == ".PDF"):
        pdfDoc = PdfFileReader(file(FOLDER+j, "rb"))
        fileText = ""
        print("Processing a PDF file")
        for pdfpage in range(0,pdfDoc.getNumPages()):
            fileText = fileText + pdfDoc.getPage(pdfpage).extractText()
        fileText = cleantext(fileText)
        fileText = fileText.splitlines(True)
    else:
        print("PLEASE CHOOSE A .PDF FILE")
It raises this particular error for any PDF file. However, when I run the code line by line, it does seem to work. So if I first run
    for pdfpage in range(0,pdfDoc.getNumPages()):
        fileText = fileText + pdfDoc.getPage(pdfpage).extractText()
then the next line, and then the last line that assigns fileText, it works. So what is happening that I cannot see?
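To make the manual sequence explicit, here is a minimal sketch of the order that works when run by hand: concatenate the text of every page first, then clean once, then split into lines. It assumes the two post-processing steps are meant to run after the loop rather than inside it. `FakePage`, `FakeReader`, `extract_lines`, and the `clean` parameter are hypothetical stand-ins introduced only so the sketch runs without pyPdf or a real PDF; they mimic the three methods used in the script and are not part of the library.

```python
# Hypothetical stand-ins for the pyPdf reader so the sketch runs without a PDF.
class FakePage(object):
    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader(object):
    def __init__(self, pages):
        self._pages = [FakePage(t) for t in pages]

    def getNumPages(self):
        return len(self._pages)

    def getPage(self, index):
        return self._pages[index]


def extract_lines(reader, clean=lambda s: s):
    """Join the text of all pages, clean it once, then split it into lines."""
    text = ""
    for page_number in range(reader.getNumPages()):
        # text is still a string here, so string concatenation is valid
        text = text + reader.getPage(page_number).extractText()
    # Post-processing runs once, after the loop has finished
    text = clean(text)
    return text.splitlines(True)


lines = extract_lines(FakeReader(["Lorem ipsum\n", "dolor sit amet\n"]))
print(lines)
```

If the cleaning and splitting steps ran inside the loop instead, `fileText` would already be a list on the second pass, and adding a string to it would fail, so the placement of those two lines relative to the loop is worth checking.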