I copy-pasted some Lorem Ipsum into a Word .docx file, saved it as a PDF, and tried to run the following script to test extracting text from a PDF.

from pyPdf import PdfFileReader
if (fileExtension == ".PDF"):
     pdfDoc = PdfFileReader(file(FOLDER+j, "rb"))
     fileText = ""
     print("Processing a PDF file")
     for pdfpage in range(0,pdfDoc.getNumPages()):
           fileText = fileText + pdfDoc.getPage(pdfpage).extractText()
           fileText = cleantext(fileText)
           fileText = fileText.splitlines(True)
else:
     print("PLEASE CHOOSE A .PDF FILE")

It raises this particular error for any PDF file. However, when I run the code line by line, it does seem to work. So if I first run

      for pdfpage in range(0,pdfDoc.getNumPages()):
           fileText = fileText + pdfDoc.getPage(pdfpage).extractText()

then the next line, and then the last fileText line, it works. So what is happening that I cannot see?

PRIME
  • Could you elaborate a bit more? What error is shown, and what do you mean by "running the code line by line"? – thomaux Jun 20 '17 at 09:55
  • Error is in the header. themiurge below has suggested an answer, but it's not complete as I want fileText to work – PRIME Jun 20 '17 at 09:59

1 Answer

After reading the first page, fileText is indeed a list, because that's what splitlines returns. When reading the second page, you add its full text to fileText (which is now a list). Hence the error: you cannot concatenate a string (pdfDoc.getPage(pdfpage).extractText()) to a list.
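
The failure can be reproduced without a PDF at all. The following is a minimal sketch, written for Python 3, in which the pages list is made-up sample data standing in for what extractText() would return for each page; it is not the original script.

```python
# Hypothetical sample data standing in for the text of two PDF pages.
pages = ["first line\nsecond line\n", "third line\n"]

file_text = ""
error_message = None

for page in pages:
    try:
        # Pass 1: str + str works. Pass 2: file_text is a list, so this raises.
        file_text = file_text + page
    except TypeError as exc:
        error_message = str(exc)
        break
    # splitlines() returns a list, so file_text stops being a string here.
    file_text = file_text.splitlines(True)

print(type(file_text).__name__)  # the type left behind by the first pass
print(error_message)
```

The loop only gets as far as the second page: by then the first pass has already turned file_text into a list.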

If you just need a list of lines, I suggest you rework your code like this:

from pyPdf import PdfFileReader
fileText = []
if (fileExtension == ".PDF"):
    pdfDoc = PdfFileReader(file(FOLDER+j, "rb"))
    print("Processing a PDF file")
    for pdfpage in range(0,pdfDoc.getNumPages()):
        pageText = pdfDoc.getPage(pdfpage).extractText()
        pageText = cleantext(pageText)
        fileText.append(pageText.splitlines(True))
else:
    print("PLEASE CHOOSE A .PDF FILE")

This stores the lines of every page in fileText for later use, as one list of lines per page.
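
If a single flat list of lines is what you need, one option is to use extend instead of append, or to flatten the nested result afterwards. A minimal sketch, assuming the page texts are plain strings; the function names and sample data are hypothetical, and the call to cleantext is left out because its definition is not shown:

```python
from itertools import chain


def lines_per_page(page_texts):
    """Mirror the append-based loop: one inner list of lines per page."""
    result = []
    for text in page_texts:
        result.append(text.splitlines(True))
    return result


def flat_lines(page_texts):
    """Use extend so every line lands directly in one flat list."""
    result = []
    for text in page_texts:
        result.extend(text.splitlines(True))
    return result


sample = ["a\nb\n", "c\n"]  # hypothetical stand-in for two pages of text
nested = lines_per_page(sample)
flat = flat_lines(sample)
# Flattening the nested list afterwards gives the same flat list.
flattened = list(chain.from_iterable(nested))
```

Which variant fits depends on whether the page boundaries matter later on; the flat form is the simpler input for a line-by-line comparison.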

By the way, when you run the code line-by-line it works because these two lines are executed outside the for loop:

fileText = cleantext(fileText)
fileText = fileText.splitlines(True)

This is the equivalent of what happens if you execute line-by-line as you described (notice the indentation):

from pyPdf import PdfFileReader
if (fileExtension == ".PDF"):
    pdfDoc = PdfFileReader(file(FOLDER+j, "rb"))
    fileText = ""
    print("Processing a PDF file")
    for pdfpage in range(0,pdfDoc.getNumPages()):
        fileText = fileText + pdfDoc.getPage(pdfpage).extractText()
    fileText = cleantext(fileText)
    fileText = fileText.splitlines(True)
else:
    print("PLEASE CHOOSE A .PDF FILE")
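
Stripped of the PDF handling, the line-by-line run amounts to the pattern below: concatenate every page first, then clean and split once. This is only a sketch for Python 3; clean is a placeholder for the asker's cleantext, whose definition is not shown, and the sample strings are invented.

```python
def lines_after_loop(page_texts, clean=lambda text: text):
    """Join all page texts, clean the result once, then split it into lines.

    `clean` is a hypothetical stand-in for cleantext; by default it
    returns the text unchanged.
    """
    text = ""
    for page in page_texts:
        text = text + page  # text stays a string for the whole loop
    text = clean(text)
    return text.splitlines(True)  # converted to a list only once, at the end


sample = ["a\nb\n", "c\n"]  # hypothetical stand-in for two pages of text
all_lines = lines_after_loop(sample)
```

One difference from splitting page by page: if a page's text does not end with a newline, concatenating first merges its last line with the first line of the next page.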
themiurge
  • Thanks, however, I need the fileText for other purposes afterwards. For example when I loop two directories to make this: d.make_file(fileText, fileText2). And if I store it in fileLines, then I cannot use it like that. – PRIME Jun 20 '17 at 09:58
  • What do you need to be stored in fileText? Full text or list of lines? – themiurge Jun 20 '17 at 09:58
  • a list of lines for comparison purposes – PRIME Jun 20 '17 at 10:00
  • Thanks to your answer I figured it out! You have to unnest it afterwards, though, if you want to use it in a loop: if (fileExtension == ".pdf"): fileText = list(chain.from_iterable(fileText)) – PRIME Jun 20 '17 at 10:40