can only concatenate list (not "unicode") to list

Question

I copy-pasted some Lorem Ipsum into a Word .docx file, saved it as a PDF, and tried to run the following script to test extracting text from a PDF.

from pyPdf import PdfFileReader
if (fileExtension == ".PDF"):
     pdfDoc = PdfFileReader(file(FOLDER+j, "rb"))
     fileText = ""
     print("Processing a PDF file")
     for pdfpage in range(0,pdfDoc.getNumPages()):
           fileText = fileText + pdfDoc.getPage(pdfpage).extractText()
           fileText = cleantext(fileText)
           fileText = fileText.splitlines(True)
else:
     print("PLEASE CHOOSE A .PDF FILE")

It raises the error in the title for any PDF file. However, when I run the code line by line, it does seem to work. So if I first run

      for pdfpage in range(0,pdfDoc.getNumPages()):
           fileText = fileText + pdfDoc.getPage(pdfpage).extractText()

then the next line, then the last fileText line, it works. What is happening that I cannot see?

Could you elaborate a bit more? What error is showing, what do you mean with "running the code line by line"? — thomaux, Jun 20 '17 at 09:55
Error is in the header. themiurge below has suggested an answer, but it's not complete as I want fileText to work — PRIME, Jun 20 '17 at 09:59

themiurge · Accepted Answer · 2017-06-20T10:06:22.313

After reading the first page fileText is indeed a list, because that's what splitlines returns. When reading the second page, you add its full text to fileText (which is now a list). Hence the error: you cannot concatenate a string (pdfDoc.getPage(pdfpage).extractText()) to a list.
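To see the failure in isolation, here is a minimal sketch that does not need pyPdf. The `pages` list is a hypothetical stand-in for what `extractText()` returns per page, and the snippet is written for Python 3, where the message names "str" rather than "unicode":

```python
# Hypothetical stand-ins for the text extractText() returns for each page.
pages = ["first page\nline two\n", "second page\n"]

file_text = ""
error_message = None
try:
    for page_text in pages:
        # First pass: str + str works. Second pass: list + str raises TypeError.
        file_text = file_text + page_text
        # After this line file_text is a list of lines, not a string.
        file_text = file_text.splitlines(True)
except TypeError as exc:
    error_message = str(exc)

print(type(file_text).__name__)
print(error_message)
```

The loop gets through the first page, turns `file_text` into a list, and then fails on the second page when it tries to add a string to that list.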

If you just need a list of lines, I suggest you rework your code like this:

from pyPdf import PdfFileReader
fileText = []
if (fileExtension == ".PDF"):
    pdfDoc = PdfFileReader(file(FOLDER+j, "rb"))
    print("Processing a PDF file")
    for pdfpage in range(0,pdfDoc.getNumPages()):
        pageText = pdfDoc.getPage(pdfpage).extractText()
        pageText = cleantext(pageText)
        fileText.append(pageText.splitlines(True))
else:
    print("PLEASE CHOOSE A .PDF FILE")

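This stores all lines in fileText for later use. Note that append adds each page's lines as a separate sub-list, so fileText ends up as a list of lists (one per page); if you want a single flat list of lines, use fileText.extend(...) instead of fileText.append(...).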
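As a rough illustration of the difference between the nested and the flat result, the sketch below again uses a hypothetical `pages` list in place of real PDF output. It shows `append`, `extend`, and flattening the nested list afterwards with `itertools.chain.from_iterable`:

```python
from itertools import chain

# Hypothetical stand-ins for the text extractText() returns for each page.
pages = ["first page\nline two\n", "second page\n"]

nested = []  # one sub-list of lines per page
flat = []    # all lines in a single list
for page_text in pages:
    lines = page_text.splitlines(True)
    nested.append(lines)  # adds the whole list as one element
    flat.extend(lines)    # adds each line as its own element

# Flatten the nested list after the fact.
unnested = list(chain.from_iterable(nested))

print(nested)
print(flat)
print(unnested == flat)
```

Either route should give the same flat list; `extend` just avoids the extra unnesting step.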

By the way, when you run the code line-by-line it works because these two lines are executed outside the for loop:

fileText = cleantext(fileText)
fileText = fileText.splitlines(True)

This is the equivalent of what happens if you execute line-by-line as you described (notice the indentation):

from pyPdf import PdfFileReader
if (fileExtension == ".PDF"):
    pdfDoc = PdfFileReader(file(FOLDER+j, "rb"))
    fileText = ""
    print("Processing a PDF file")
    for pdfpage in range(0,pdfDoc.getNumPages()):
        fileText = fileText + pdfDoc.getPage(pdfpage).extractText()
    fileText = cleantext(fileText)
    fileText = fileText.splitlines(True)
else:
    print("PLEASE CHOOSE A .PDF FILE")

Thanks, however, I need the fileText for other purposes afterwards. For example when I loop two directories to make this: d.make_file(fileText, fileText2). And if I store it in fileLines, then I cannot use it like that. — PRIME, Jun 20 '17 at 09:58
What do you need to be stored in fileText? Full text or list of lines? — themiurge, Jun 20 '17 at 09:58
Thanks to your answer I figured it out! You have to unnest it afterwards, though, if you want to use it in a loop: if (fileExtension == ".pdf"): fileText = list(chain.from_iterable(fileText)) — PRIME, Jun 20 '17 at 10:40
