The extractText() function does not return text

Question

import PyPDF2

pdfFileObject = open('MDD.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())

Above is my code. When I run the script, it just outputs a bunch of numbers and numerals, not the text of the file. Could anyone help me with that?

Could you please add the output of your code? Also, is it possible to upload MDD.pdf somewhere? Maybe your PDF is the problem, as PyPDF2 and the extractText method work perfectly fine for me. — Jürgen Gmach, Jan 26 '20 at 15:30
You should use a context manager to handle file objects. Also, variable and function names should follow the `lower_case_with_underscores` style. — AMC, Jan 26 '20 at 21:22

score 1 · Answer 1 · answered Jan 26 '20 at 15:40

This function doesn't work for all PDF files. This is explained in the documentation:

This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated. :return: a unicode string object.

Try your code on this file. I'm sure it should work, so it seems that the problem is not in your code.

If you really need to parse files that are created the same way as your original MDD.pdf, you will have to choose another library.
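As a rough illustration of what switching libraries could look like, here is a minimal sketch. It keeps the PyPDF2 calls from the question (the older `PdfFileReader` API, so it assumes a PyPDF2 version that still provides it), checks whether the result looks like readable text, and otherwise falls back to pdfminer.six and its `extract_text` helper. The choice of pdfminer.six is only one possibility, and `looks_like_prose`, `extract_with_fallback`, and the 0.5 threshold are hypothetical names and values made up for this example, not part of either library.

```python
def looks_like_prose(text, min_letter_ratio=0.5):
    """Hypothetical heuristic: True if at least min_letter_ratio of the
    non-whitespace characters are letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    letters = sum(c.isalpha() for c in chars)
    return letters / len(chars) >= min_letter_ratio


def extract_with_fallback(path):
    """Try PyPDF2 first; fall back to pdfminer.six if the output looks garbled."""
    import PyPDF2  # imported lazily so the heuristic can be used on its own

    # A context manager closes the file even if extraction raises.
    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        pages = [reader.getPage(i).extractText() for i in range(reader.numPages)]
    text = "\n".join(pages)
    if looks_like_prose(text):
        return text

    # Assumption: pdfminer.six is installed (pip install pdfminer.six).
    from pdfminer.high_level import extract_text
    return extract_text(path)
```

Calling `print(extract_with_fallback('MDD.pdf'))` would then print whichever result passes the check. There is no guarantee that the fallback handles a given file better; that still depends on how the PDF was generated.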

Which library do you suggest? — danited, Jan 26 '20 at 15:45
