I tried to parse a PDF file with PyPDF2, but I only retrieve about 10% of the text. For the remaining 90%, PyPDF2 returns only newlines, which is a bit frustrating.
Do you know of any alternatives for Python on Windows? I've heard of pdftotext, but it seems I can't install it because my computer doesn't run Linux.
Any ideas?
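import PyPDF2
filename = 'Doc.pdf'
# Open in binary mode; the with block closes the file once the page is read.
with open(filename, 'rb') as f:
    pdf_file = PyPDF2.PdfFileReader(f)
    # Print the text extracted from the first page (index 0).
    print(pdf_file.getPage(0).extractText())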
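
To check whether the missing text is spread across the whole document or limited to certain pages, a rough sketch like the one below could help. It assumes the same pre-3.0 PyPDF2 API as the snippet above (PdfFileReader, getNumPages, getPage, extractText). The helper names blank_page_ratio and extract_page_texts are made up for illustration and are not part of PyPDF2.

```python
def blank_page_ratio(page_texts):
    """Return the fraction of pages whose extracted text is empty or whitespace."""
    texts = list(page_texts)
    if not texts:
        return 0.0
    # Treat None, '' and whitespace-only strings (e.g. only newlines) as blank.
    blank = sum(1 for text in texts if not (text or "").strip())
    return blank / len(texts)


def extract_page_texts(filename):
    """Extract the text of every page (assumes the pre-3.0 PyPDF2 API)."""
    # Imported here so blank_page_ratio can be used without PyPDF2 installed.
    import PyPDF2

    with open(filename, 'rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        return [reader.getPage(i).extractText()
                for i in range(reader.getNumPages())]


# Hypothetical usage with the file from the question:
# print(blank_page_ratio(extract_page_texts('Doc.pdf')))
```

If only some pages come back blank, those pages may be scanned images or use fonts that the extractor cannot map, but that is a guess rather than something this check can confirm.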