Python - PyPDF2 misses large chunk of text. Any alternative on Windows?

Question

I tried to parse a PDF file with PyPDF2, but I only retrieve about 10% of the text. For the remaining 90%, PyPDF2 returns only newlines, which is a bit frustrating.

Do you know of any alternatives for Python on Windows? I've heard of pdftotext, but it seems I can't install it because my computer does not run Linux.

Any idea?

import PyPDF2

filename = 'Doc.pdf'
with open(filename, 'rb') as f:
    pdf_file = PyPDF2.PdfFileReader(f)
    # Print whatever text PyPDF2 extracts from the first page
    print(pdf_file.getPage(0).extractText())

Without an example PDF file, this is nearly impossible to debug. It would also be helpful to know if the exact same code works as expected on most PDF files but just fails on this one particular file, or if it's failing on most files you throw at it. — abarnert, Apr 29 '18 at 20:34
See for instance https://www.groupe-casino.fr/fr/wp-content/uploads/sites/5/2018/04/2018-04-17-CP-Chiffre-d-affaires-T1-2018.pdf It's something that I have noticed on many files — Shimuno, Apr 29 '18 at 21:34

score 1 · Answer 1 · answered May 08 '18 at 20:19

Try PyMuPDF. The following example simply prints out the text it finds. The library also allows you to get the position of the text if that would help you.

#!python3.6
import json

import fitz  # PyMuPDF: http://pymupdf.readthedocs.io/en/latest/


pdf = fitz.open('2018-04-17-CP-Chiffre-d-affaires-T1-2018.pdf')
for page_index in range(pdf.pageCount):
    # Request the page content as JSON and parse it into a dict
    text = json.loads(pdf.getPageText(page_index, output='json'))
    for block in text['blocks']:
        if 'lines' not in block:
            # Skip blocks without text
            continue
        for line in block['lines']:
            for span in line['spans']:
                # Printing UTF-8 bytes shows non-ASCII characters as escapes
                print(span['text'].encode('utf-8'))
pdf.close()
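
To illustrate the point about text positions, here is a minimal sketch. It assumes a recent PyMuPDF release, where the methods use snake_case names (`page.get_text("dict")` rather than `getPageText`) and the document object can be used in a `with` block. The helper names `iter_spans` and `extract_spans` are made up for this example. Each span in the dictionary output carries a `bbox` entry with its rectangle on the page.

```python
def iter_spans(page_dict):
    """Yield (bbox, text) for every text span in a PyMuPDF page dictionary."""
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they contribute nothing.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield tuple(span["bbox"]), span["text"]


def extract_spans(path):
    """Return one list of (bbox, text) pairs per page of the PDF at `path`."""
    import fitz  # PyMuPDF, imported here so iter_spans works without it

    with fitz.open(path) as pdf:
        return [list(iter_spans(page.get_text("dict"))) for page in pdf]
```

Calling `extract_spans('2018-04-17-CP-Chiffre-d-affaires-T1-2018.pdf')` should give, for each page, the spans together with their `(x0, y0, x1, y1)` rectangles, which you can then sort or filter by position.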
