
I'm using Python 3.4 on Windows 7 and trying to extract text from PDF files with PDFMiner. However, losing information was quite common in my tests: for some files it was only a matter of a few sentences, but I've also encountered cases where half of the text could not be extracted, depending on the file format. Here's my full code:

import io
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams


def convert_pdf(pdfFile, retstr):
    # Run the whole document through pdfminer3k's process_pdf and
    # write the extracted text into the retstr buffer.
    password = ''
    pagenos = set()
    maxpages = 0  # 0 means no page limit
    laparams = LAParams()
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdfFile, pagenos, maxpages=maxpages, password=password, check_extractable=True)
    device.close()
    return retstr


def extract_pdf(file_name, language):
    # Open the PDF in binary mode, extract everything into a StringIO
    # buffer, then split the result into individual lines.
    pdfFile = open(file_name, 'rb')
    retstr = io.StringIO()
    retstr = convert_pdf(pdfFile, retstr)
    whole = retstr.getvalue()
    original_texts = whole.split('\n')
    pdfFile.close()
    return original_texts
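For completeness, this is roughly how I call it (the file name below is just a placeholder, and the language argument isn't used inside extract_pdf yet):

original_texts = extract_pdf('some_document.pdf', 'en')
print(len(original_texts), 'lines extracted')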

I wonder if there's a way to extract the full text using PDFMiner. I've heard of poppler, but I can't find out how to use it as a Python library, and I'd rather not call a command-line tool. Can anyone help?

Here's an example: [a thesis](http://arxiv.org/pdf/cs/9809110.pdf). Several paragraphs were lost when extracting with the code above. For instance, on the 2nd page I could only extract the first half of the page, up to "Pereira, Tishby, and Lee (1993)" in the middle; the output then skips straight to the next page for no apparent reason.

joe wong
  • have you tried pdfminer using python2.7? https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167 – glls Jun 16 '16 at 03:18
  • @glls I have to use Python 3.4 for business reasons. The pdfminer package used in the code above is actually pdfminer3k, the pdfminer for Python 3. But I doubt the results would differ from the ones generated from Python 2.7. – joe wong Jun 16 '16 at 03:49
  • Can you share sample documents making your issue reproducible? There are numerous PDFs in the wild which (at least partially) prevent the extraction of text, some accidentally, some intentionally. – mkl Jun 16 '16 at 09:19
  • @mkl Here's an example: [a thesis](http://arxiv.org/pdf/cs/9809110.pdf). Several paragraphs were lost when extracting using the code above. – joe wong Jun 17 '16 at 06:16
  • I added your example link to your question. Could you also indicate which paragraphs are missing so people here do not have to search? – mkl Jun 17 '16 at 08:38
  • @mkl Edited. Thanks. – joe wong Jun 17 '16 at 09:21

1 Answer


Even though the question is quite old, I will still post the results of my test here; it might be useful for someone. I just tested your file.

I was able to extract all of the text with older PDFMiner versions, up to the 20200726 release, using the code below. Change the source file path and output path accordingly.

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
import io
import os

fp = open('/Users/isanka/Downloads/9809110.pdf', 'rb')
rsrcmgr = PDFResourceManager()
retstr = io.StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

# Process every page and write each page's text to its own file.
for page_no, page in enumerate(PDFPage.get_pages(fp)):
    interpreter.process_page(page)
    data = retstr.getvalue()
    with open(os.path.join('/Users/isanka/Downloads/tmp', f'pdf page {page_no}.txt'), 'wb') as file:
        file.write(data.encode('utf-8'))
    # Reset the buffer so the next page starts clean.
    retstr.truncate(0)
    retstr.seek(0)

device.close()
fp.close()
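
For what it's worth, with a newer pdfminer.six release the same extraction can also be done through the high-level helper. This is just a minimal sketch using the same paths as above (the output file name is only an example):

from pdfminer.high_level import extract_text

# Extract the whole document in one call; an LAParams instance can be
# passed via the laparams argument if layout analysis needs tuning.
text = extract_text('/Users/isanka/Downloads/9809110.pdf')
with open('/Users/isanka/Downloads/tmp/full_text.txt', 'w', encoding='utf-8') as out:
    out.write(text)
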
Isanka Wijerathne