I am trying to extract information from a PDF with PDFMiner in a consistent way so that I can analyze it further, but I can't figure out how to extract tabular data correctly. PDFMiner seems to extract columns before rows. Has anyone solved this problem, or does anyone know a way to extract rows first? I tried extracting to HTML but ran into the same problem. Any help is greatly appreciated.

Image from the actual PDF: (image not available)

Image from the extracted version: (image not available)

The code I used for the extraction is below:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
try:
    from cStringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
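    # Feed every page through the interpreter; the converter writes into retstr.
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    # Read the buffer once after the loop. Appending retstr.getvalue() on every
    # page would repeat the text of all earlier pages.
    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text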

test1 = convert_pdf_to_txt(r"C:\Users\User\Documents\Contract\Dental\Certificate - Dental - Assurant - 2010.pdf")
Yaegz
  • Since you get the same result with other software, that tells us PDFMiner extracts text *in the order it appears in the file*. (That may even be mentioned in its documentation.) So look for position-dependent text extraction, for this particular PDF. – Jongware Oct 17 '15 at 17:59
  • @Jongware I have only tried using PDFMiner for the extraction. I will look for position-dependent pdf extraction although I am not sure it exists. – Yaegz Oct 17 '15 at 18:01
  • It does exist (if only because I wrote a program to do exactly that myself). Look for code that can extract x and y positions along with the text, then sort that any way you like. – Jongware Oct 17 '15 at 18:07
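
For anyone following up on the suggestion in the comments, here is a rough sketch of position-based extraction. It is not Jongware's program and not an official PDFMiner recipe. The function names `group_into_rows` and `extract_positioned_lines` and the `y_tolerance` default are made up for illustration, and the tolerance will likely need tuning for a given PDF. The idea is to collect each text line with its bounding box through `PDFPageAggregator`, then group lines with nearly equal y values into one row and sort each row by x.

```python
def group_into_rows(items, y_tolerance=3.0):
    """Group (x, y, text) tuples into rows: top to bottom, left to right.

    An item joins the current row when its y differs from the y of the
    row's first item by at most y_tolerance (an assumed default).
    """
    rows = []
    # PDF y coordinates grow upward, so sort by descending y to read top-down.
    for x, y, text in sorted(items, key=lambda item: -item[1]):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the cells by their x position.
    return [
        [text for _, text in sorted(row["cells"], key=lambda cell: cell[0])]
        for row in rows
    ]


def extract_positioned_lines(path):
    """Return one list of (x0, y0, text) tuples per page of the PDF."""
    # Imported here so group_into_rows can be used without pdfminer installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    pages = []
    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            layout = device.get_result()  # layout tree for this page
            items = []
            for element in layout:
                if not isinstance(element, LTTextBox):
                    continue
                for line in element:
                    if isinstance(line, LTTextLine):
                        x0, y0, _, _ = line.bbox
                        items.append((x0, y0, line.get_text().strip()))
            pages.append(items)
    device.close()
    return pages
```

A possible way to use it would be `for items in extract_positioned_lines(path): rows = group_into_rows(items)`. Grouping by the bottom edge `y0` assumes the cells of one table row share roughly the same vertical position; tables with wrapped cell text or mixed font sizes would need something smarter.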

0 Answers