How to extract corresponding column data from a PDF

Question

The PDF contains data laid out line by line, followed by a table with column headings and the corresponding values below them. I can extract the line-by-line data and associate each heading with its value, but I cannot do the same for the table: instead of an orderly result, I get all the column headers one after another as plain text.

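from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

extracted_text = ""
# my_file is the path to the PDF; keep the file open while pages are parsed.
with open(my_file, "rb") as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser)
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed

    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    laparams.line_margin = 12
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        # Collect the text of every text box found on the page.
        for lt_obj in layout:
            if isinstance(lt_obj, LTTextBox):
                extracted_text += lt_obj.get_text()

print(extracted_text)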

score 1 · Answer 1 · answered Dec 31 '17 at 20:02

PDFs are not laid out in any specific order (although usually the order is not totally random).

You will need to find the headers and then deduce the rows' content from the X,Y position of the text.
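
As a rough illustration of that idea, here is a minimal sketch, not a definitive implementation. It assumes you have already collected each piece of text with its bounding box as an `(x0, y0, x1, y1, text)` tuple (in pdfminer, layout objects such as `LTTextLine` expose a `bbox` attribute and `get_text()`), and that you have identified which boxes are the headers. The function name `rows_from_boxes`, the `y_tolerance` parameter, the "nearest horizontal centre" rule, and the sample coordinates are all assumptions made for this example.

```python
def rows_from_boxes(headers, cells, y_tolerance=3.0):
    """Group cells into rows and label each cell with its nearest header.

    headers, cells: lists of (x0, y0, x1, y1, text) tuples, with the
    bounding box in PDF coordinates (y grows upward from the page bottom).
    """
    def centre_x(box):
        return (box[0] + box[2]) / 2.0

    # Top of the page first (largest y0), then left to right.
    ordered = sorted(cells, key=lambda c: (-c[1], c[0]))

    rows = []  # each entry: [row_y, {header_text: cell_text}]
    for cell in ordered:
        # Assumption: a cell belongs to the header whose horizontal
        # centre is closest to the cell's own centre.
        header = min(headers, key=lambda h: abs(centre_x(h) - centre_x(cell)))
        key, text, y0 = header[4].strip(), cell[4].strip(), cell[1]
        if rows and abs(rows[-1][0] - y0) <= y_tolerance:
            rows[-1][1][key] = text          # same row as the previous cell
        else:
            rows.append([y0, {key: text}])   # start a new row
    return [values for _, values in rows]


# Made-up coordinates standing in for boxes read from a real page.
headers = [
    (50, 700, 90, 712, "Name"),
    (200, 700, 230, 712, "Qty"),
    (350, 700, 390, 712, "Price"),
]
cells = [
    (50, 680, 100, 692, "Widget"),
    (200, 680, 210, 692, "2"),
    (350, 680.5, 380, 692.5, "9.99"),
    (50, 660, 95, 672, "Gadget"),
    (350, 660, 380, 672, "4.50"),
    (200, 660, 210, 672, "1"),
]
table = rows_from_boxes(headers, cells)
for row in table:
    print(row)
```

Each printed row is a dictionary keyed by header text. Matching on horizontal centres is only one possible rule; for left-aligned or right-aligned columns, comparing `x0` or `x1` instead may work better, and `y_tolerance` usually needs tuning per document.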

answered Dec 31 '17 at 20:02

zmbq

How do I do that? There is no well-defined documentation for pdfminer. – senor elanza Jan 01 '18 at 02:36