
The PDF contains data laid out line by line, followed by a table whose header row sits above the corresponding values. I can extract the line-by-line data and associate each heading with its value, but I cannot do the same for the table: instead of getting the cells in order, I get all the column headers one after another as plain text.

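from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox

# my_file is the path to the PDF
fp = open(my_file, "rb")
parser = PDFParser(fp)
document = PDFDocument(parser)
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

rsrcmgr = PDFResourceManager()
laparams = LAParams()
laparams.line_margin = 12
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

extracted_text = ""  # must exist before the += below
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    layout = device.get_result()
    # Collect the text of every text box found on the page
    for lt_obj in layout:
        if isinstance(lt_obj, LTTextBox):
            extracted_text += lt_obj.get_text()
fp.close()

print(extracted_text)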
Has QUIT--Anony-Mousse
senor elanza

1 Answer

A PDF does not store its text in any guaranteed reading order (although in practice the order is usually not completely random).

You will need to find the headers and then deduce each row's content from the X,Y position of the text.
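As a rough illustration of that idea, here is a minimal sketch in plain Python. It assumes you have already collected each piece of text with its position as an (x0, y0, text) tuple, for example from the x0 and y0 attributes and get_text() of the text lines inside each LTTextBox. The function name table_rows, the y_tol tolerance and the sample coordinates are all made up for the example, and it assumes each value starts roughly at the same X as its header, which will not hold for every PDF.

```python
def table_rows(fragments, headers, y_tol=3.0):
    """Group (x0, y0, text) fragments into one dict per table row.

    Hypothetical helper: assumes values are roughly left-aligned
    under their header and that PDF y coordinates grow upward.
    """
    # Find each header and remember its left edge as the column position.
    columns = {}
    header_y = None
    for x0, y0, text in fragments:
        name = text.strip()
        if name in headers and name not in columns:
            columns[name] = x0
            header_y = y0 if header_y is None else min(header_y, y0)
    if not columns:
        return []

    # Keep only non-empty fragments that lie below the header line.
    body = [(x0, y0, text.strip()) for x0, y0, text in fragments
            if y0 < header_y - y_tol and text.strip()]
    # Sort top to bottom, then left to right.
    body.sort(key=lambda f: (-f[1], f[0]))

    rows, row_ys = [], []
    for x0, y0, text in body:
        # Assign the fragment to the header whose x position is nearest.
        name = min(columns, key=lambda c: abs(columns[c] - x0))
        # Start a new row when the y position moves past the tolerance.
        if not row_ys or abs(row_ys[-1] - y0) > y_tol:
            rows.append({})
            row_ys.append(y0)
        rows[-1][name] = text
    return rows


# Made-up positions for a three-column table with a title above it.
fragments = [
    (50, 740, "Invoice 17"),
    (50, 700, "Name"), (200, 700, "Qty"), (300, 700, "Price"),
    (50, 680, "Widget"), (201, 680, "2"), (299, 680, "9.50"),
    (50, 660, "Gadget"), (200, 661, "1"), (300, 660, "4.00"),
]
rows = table_rows(fragments, {"Name", "Qty", "Price"})
for row in rows:
    print(row)
```

With a large line_margin, pdfminer tends to merge many lines into one text box, so you would probably want to work with the individual lines rather than whole boxes before applying something like this. Right-aligned or centred columns would need a different matching rule, such as comparing the horizontal midpoints instead of x0.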

zmbq