I have a simple problem: I'm trying to detect vertical text elements with pdfminer.six. I can read vertical text without any trouble using a snippet like this:

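from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('../example_files/example1.pdf', 'rb') as infi:
    parser = PDFParser(infi)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams(detect_vertical=True, all_texts=True))
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
print(output_string.getvalue())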

However, whenever I try to use PDFPageAggregator instead of TextConverter so that I can get the objects, like so:

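from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

with open('../example_files/example1.pdf', 'rb') as infi:
    parser = PDFParser(infi)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams(detect_vertical=True, all_texts=True))
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()
        # Only the top-level elements of the page are visited here.
        for element in layout:
            print(element)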

I capture the horizontal text boxes (as well as lines, rects, etc.), but not the vertical text. Is there a way to capture the vertical text at the object-hierarchy level so that I can inspect its position?

gammapoint
  • Could you share the PDF? – hellpanderr Feb 15 '22 at 17:09
  • Sorry, I'm not allowed to share the PDF that I'm working with for legal reasons. – gammapoint Feb 15 '22 at 17:42
  • I was hoping that perhaps there was something incorrect with my code above that could be pointed out, but if it's a nuance of my PDF (and my code is good) then I could just forget about this specific PDF and work with the others. I'll check this but was surprised that I can extract the vertical text with TextConverter but not as an element in the second code snippet. – gammapoint Feb 15 '22 at 18:19

1 Answer

It took me a while to figure this out, but the key was realizing that text elements can be children of LTImage objects. I hadn't realized that, so I didn't know I needed to recursively iterate over the children of those objects to find everything.
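For anyone landing here, a minimal sketch of that recursive walk follows. The names `walk_layout` and `vertical_text` are my own, not part of pdfminer.six, and the sketch assumes the `extract_pages` helper from `pdfminer.high_level` is available in your version. It descends into anything iterable rather than checking for one specific class, because as far as I can tell the iterable container for embedded content in pdfminer.six is `LTFigure`, while `LTImage` itself is a leaf.

```python
from collections.abc import Iterable


def walk_layout(element):
    """Yield element and, depth first, everything nested inside it."""
    yield element
    # Containers (pages, figures, text boxes, text lines) are iterable;
    # leaf objects (characters, rects, images) are not.
    if isinstance(element, Iterable):
        for child in element:
            yield from walk_layout(child)


def vertical_text(pdf_path):
    """Yield (page_number, bbox, text) for each vertical text line found."""
    # Imported here so walk_layout itself does not depend on pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextLineVertical

    laparams = LAParams(detect_vertical=True, all_texts=True)
    pages = extract_pages(pdf_path, laparams=laparams)
    for page_number, page in enumerate(pages, start=1):
        for element in walk_layout(page):
            if isinstance(element, LTTextLineVertical):
                yield page_number, element.bbox, element.get_text()
```

Calling `list(vertical_text('../example_files/example1.pdf'))` should then give the position of each vertical line as an `(x0, y0, x1, y1)` bounding box, wherever that line sits in the hierarchy.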

gammapoint