I am trying to extract vertical text elements from a PDF with PDFMiner.six. For this I use the parameters detect_vertical=True and all_texts=True.
The vertical elements are detected, but unfortunately the spaces between the words are missing. An extracted vertical text element looks like this:
Itisalongestablishedfactthatareaderwillbedistractedbythereadablecontentofapagewhenlookingatits layout.
I have tried two solutions that also address the extraction of vertical text elements with PDFMiner.six, but neither of them worked for me.
Code from the second solution:
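from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

rsrcmgr = PDFResourceManager()
laparams = LAParams(detect_vertical=True, all_texts=True)  # enable vertical text detection
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)  # fp: the PDF file opened in binary mode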
The two solutions I tried come from these questions:
Detecting vertical text elements (not just text content) with pdfminer.six
Python PdfMiner - How to get the info on the orientation of each word/sentence included in a pdf?
Is it possible to extract vertical text elements, including the spaces between words, with PDFMiner.six, or does this not work with all PDFs? Textract extracts the vertical text elements correctly, but for my project I need PDFMiner.six.
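One possible fallback, sketched here under several assumptions, is to rebuild the spaces from the character positions: walk the characters of a vertical line from top to bottom and insert a space wherever the vertical gap between two neighbours is large relative to the character height. This is roughly the kind of decision the word_margin setting of LAParams is meant to govern. The Glyph tuple, the join_vertical function and the gap_ratio threshold are hypothetical names for illustration and not part of PDFMiner.six; with the real library the values would presumably be read from each LTChar (get_text(), y0, y1). The sketch also assumes text that reads top to bottom; rotated text that reads bottom to top would need the opposite sort order.

```python
from typing import List, NamedTuple, Optional


class Glyph(NamedTuple):
    """Hypothetical stand-in for the fields read from a pdfminer LTChar."""
    text: str
    y0: float  # bottom edge of the character box
    y1: float  # top edge of the character box


def join_vertical(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Join top-to-bottom glyphs, adding a space where the vertical gap is large."""
    # Highest character first, because PDF y coordinates grow upwards.
    ordered = sorted(glyphs, key=lambda g: g.y1, reverse=True)
    parts: List[str] = []
    prev: Optional[Glyph] = None
    for g in ordered:
        if prev is not None:
            gap = prev.y0 - g.y1  # empty space between the two boxes
            if gap > gap_ratio * (g.y1 - g.y0):
                parts.append(" ")
        parts.append(g.text)
        prev = g
    return "".join(parts)


# Synthetic column: "It", then a larger gap, then "is".
sample = [
    Glyph("I", 90.0, 100.0),
    Glyph("t", 80.0, 90.0),
    Glyph("i", 65.0, 75.0),
    Glyph("s", 55.0, 65.0),
]
print(join_vertical(sample))
```

The threshold of 0.3 is an arbitrary starting value and would likely need tuning per document, since the gap between words depends on the font and on how the PDF was produced.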