
I am trying to extract vertical text elements from a PDF with PDFMiner.six. To do this I pass the parameters detect_vertical=True and all_texts=True. The vertical elements are detected, but the spaces between the words are missing. An extracted vertical text element looks like this:

Itisalongestablishedfactthatareaderwillbedistractedbythereadablecontentofapagewhenlookingatits layout.

I have tried two solutions that also address the extraction of vertical text elements with PDFMiner.six, but neither of them worked for me.

Code from second solution:

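from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

rsrcmgr = PDFResourceManager()
laparams = LAParams(detect_vertical=True, all_texts=True)
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)  # fp: the PDF file opened in binary mode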

Detecting vertical text elements (not just text content) with pdfminer.six

Python PdfMiner - How to get the info on the orientation of each word/sentence included in a pdf?

Is it possible to extract vertical text elements with PDFMiner.six, or does it not work with all PDFs? Textract extracts the vertical text elements correctly, but for my project I need PDFMiner.six.
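A possible workaround, as a minimal sketch under assumptions: rebuild the spaces from the character positions instead of relying on the text of the vertical line. The assumption (not verified against the PDF in question) is that the vertical lines are rotated text that reads bottom-to-top, so the gap between two words shows up as a jump in the y coordinates of consecutive characters. The helper names join_vertical_glyphs and vertical_lines_with_spaces are hypothetical, and the 0.1 threshold is only a starting guess that mirrors the default word_margin of LAParams.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# One glyph of a vertical line: (text, y0, y1, size), in the order it was emitted.
Glyph = Tuple[str, float, float, float]


def join_vertical_glyphs(glyphs: Iterable[Glyph], word_margin: float = 0.1) -> str:
    """Join glyphs, inserting a space wherever the gap along y is large enough."""
    out: List[str] = []
    prev: Optional[Glyph] = None
    for text, y0, y1, size in glyphs:
        if prev is not None:
            _, py0, py1, psize = prev
            # Distance between the two boxes along y, for either reading direction.
            gap = max(y0 - py1, py0 - y1)
            if gap > word_margin * max(size, psize):
                out.append(" ")
        out.append(text)
        prev = (text, y0, y1, size)
    return "".join(out)


def vertical_lines_with_spaces(path: str, word_margin: float = 0.1) -> Iterator[str]:
    """Yield the text of each top-level vertical line with spaces rebuilt."""
    # Imported here so that the helper above also works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTChar, LTTextBoxVertical, LTTextLineVertical

    laparams = LAParams(detect_vertical=True, all_texts=True)
    for page in extract_pages(path, laparams=laparams):
        # Only top-level boxes are visited; boxes nested in figures are skipped.
        for box in page:
            if not isinstance(box, LTTextBoxVertical):
                continue
            for line in box:
                if not isinstance(line, LTTextLineVertical):
                    continue
                glyphs = [
                    (c.get_text(), c.y0, c.y1, max(c.width, c.height))
                    for c in line
                    if isinstance(c, LTChar)
                ]
                yield join_vertical_glyphs(glyphs, word_margin)
```

Calling vertical_lines_with_spaces on the path of the PDF and printing each yielded string should show whether the gaps are recoverable this way. If words are still fused or split, the word_margin value is the first thing to adjust.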

sal nixon
  • The problem is that a PDF page has several vertical and horizontal text elements. I cannot rotate the text elements individually; I need some form of automation. I extract text from scientific publications, which are a very demanding type of PDF. – sal nixon Feb 16 '23 at 18:40
  • The PDF from which I extracted the text is available at the following link: https://link.springer.com/content/pdf/10.1007/s00125-016-3902-y.pdf?pdf=button Pages 10 and 11 consist almost entirely of vertical text elements. PDFMiner.six outputs their text without spaces. – sal nixon Feb 21 '23 at 15:17
