
I'm looking for a way to extract text and the position of that text from a PDF with a "text layer". My goal is to show a PDF with the extracted text as a layer and to have the user select certain lines as areas of interest.

pdftotext only gives me the text row by row, without any position information. I looked at TET from PDFlib, but there is no trial version, and the libraries don't seem to be actively maintained anymore.

The program should work on Linux.
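To make the kind of output wanted here concrete, below is a minimal sketch, not a confirmed solution. It assumes the installed pdftotext is the Poppler build and that it supports the `-bbox` option (worth checking with `pdftotext -h`), which writes an XHTML file containing one `<word>` element per word with `xMin`, `yMin`, `xMax` and `yMax` attributes. The `SAMPLE` string is a hand-written stand-in for that output, and the helper names `extract_words` and `group_into_lines` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for what `pdftotext -bbox in.pdf out.html` is assumed
# to produce; real output would be read from the generated file instead.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<doc>
  <page width="595.0" height="842.0">
    <word xMin="56.8" yMin="57.2" xMax="90.1" yMax="68.3">Hello</word>
    <word xMin="93.4" yMin="57.2" xMax="130.0" yMax="68.3">world</word>
    <word xMin="56.8" yMin="80.0" xMax="88.0" yMax="91.1">Second</word>
  </page>
</doc>
</body>
</html>"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}word' -> 'word'."""
    return tag.rsplit("}", 1)[-1]


def extract_words(xhtml):
    """Return one dict per <word> element with its page number and box."""
    root = ET.fromstring(xhtml)
    words = []
    page_no = 0
    # iter() walks in document order, so each <page> is seen before its words.
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "page":
            page_no += 1
        elif name == "word":
            words.append({
                "page": page_no,
                "text": elem.text or "",
                "x_min": float(elem.get("xMin")),
                "y_min": float(elem.get("yMin")),
                "x_max": float(elem.get("xMax")),
                "y_max": float(elem.get("yMax")),
            })
    return words


def group_into_lines(words, tolerance=2.0):
    """Merge words on the same page whose top edges differ by <= tolerance."""
    lines = []
    for w in sorted(words, key=lambda w: (w["page"], w["y_min"], w["x_min"])):
        last = lines[-1] if lines else None
        same_line = (
            last is not None
            and last["page"] == w["page"]
            and abs(last["y_min"] - w["y_min"]) <= tolerance
        )
        if same_line:
            # Extend the current line's text and bounding box.
            last["text"] += " " + w["text"]
            last["x_min"] = min(last["x_min"], w["x_min"])
            last["x_max"] = max(last["x_max"], w["x_max"])
            last["y_max"] = max(last["y_max"], w["y_max"])
        else:
            lines.append(dict(w))
    return lines


if __name__ == "__main__":
    for line in group_into_lines(extract_words(SAMPLE)):
        print(line["page"], line["x_min"], line["y_min"], line["text"])
```

The line grouping is deliberately naive (a fixed tolerance on the top edge). Rotated text, multi-column layouts, or baselines that drift within a line would need something more careful before the boxes could be drawn as a selectable overlay.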

  • Great question! Did you ever find an answer? :\ – jtlz2 Sep 17 '19 at 10:57
  • Making *some* progress - https://stackoverflow.com/a/49306240/1021819 -> https://github.com/tesseract-ocr/tesseract/issues/1769 - but what OCR engine are you in fact using? – jtlz2 Sep 17 '19 at 11:09
  • It was a very generic question, because I couldn't find a library etc. already providing this. – Moritz Schroeder Oct 07 '19 at 17:47
