
Is there a way to write an XSL-FO stylesheet that takes hOCR output generated by Tesseract as input and produces a PDF with searchable text?

mzjn
Qsebas
  • hOCR is a variant of XHTML, isn't it? So yes, it should be doable with a bit of XSLT. But perhaps hocr-pdf (https://github.com/tmbdev/hocr-tools#hocr-pdf) is easier. Have you tried anything? – mzjn Jul 05 '17 at 13:24
  • I'm surprised I couldn't find an existing solution on Google; I suppose it is a common issue. – Qsebas Jul 05 '17 at 17:39
  • hocr-pdf is my fallback plan. However, I'm already using FOP + XSL-FO + XML to produce the PDF as a sequence of images, and that proved to be 10 times faster than building the PDF directly in Python in a way similar to what hocr-pdf does. So I'm going to spend a couple more days trying to solve it with XSL-FO. – Qsebas Jul 05 '17 at 17:42
  • Tesseract can produce a PDF directly: `tesseract -l lang inputname outputname pdf` – zuphilip Jul 06 '17 at 05:14
  • Thanks for the suggestion, but it is not exactly what I need: I have pages stored together with their corresponding hOCR (I want to allow editing the hOCR in the future), and the user can select several pages and produce a PDF from them. – Qsebas Jul 07 '17 at 16:44
  • Right now I'm quite far along with the XSL-FO; as soon as I finish it I will post it here. – Qsebas Jul 07 '17 at 16:45
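
As a rough illustration of the XSL-FO route discussed in the comments, the sketch below (Python, not the asker's stylesheet) reads word boxes from hOCR and emits absolutely positioned `fo:block-container` elements. It assumes Tesseract-style `ocrx_word` elements whose `title` attribute contains `bbox x0 y0 x1 y1` in pixels, a 300 dpi scan, and input that parses as well-formed XML. The function names `hocr_words` and `words_to_fo` are hypothetical. Wrapping the blocks in a full `fo:root` document and layering the page image over the text is left out.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# hOCR stores the box as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def hocr_words(hocr_xml):
    """Yield (text, x0, y0, x1, y1) in pixels for every ocrx_word element."""
    root = ET.fromstring(hocr_xml)
    for el in root.iter():
        # Match on the class attribute so the XHTML namespace does not matter.
        if el.get("class") != "ocrx_word":
            continue
        match = BBOX_RE.search(el.get("title", ""))
        text = "".join(el.itertext()).strip()
        if match and text:
            yield (text, *map(int, match.groups()))


def words_to_fo(hocr_xml, dpi=300):
    """Return one absolutely positioned fo:block-container per word.

    Assumption: the scan resolution is `dpi`, so pixels convert to
    points with a factor of 72 / dpi.
    """
    scale = 72.0 / dpi
    blocks = []
    for text, x0, y0, x1, y1 in hocr_words(hocr_xml):
        width = (x1 - x0) * scale
        height = (y1 - y0) * scale
        blocks.append(
            '<fo:block-container absolute-position="absolute" '
            f'left="{x0 * scale:.2f}pt" top="{y0 * scale:.2f}pt" '
            f'width="{width:.2f}pt" height="{height:.2f}pt">'
            # Using the box height as font size is a crude approximation.
            f'<fo:block font-size="{height:.2f}pt">{escape(text)}</fo:block>'
            "</fo:block-container>"
        )
    return "\n".join(blocks)
```

The returned fragment would still need to be placed inside an `fo:flow` of a page whose size matches the scanned image, next to the image itself. How to keep the text searchable but not visible depends on the formatter and is not covered here.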

0 Answers