Is there a way to generate XSL-FO from an hOCR input file?

Asked Jul 04 '17 at 18:56

Active Jul 05 '17 at 13:25

Viewed 189 times

Is there a way to create an XSL-FO document that takes hOCR output generated by Tesseract as input and produces a PDF with searchable text?

edited Jul 05 '17 at 13:25

mzjn

asked Jul 04 '17 at 18:56

Qsebas

hOCR is a variant of XHTML, isn't it? So yes, it should be doable with a bit of XSLT. But perhaps hocr-pdf (https://github.com/tmbdev/hocr-tools#hocr-pdf) is easier. Have you tried anything? – mzjn Jul 05 '17 at 13:24
I'm surprised I couldn't find an existing solution on Google; I'd have thought this is a common problem. – Qsebas Jul 05 '17 at 17:39
My fallback plan is to use hocr-pdf. However, I'm already using FOP + XSL-FO + XML to produce the PDF as a sequence of images, and that has proven to be 10 times faster than creating the PDF directly in Python in a way similar to what hocr-pdf does. So I'm going to spend a couple more days trying to solve it with XSL-FO. – Qsebas Jul 05 '17 at 17:42
Tesseract can produce a PDF directly: `tesseract -l lang inputname outputname pdf` – zuphilip Jul 06 '17 at 05:14
Thanks for the suggestion, but that is not exactly what I need: I have pages stored together with their corresponding hOCR (I want to allow editing the hOCR in the future), and the user can choose several pages and produce a PDF from them. – Qsebas Jul 07 '17 at 16:44
Right now I'm quite far along with the XSL-FO; as soon as I finish it I will post it here. – Qsebas Jul 07 '17 at 16:45
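
For illustration only, here is a minimal sketch of the hOCR-to-XSL-FO mapping discussed in the comments above. It is not the stylesheet Qsebas mentions, and it is written in Python rather than XSLT. It assumes well-formed hOCR with `ocr_page` and `ocrx_word` elements whose `title` attribute carries a pixel `bbox`, and an assumed scan resolution (`dpi`) to convert pixels to points. The names `hocr_to_fo`, `bbox_of`, and `SAMPLE_HOCR` are hypothetical. Each word becomes an absolutely positioned `fo:block-container`, emitted before the page image so that the image is painted over the text in formatters that draw later areas on top; whether that hides the text layer in FOP should be verified.

```python
import re
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Tiny hand-written hOCR sample (hypothetical, not real Tesseract output).
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" title="bbox 0 0 2480 3508; ppageno 0">
<span class="ocrx_word" title="bbox 300 600 900 700; x_wconf 96">Hello</span>
</div></body></html>"""


def bbox_of(element):
    """Return (x0, y0, x1, y1) in pixels from an hOCR title attribute, or None."""
    match = BBOX_RE.search(element.get("title", ""))
    return tuple(int(v) for v in match.groups()) if match else None


def hocr_to_fo(hocr_text, image_uri, dpi=300):
    """Build an XSL-FO string for the first ocr_page found in hocr_text."""
    ET.register_namespace("fo", FO_NS)
    scale = 72.0 / dpi  # pixels -> points; dpi is an assumed scan resolution

    def pt(pixels):
        return f"{pixels * scale:.2f}pt"

    def fo(parent, name, **attrs):
        # Keyword names use "_" where FO property names use "-".
        attrs = {key.replace("_", "-"): value for key, value in attrs.items()}
        tag = f"{{{FO_NS}}}{name}"
        if parent is None:
            return ET.Element(tag, attrs)
        return ET.SubElement(parent, tag, attrs)

    doc = ET.fromstring(hocr_text)
    page = next(e for e in doc.iter() if e.get("class") == "ocr_page")
    _, _, page_w, page_h = bbox_of(page)

    root = fo(None, "root")
    masters = fo(root, "layout-master-set")
    master = fo(masters, "simple-page-master", master_name="scan",
                page_width=pt(page_w), page_height=pt(page_h))
    fo(master, "region-body")
    sequence = fo(root, "page-sequence", master_reference="scan")
    flow = fo(sequence, "flow", flow_name="xsl-region-body")

    # Text layer: one absolutely positioned container per recognized word.
    for word in page.iter():
        box = bbox_of(word)
        text = "".join(word.itertext()).strip()
        if word.get("class") != "ocrx_word" or box is None or not text:
            continue
        x0, y0, x1, y1 = box
        container = fo(flow, "block-container", absolute_position="absolute",
                       left=pt(x0), top=pt(y0),
                       width=pt(x1 - x0), height=pt(y1 - y0))
        block = fo(container, "block", font_size=pt(y1 - y0),
                   line_height=pt(y1 - y0))
        block.text = text

    # Image layer: emitted last, covering the whole page.
    image_box = fo(flow, "block-container", absolute_position="absolute",
                   left="0pt", top="0pt", width=pt(page_w), height=pt(page_h))
    image_block = fo(image_box, "block")
    fo(image_block, "external-graphic", src=f"url({image_uri})",
       content_width=pt(page_w), content_height=pt(page_h))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(hocr_to_fo(SAMPLE_HOCR, "page1.png", dpi=300))
```

The resulting string could be saved as a `.fo` file and passed to an XSL-FO formatter such as FOP. Using the bounding-box height as the font size is a rough approximation; a real stylesheet would likely need to tune font metrics so that the selectable text lines up with the scanned words.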

0 Answers