
I have tesseract-ocr v3.02.02 installed on Windows 7 and have used it from the command line:

1) Output PNG text to a text file: `tesseract image.png txtfile`

2) Output PNG text to an HTML file: `tesseract image.png htmlfile hocr`

I need it to mark up any italic text in the output text or HTML file. How do I do this (preferably on the command line, as I have never used it in API mode)?

user2417713

1 Answer


The hOCR output produced by Tesseract includes only the word coordinates and confidence values, not font-related information. To get italic markup in command-line mode you would therefore have to modify the source code to output what you want; otherwise, use its API.
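As a rough illustration of the API route, here is a minimal sketch of the markup step only. It assumes you have already collected each word together with an italic flag, for example from the `is_italic` out-parameter of `ResultIterator::WordFontAttributes` while iterating at word level. The `Word` struct and the `MarkUpItalics` helper are hypothetical names made up for this sketch, not part of Tesseract, and how reliable the italic flag is for a given image is something you would need to check yourself.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One recognized word plus the italic flag reported for it.
// Hypothetical struct for this sketch, not a Tesseract type.
// In a real program the flag would be filled from the is_italic
// out-parameter of tesseract::ResultIterator::WordFontAttributes
// while iterating at RIL_WORD level; that part is omitted here
// because it needs the Tesseract and Leptonica libraries.
struct Word {
    std::string text;
    bool italic;
};

// Joins the words with single spaces and wraps each run of
// consecutive italic words in <i>...</i>.
// Note: the word text is not HTML-escaped in this sketch.
std::string MarkUpItalics(const std::vector<Word>& words) {
    std::string out;
    bool in_italic = false;
    for (std::size_t i = 0; i < words.size(); ++i) {
        const Word& w = words[i];
        // Close the open tag before the separating space.
        if (in_italic && !w.italic) {
            out += "</i>";
            in_italic = false;
        }
        if (i > 0) out += ' ';
        // Open a tag at the first word of an italic run.
        if (!in_italic && w.italic) {
            out += "<i>";
            in_italic = true;
        }
        out += w.text;
    }
    // Close a run that extends to the last word.
    if (in_italic) out += "</i>";
    return out;
}
```

Grouping consecutive italic words into one tag keeps the output readable. The same loop could emit any other marker (for example `*...*`) if you want plain text rather than HTML.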

nguyenq
  • Thank you for that. I would be grateful if you could point out which file I need to edit, with some example code to output the italic text markup. Alternatively, could you suggest how I might achieve this via its API? I am not familiar with the Tesseract API. I am familiar with PHP / JavaScript, but have not done much with C / C++. – user2417713 Sep 26 '14 at 18:05
  • You will need to visit the [Tesseract site](https://code.google.com/p/tesseract-ocr/) and forum for that information. Read the [API examples](https://code.google.com/p/tesseract-ocr/wiki/APIExample) for usage of the `ResultIterator` class, and check the Issues page for problems related to hOCR to find the classes/files responsible for hOCR output. – nguyenq Sep 27 '14 at 02:51