Can I use OCR to detect font style (bold, italic)?

Question

I am interested in using OCR to extract the bold and italic words from simple text. For example, if I input a clear image of text like this:

"The quick brown fox jumps over the lazy dog."

I would like to get an output like so: bold("brown", "jumps"), italic("lazy")

I have looked into doing this with OCRopus or Tesseract, but the documentation is poor and I can't tell if it's possible, or how to do it if it is.

I would suggest you try ABBYY Cloud OCR. Please see my answer: https://stackoverflow.com/a/63098644/2598453 – Milan Hlinák, Nov 26 '20 at 22:55

score 14 · Accepted Answer · edited Jul 24 '17 at 09:41

Tesseract 3.01 (from trunk) has such a function. A new class, ResultIterator, was added to the API, and it has the function you are interested in:

 const char* WordFontAttributes(bool* is_bold,
                                bool* is_italic,
                                bool* is_underlined,
                                bool* is_monospace,
                                bool* is_serif,
                                bool* is_smallcaps,
                                int* pointsize,
                                int* font_id) const;

You can see it for yourself in the Tesseract source (api/resultiterator.h).
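
For readers who would rather not call the C++ API directly, here is a minimal Python sketch of how the per-word flags could be turned into the output the question asks for. It assumes the third-party tesserocr binding, whose word iterator is assumed to expose WordFontAttributes() as a dict with "bold" and "italic" keys; that binding is not mentioned in this thread, so check its documentation before relying on it. The helper names collect_styles, format_styles and words_with_styles are hypothetical. As noted in the comments on this answer, the attributes are reported to be available only in Tesseract 3 and not very reliable there.

```python
def collect_styles(words):
    """Split (text, is_bold, is_italic) triples into bold and italic word lists."""
    bold = [text for text, is_bold, _ in words if is_bold]
    italic = [text for text, _, is_italic in words if is_italic]
    return bold, italic


def format_styles(bold, italic):
    """Render the lists in the shape used in the question."""
    def quote(ws):
        return ", ".join('"%s"' % w for w in ws)
    return "bold(%s), italic(%s)" % (quote(bold), quote(italic))


def words_with_styles(image_path):
    """Yield (text, is_bold, is_italic) for each recognized word.

    ASSUMPTION: uses the tesserocr binding; the dict keys below are assumed,
    not taken from this thread.
    """
    # Imported lazily so the pure helpers above work without tesserocr installed.
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    with PyTessBaseAPI() as api:
        api.SetImageFile(image_path)
        api.Recognize()
        for it in iterate_level(api.GetIterator(), RIL.WORD):
            attrs = it.WordFontAttributes() or {}
            yield (it.GetUTF8Text(RIL.WORD),
                   attrs.get("bold", False),
                   attrs.get("italic", False))
```

Usage would look roughly like format_styles(*collect_styles(words_with_styles("page.png"))), where page.png is a placeholder path.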

edited Jul 24 '17 at 09:41

Vishal Gupta

answered Mar 07 '11 at 11:49

zkunov

New url: https://github.com/tesseract-ocr/tesseract/blob/3.01/api/resultiterator.h#L95 – Daniel P Dec 28 '15 at 22:44
This is only available in Tesseract 3, and unfortunately it's not very reliable there. – John Sims Aug 09 '23 at 20:20

score 3 · Answer 2 · answered May 14 '11 at 23:46

The XML-based hOCR output format of Tesseract 3.0x includes character attributes. You may want to try that:

http://code.google.com/p/tesseract-ocr/issues/detail?id=377#c5
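
If you go the hOCR route, the styled words can be pulled out with a small parser. The sketch below uses only the Python standard library and assumes that bold words are wrapped in <strong> and italic words in <em> inside the word spans; that markup is an assumption about the hOCR output, so inspect a real file from your Tesseract version first. The SAMPLE string is hand-written for illustration and is not real Tesseract output, and the names StyledWordParser and styled_words are hypothetical.

```python
from html.parser import HTMLParser


class StyledWordParser(HTMLParser):
    """Collect text found inside <strong> and <em> elements."""

    def __init__(self):
        super().__init__()
        self.depth = {"strong": 0, "em": 0}  # current nesting depth per tag
        self.bold, self.italic = [], []

    def handle_starttag(self, tag, attrs):
        if tag in self.depth:
            self.depth[tag] += 1

    def handle_endtag(self, tag):
        if tag in self.depth and self.depth[tag] > 0:
            self.depth[tag] -= 1

    def handle_data(self, data):
        word = data.strip()
        if not word:
            return  # skip whitespace between word spans
        if self.depth["strong"]:
            self.bold.append(word)
        if self.depth["em"]:
            self.italic.append(word)


def styled_words(hocr):
    """Return (bold_words, italic_words) found in an hOCR string."""
    parser = StyledWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.bold, parser.italic


# Hand-written sample (ASSUMED markup), not real Tesseract output.
SAMPLE = (
    "<span class='ocr_line'>"
    "<span class='ocrx_word'>quick</span> "
    "<span class='ocrx_word'><strong>brown</strong></span> "
    "<span class='ocrx_word'>fox</span> "
    "<span class='ocrx_word'><strong>jumps</strong></span> "
    "<span class='ocrx_word'><em>lazy</em></span>"
    "</span>"
)
print(styled_words(SAMPLE))  # prints (['brown', 'jumps'], ['lazy'])
```

A word that is both bold and italic would land in both lists, which matches the bold(...), italic(...) output shape in the question.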

answered May 14 '11 at 23:46

nguyenq
