Tesseract - How to extract text from the image for the input coordinates?

Question

I need to pass an image and a set of coordinates as input, and get back the text located at those coordinates. How can I do this using node-tesseract?

Pang Ho Ming · Answer 1 · 2016-12-19T04:29:10.570

5

You need to look into the hOCR output that Tesseract can produce (search for "hOCR" for background first). It records a bounding box for each recognized element such as a line or word, written as "bbox x0 y0 x1 y1" in pixel coordinates, along with other properties. You can then work out which boxes are located inside the coordinates you get as input.

Reference: http://gamemath.com/2011/09/detecting-whether-two-boxes-overlap/
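The box filtering described above can be sketched as follows. This is a minimal illustration, not node-tesseract's API: it assumes you already have the hOCR output as a string, that word elements look like Tesseract's usual `<span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>`, and that the class attribute comes before the title attribute. The names parseHocrWords, contains, overlaps and textInRegion are made up for this example, and a real HTML parser would be more robust than a regular expression.

```javascript
// Matches one hOCR word span and captures its title attribute and inner text.
// Assumption: class appears before title, as in typical Tesseract output.
const WORD_RE =
  /<span[^>]*class=['"]ocrx_word['"][^>]*title=['"]([^'"]*)['"][^>]*>([\s\S]*?)<\/span>/g;

// Returns [{ text, x0, y0, x1, y1 }, ...] for every word with a bbox.
function parseHocrWords(hocr) {
  const words = [];
  for (const match of hocr.matchAll(WORD_RE)) {
    const bbox = /bbox (\d+) (\d+) (\d+) (\d+)/.exec(match[1]);
    if (!bbox) continue; // skip words without a bounding box
    const [x0, y0, x1, y1] = bbox.slice(1).map(Number);
    // Drop nested tags such as <strong> and surrounding whitespace.
    const text = match[2].replace(/<[^>]+>/g, '').trim();
    words.push({ text, x0, y0, x1, y1 });
  }
  return words;
}

// True if the word box lies entirely inside the region.
function contains(region, word) {
  return word.x0 >= region.x0 && word.y0 >= region.y0 &&
         word.x1 <= region.x1 && word.y1 <= region.y1;
}

// True if the word box and the region share any area.
function overlaps(region, word) {
  return word.x0 < region.x1 && word.x1 > region.x0 &&
         word.y0 < region.y1 && word.y1 > region.y0;
}

// Joins the text of all words selected by the given box test.
function textInRegion(hocr, region, test = contains) {
  return parseHocrWords(hocr)
    .filter((word) => test(region, word))
    .map((word) => word.text)
    .join(' ');
}

// Hand-written sample in the shape of Tesseract's hOCR output.
const sampleHocr = `
<span class='ocr_line' id='line_1_1' title='bbox 36 92 580 116'>
  <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Invoice</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 110 92 180 116; x_wconf 93'>Number</span>
  <span class='ocrx_word' id='word_1_3' title='bbox 400 92 580 116; x_wconf 90'>12345</span>
</span>`;

console.log(textInRegion(sampleHocr, { x0: 0, y0: 80, x1: 200, y1: 130 }));
```

Use contains when a word must lie fully inside the input rectangle, and pass overlaps as the third argument when partially covered words should count too.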

Update:

I did some research for you. Here is a GitHub search for Tesseract repositories written in JavaScript, where I take "best" to mean most stars:

https://github.com/search?utf8=✓&q=tesseract+language%3Ajavascript

The best one is tesseract.js, with over 10,000 stars and recent commits:

https://github.com/naptha/tesseract.js

The part to look at is the hOCR output (tesseract.js calls it html).

edited Dec 19 '16 at 04:29

answered Dec 19 '16 at 03:54


Is the .hocr file also available when using node-tesseract (the package from npm)? How do I access it? – Amy Dec 19 '16 at 04:19
I've updated my answer. I don't write Node.js or use node-tesseract, so I can't give you an answer on this. – Pang Ho Ming Dec 19 '16 at 04:36

score 0 · Answer 2 · answered May 22 '17 at 12:44

I know this is an old thread, but I had the same requirement and couldn't find a solution, so I modified the module and posted it on GitHub:

https://github.com/desmondmorris/node-tesseract/issues/46

SPlatten

