
I am trying to perform OCR with Tesseract. Converting a PDF to text with the Tesseract Java library works as expected. My requirements have now expanded a bit: I need to extract metadata from a template-based form (for example a passport, where fields such as first name and date of birth sit at fixed positions). The input can be either a PDF or an image of the same template form.

I am having a hard time finding any example or article that shows how to achieve this with Tesseract.

So my basic questions are:

  1. Is this possible using Tesseract?
  2. Are there any examples or articles about how to achieve this using Tesseract?
  3. Is there any other software or library that is recommended for this?

Thanks for reading this.

Vishal Zanzrukia
  • I'm not sure what exactly you have an issue with. If you have a template - you can extract ROI which you will feed to tesseract, problem solved? – Dmitrii Z. Oct 19 '18 at 06:16
  • Hi @DmitriiZ: Thanks for your reply. To answer your first question, I am not clear on how to approach the above scenario. How can I feed an ROI to Tesseract? Any examples or reference links would be helpful. Thanks again. – Vishal Zanzrukia Oct 19 '18 at 12:41
  • ROI is region of interest. If you have a template, then you know the coordinates of the fields (say, name is in (x, y, width, height)), so you can crop this field from your original image, making a smaller image that contains only the data you need to OCR. Cropping can be done with a variety of software; see this [IM example](http://www.imagemagick.org/Usage/crop/). – Dmitrii Z. Oct 19 '18 at 12:46
  • I'm not sure which wrapper you're using, but in the C++ API you can also use the `SetRectangle` function to set the ROI (see the C++ [API Examples](https://github.com/tesseract-ocr/tesseract/wiki/APIExample)). If you look through your Java wrapper's documentation, you will probably find something similar. – Dmitrii Z. Oct 19 '18 at 12:49
  • I am a Java guy, so I am using a Java API wrapper. – Vishal Zanzrukia Oct 19 '18 at 12:49
  • Hi @DmitriiZ: Thanks for the direction. I think that's useful for now. I will try it. – Vishal Zanzrukia Oct 19 '18 at 12:53
  • Hi @DmitriiZ: In the case of a multi-page PDF, will the rectangle work for only the first page, or for the following pages as well? – Vishal Zanzrukia Oct 19 '18 at 12:55
  • I think that would only work for the first page. – Dmitrii Z. Oct 19 '18 at 12:57
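
The cropping approach described in the comments could be sketched in Java roughly as follows. This is a minimal sketch, not a tested recipe for any particular wrapper: the class name `TemplateExtractor`, its methods, and the field names are hypothetical, and the OCR step is passed in as a plain `Function` so that the example depends only on the JDK. It also assumes that each page of a multi-page PDF has already been rendered to a `BufferedImage` (for instance with a PDF rendering library, not shown), so the same template rectangles can be applied to every page rather than only the first.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: crops fixed template regions (ROIs) and OCRs each crop.
class TemplateExtractor {

    // Extracts every template field from a single page image.
    // `template` maps a field name to its (x, y, width, height) on the page.
    // `ocr` is whatever turns an image into text (assumed to wrap an OCR engine).
    static Map<String, String> extractFields(
            BufferedImage page,
            Map<String, Rectangle> template,
            Function<BufferedImage, String> ocr) {
        Rectangle bounds = new Rectangle(0, 0, page.getWidth(), page.getHeight());
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Rectangle> field : template.entrySet()) {
            // Clip the ROI to the page so getSubimage cannot go out of bounds.
            Rectangle roi = field.getValue().intersection(bounds);
            if (roi.isEmpty()) {
                result.put(field.getKey(), "");
                continue;
            }
            BufferedImage crop = page.getSubimage(roi.x, roi.y, roi.width, roi.height);
            result.put(field.getKey(), ocr.apply(crop).trim());
        }
        return result;
    }

    // Applies the same template to every page image, one result map per page.
    static List<Map<String, String>> extractAllPages(
            List<BufferedImage> pages,
            Map<String, Rectangle> template,
            Function<BufferedImage, String> ocr) {
        List<Map<String, String>> results = new ArrayList<>();
        for (BufferedImage page : pages) {
            results.add(extractFields(page, template, ocr));
        }
        return results;
    }
}
```

With a Java wrapper for Tesseract, the `ocr` argument would typically be a lambda that calls the wrapper's recognition method on the cropped image and converts any checked exception; the exact call depends on the wrapper in use, so check its documentation. The field coordinates have to be measured once from a sample of the form, at the same resolution the pages are rendered at.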

0 Answers