Extract text from an image in Java using the Tika library

Question

I need to extract text from an image, so I tried a few OCR libraries.

None of them worked, so I moved to Apache Tika.

In Apache Tika, I tried both ImageParser and JpegParser. They return the file metadata but not the text in my image file.

Did you [try reading the Apache Tika documentation on performing OCR](https://wiki.apache.org/tika/TikaOCR)? If yes, where did you get stuck? If not, why not? And what happens when you do? – Gagravarr, Apr 16 '16 at 18:21
Yes, I read the Tika documentation, and the code setup is working fine. The JPEG parser returns text from some images, but not from the one I need to extract text from. – Ajay Yadav, Apr 17 '16 at 03:38

score 3 · Answer 1 · answered Apr 25 '16 at 19:38

You can also run Tika from the command line. Run it on just the images you want to perform OCR on:

java -jar ./tika-app/target/tika-app-1.13-SNAPSHOT.jar -t ~/Desktop/tess.png

Tika uses Tesseract internally to perform OCR, so you need to have it installed and on your PATH.
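Since the OCR step depends on Tesseract being reachable through PATH, a quick check from Java can rule that out as the cause. The following is a minimal sketch using only the standard library; the class name `TesseractOnPath` and the helper `findOnPath` are made up for illustration and are not part of Tika.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class TesseractOnPath {

    /**
     * Returns the first executable file called `name` found in the
     * directories of a PATH-style string, or empty if there is none.
     */
    static Optional<Path> findOnPath(String pathVar, String name) {
        if (pathVar == null || pathVar.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathVar.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty PATH entries
            }
            Path candidate = Paths.get(dir, name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumption: the binary is named "tesseract" ("tesseract.exe" on Windows).
        boolean windows = System.getProperty("os.name").toLowerCase().contains("win");
        String exe = windows ? "tesseract.exe" : "tesseract";

        Optional<Path> found = findOnPath(System.getenv("PATH"), exe);
        System.out.println(found
                .map(p -> "tesseract found at " + p)
                .orElse("tesseract not found on PATH; OCR will probably not run"));
    }
}
```

If this reports that Tesseract is missing, that would be consistent with getting file metadata but no text back from image files.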

score 1 · Answer 2 · edited May 23 '17 at 12:06

For image processing, Tesseract is the best API, and it can be used from Java, so give it a try. You can find more details here.

answered Apr 16 '16 at 10:11 · edited May 23 '17 at 12:06 by Community

I am using Tesseract on Linux. It can extract text from the image, but it misses some characters and reads others as special characters. – Ajay Yadav Apr 17 '16 at 04:04
You can improve accuracy with a whitelist of characters, as described in http://pretius.com/using-tesseract-ocr-to-extract-scanned-invoice-data-in-java-application/ – Balayesu Chilakalapudi Apr 17 '16 at 07:48
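As a rough illustration of the whitelist idea, the sketch below calls the `tesseract` command-line tool from Java and passes the `tessedit_char_whitelist` variable through `-c`. It is not taken from the linked article. The class name, the helper names, and the example whitelist are assumptions, and depending on the Tesseract version and OCR engine mode the whitelist may be ignored.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TesseractWhitelist {

    /** Builds: tesseract <image> stdout -l <language> [-c tessedit_char_whitelist=<chars>] */
    static List<String> buildCommand(String imagePath, String language, String whitelist) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imagePath);
        cmd.add("stdout"); // write recognized text to standard output
        cmd.add("-l");
        cmd.add(language);
        if (whitelist != null && !whitelist.isEmpty()) {
            cmd.add("-c");
            cmd.add("tessedit_char_whitelist=" + whitelist);
        }
        return cmd;
    }

    /** Runs the command and returns whatever it printed to standard output. */
    static String runOcr(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show tesseract warnings on the console
                .start();
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: java TesseractWhitelist <image-file>");
            return;
        }
        // Hypothetical whitelist: digits plus a few separators, e.g. for invoice numbers.
        List<String> cmd = buildCommand(args[0], "eng", "0123456789.,-/");
        System.out.print(runOcr(cmd));
    }
}
```

Restricting the character set tends to help most when the expected text is narrow, such as numbers or dates; for free-form text it can just as easily drop valid characters.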
