Most accurate open-source OCR for Japanese?

Question

From your experience, what is the most accurate open-source Optical Character Recognition (OCR) library/software to read Japanese text?

I just tried nhocr; its error rate is over 2% even on an extremely clean, high-definition document.

For what it's worth, 2% isn't terrible for OCR. We struggle to get that with, uhm, Romaji. — Steven Sudit, Oct 26 '10 at 16:27
2% is for ultra-clean characters in big font. For scanned books it is much worse, let alone handwritten forms. — Nicolas Raoul, Oct 27 '10 at 05:04
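
For reference, a figure like the 2% above is commonly computed as a character error rate: the edit distance between the OCR output and a hand-checked transcription, divided by the length of the transcription. The thread does not say how the 2% was actually measured, so the following is only a minimal Python sketch under that assumption; the function names and the sample strings are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: one character misread in a short sentence.
    reference = "日本語の文字認識は難しい"
    ocr_output = "日木語の文字認識は難しい"
    print(f"{character_error_rate(reference, ocr_output):.2%}")
```

Python strings are sequences of Unicode code points, so each kanji or kana counts as one character here, which is what you want when comparing Japanese OCR engines on the same test page.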

score 6 · Accepted Answer · answered Apr 13 '10 at 13:14

Based on the lack of answers it sounds like nhocr IS the most accurate open-source OCR for Japanese.

Peter

score 3 · Answer 2 · answered Apr 04 '10 at 00:57

Haven't tried it myself, but perhaps you should take a look at Tesseract.

baol

Japanese is not available, even as a separate download: http://code.google.com/p/tesseract-ocr/downloads The readme briefly mentions that Japanese has been removed and is available somewhere, but actually it is nowhere to be found :-( http://code.google.com/p/tesseract-ocr/wiki/ReadMe On the mailing list, a user reported some success training Tesseract on 60 Japanese characters, but it is clearly experimental. In conclusion, it might be possible, but in practice nobody uses Tesseract for Japanese. – Nicolas Raoul Apr 05 '10 at 01:18
I don't know Japanese, but the fact that they had a Japanese group seemed interesting: http://groups.google.co.jp/group/tesseract-ocr/ (but on closer inspection it might just be a Japanese version of the international one; sorry if I wasted your time) – baol Apr 05 '10 at 01:52
@Nicolas I've opened issue http://code.google.com/p/tesseract-ocr/issues/detail?id=291 about the missing CJK data files – SamB Apr 09 '10 at 16:44
@SamB: Thanks! The training files for Japanese seem to be available here: http://code.google.com/p/tesseract-ocr/source/browse/trunk/tessdata/jpn.traineddata?spec=svn309&r=309 . If it is hidden so well, I guess it is not used a lot. – Nicolas Raoul Apr 12 '10 at 01:01
@baol: Indeed, if you replace .co.jp by .com, you can see that the questions/answers are the same. It is just the Google interface that is translated in Japanese. There doesn't seem to be any Tesseract Japanese community. – Nicolas Raoul Apr 12 '10 at 01:03
They seem to be available now: http://code.google.com/p/tesseract-ocr/downloads/list – aehlke Apr 01 '11 at 18:45
@Whanfrieden: Yes, I tried it, it is not too bad now! – Nicolas Raoul Apr 20 '11 at 07:13
@Nicolas Raoul how accurate is it compared to nhocr? Also, is it accurate enough to scan clean text like say text bubbles in a manga? – 0x6C38 Jul 11 '17 at 12:45
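
For anyone who wants to try the Japanese data mentioned in these comments, here is a small Python sketch that calls the `tesseract` command-line tool. It assumes a reasonably recent Tesseract is on the PATH with the `jpn` language data installed (and `jpn_vert` if you want to try vertical text such as manga bubbles); the helper names and the image path are hypothetical, and nothing here says how accurate the result will be.

```python
import shutil
import subprocess
from typing import List


def build_tesseract_command(image_path: str, lang: str = "jpn") -> List[str]:
    """Build the argument list: tesseract <image> stdout -l <lang>.

    Using "stdout" as the output base asks Tesseract to print the
    recognized text instead of writing it to a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_japanese(image_path: str, lang: str = "jpn") -> str:
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",  # decode the output explicitly as UTF-8
        check=True,        # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


if __name__ == "__main__":
    # "page.png" is a placeholder; point it at your own scan.
    print(ocr_japanese("page.png"))
```

Feeding the same test page to this and to nhocr, then comparing each output against a hand-checked transcription, would be one way to answer the accuracy question for your own material.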

score 0 · Answer 3 · answered Apr 05 '10 at 16:02

I have had some R&D experience with ABBYY's solution - FineReader Engine. It was version 8.1 at the time, and I am not up to date with their newest revisions. But at the time - it was simply the best I could find for our handheld scanner product. I highly recommend it.

BTW, you can get a free version of the ABBYY OCR package for end-users when purchasing a XEROX PE220 printer, which it comes bundled with. That printer was on my desk for several years. There must be other printers that come with it bundled. Xerox was betting on their OCR as the best as well.

Etamar Laron

FineReader is NOT open-source. And the version you were using did NOT support Japanese: http://www.abbyy.com/Default.aspx?DN=b6d671c1-6da6-4bec-8c06-0ad362f6a7e9 – Nicolas Raoul Apr 06 '10 at 05:20
Sorry, didn't see the open-source request. It is not open-source. The version I was using had CJK support (Chinese, Japanese and Korean), which is an add-on to the engine. We were using it to demonstrate our technology to South-eastern buyers. SEE AT: http://www.ocr.gr/downloads/Engine%208.1%20What's%20New.pdf (copy the URL because SO breaks it) – Etamar Laron Apr 06 '10 at 07:18
@Etamar ABBYY OCR is interesting. Do they allow integration with a custom dictionary, customizing bigrams analysis, etc.? We need to use these techniques to improve the accuracy of the OCR. – amit kumar Jul 12 '11 at 05:01
@phaedrus in short - yes. I've been working with their engine for years and could integrate just about anything I wanted. Dictionaries are a basic feature, you can customize them. Cheers for Zen and the Art. – Etamar Laron Jul 31 '11 at 18:29

score -1 · Answer 4 · answered Jun 26 '10 at 08:18

Please try WeOCR. Server version and download version are available.

kmugitani

If I understand well, WeOCR is just a Web front-end for other OCR engines. In particular, it uses nhocr for Japanese. So I guess it is not more accurate than nhocr, right? – Nicolas Raoul Jun 28 '10 at 03:23
See http://weocr.ocrgrid.org/#todo One of the TODO items is "Develop an OCR for Japanese" and it links to nhocr – Nicolas Raoul Jun 28 '10 at 03:33
Yah. That is correct. Just a couple of months ago, I tried their online server version, but it was far from accurate. Japanese cellphones, especially Sharp cellphones, have quite excellent OCR capability. But I did not find other free OCR software. Of course, Sharp does not sell their OCR software at this point. – kmugitani Jul 02 '10 at 14:14
