Chinese character recognition using Tesseract OCR

Question

I have been using the Tesseract 3.0.2 OCR SDK to extract text from images. When I pass an image containing Chinese text through the OCR, Tesseract does not return the Chinese characters; instead I get digits and English letters. I need the Chinese characters as they appear in the image.

How can I achieve this? Is there any way I can obtain Chinese characters rather than any other characters?

Alok Singh · Accepted Answer · 2016-02-24T08:51:48.403

21

You need to download the Chinese trained data (a file such as chi_sim.traineddata) and add it to your tessdata folder.

Download the file from https://github.com/tesseract-ocr/tessdata/raw/master/chi_sim.traineddata

and use it like this:

Tesseract *tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"chi_sim"];

If you run into any problems, you can download my experiment with Tesseract (with Chinese language support) from https://github.com/aryansbtloe/ExperimentWithTesseract.git

I have tested this one. Hope you find it useful.
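A common reason for still getting digits and English letters is that the language file never made it into the tessdata folder. Here is a rough Python sketch for checking that before running OCR. The helper names (traineddata_path, missing_languages, fetch_traineddata) are made up for illustration, and the URL pattern is simply the link above with the language code swapped in, so treat that generalization as an assumption.

```python
import urllib.request
from pathlib import Path

# URL pattern copied from the link in the answer. Assumption: other language
# codes follow the same pattern, and the file matches your Tesseract version.
TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata/raw/master/{lang}.traineddata"


def traineddata_path(tessdata_dir, lang="chi_sim"):
    """Return where <lang>.traineddata is expected inside the tessdata folder."""
    return Path(tessdata_dir) / f"{lang}.traineddata"


def missing_languages(tessdata_dir, langs):
    """List the language codes whose .traineddata file is not in tessdata_dir."""
    return [lang for lang in langs if not traineddata_path(tessdata_dir, lang).is_file()]


def fetch_traineddata(tessdata_dir, lang="chi_sim"):
    """Download <lang>.traineddata into tessdata_dir unless it is already there."""
    target = traineddata_path(tessdata_dir, lang)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.is_file():
        urllib.request.urlretrieve(TESSDATA_URL.format(lang=lang), str(target))
    return target
```

For example, missing_languages("tessdata", ["chi_sim"]) returns an empty list once chi_sim.traineddata is in place, and fetch_traineddata("tessdata") only touches the network when the file is absent.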

edited Feb 24 '16 at 08:51

answered May 16 '13 at 08:43


Alok, I tried your sample and it works well on about half of the simplified Chinese characters I tried. For the rest, it either splits a compound character into several separate characters, one per component, or gets it completely wrong. Do you know of any way to improve the recognition accuracy? – CodeBrew Jun 14 '14 at 22:11

The new trained data link is https://github.com/tesseract-ocr/tessdata/raw/master/chi_sim.traineddata – Régis B. Feb 19 '16 at 16:38
Download the installer from github.com/UB-Mannheim/tesseract/wiki so that you get a tessdata folder (in addition to running pip install pytesseract). – Mark K May 16 '20 at 08:47
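For anyone taking the pytesseract route mentioned in the comment above, here is a minimal sketch of what the call might look like. It assumes pytesseract and Pillow are installed, the tesseract binary is on your PATH, and chi_sim.traineddata is in the tessdata folder. The function names tessdata_config and recognize_chinese are hypothetical; the lang and config arguments and the --tessdata-dir option are the ones pytesseract passes through to tesseract.

```python
def tessdata_config(tessdata_dir):
    """Build the --tessdata-dir option that pytesseract forwards to tesseract."""
    return f'--tessdata-dir "{tessdata_dir}"'


def recognize_chinese(image_path, tessdata_dir=None, lang="chi_sim"):
    """Run OCR on image_path with the given language (simplified Chinese by default)."""
    # Imported lazily so tessdata_config stays usable without these packages.
    import pytesseract
    from PIL import Image

    # Only pass --tessdata-dir when a custom folder is given; otherwise
    # tesseract falls back to its default tessdata location.
    config = tessdata_config(tessdata_dir) if tessdata_dir else ""
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang, config=config)
```

Calling recognize_chinese("sample.png") with a placeholder image path should then return Chinese text instead of digits and Latin letters, provided the language file is found.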
