I have been using the Tesseract 3.0.2 OCR SDK to extract text from images. When I pass an image containing Chinese text through the OCR, Tesseract does not return Chinese characters; instead I get digits and English letters. I need the Chinese characters exactly as they appear in the image.

How can I achieve this? Is there a way to get Chinese characters in the output instead of other characters?

piet.t
Nishant Tyagi

1 Answer

You need to download the Chinese trained data (a file such as chi_sim.traineddata for Simplified Chinese) and add it to your tessdata folder.

Download the file from https://github.com/tesseract-ocr/tessdata/raw/master/chi_sim.traineddata

and then use it like this:

Tesseract *tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"chi_sim"];
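
For context, here is a minimal sketch of how that initializer might fit into a full recognition call. It assumes the project uses the tesseract-ios Objective-C wrapper (which the initWithDataPath:language: initializer suggests) and that chi_sim.traineddata sits in a tessdata folder bundled with the app. The helper name RecognizeChineseText and the image name sample.png are hypothetical, chosen only for illustration.

```objectivec
#import <UIKit/UIKit.h>
#import "Tesseract.h" // assumed: header provided by the tesseract-ios wrapper

// Hypothetical helper: runs OCR on an image using the Simplified Chinese
// trained data and returns the recognized text, or nil on failure.
static NSString *RecognizeChineseText(UIImage *image) {
    // "tessdata" is the folder that must contain chi_sim.traineddata.
    Tesseract *tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata"
                                                      language:@"chi_sim"];
    [tesseract setImage:image];

    NSString *text = nil;
    if ([tesseract recognize]) {
        // recognizedText returns a new NSString, so it remains valid after clear.
        text = [tesseract recognizedText];
    }
    [tesseract clear];
    return text;
}
```

It could then be called as `NSLog(@"%@", RecognizeChineseText([UIImage imageNamed:@"sample.png"]));`. If the language string does not match the name of a .traineddata file in the tessdata folder, the Chinese model is not loaded, so that is worth checking first.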

If you run into any problems, you can download my experiment with Tesseract (with Chinese language support) from https://github.com/aryansbtloe/ExperimentWithTesseract.git

I have tested this one. I hope you find it useful.

Alok Singh
  • Alok, I tried your sample and it works well on about half of the Simplified Chinese characters I tried. For the rest, it either recognizes a compound character as several separate characters, each representing one component of the compound, or gets it totally wrong. Do you know of any method to improve the recognition accuracy? – CodeBrew Jun 14 '14 at 22:11
  • New trained data link is https://github.com/tesseract-ocr/tessdata/raw/master/chi_sim.traineddata – Régis B. Feb 19 '16 at 16:38
  • Download the installer from github.com/UB-Mannheim/tesseract/wiki to get a tessdata folder (in addition to running pip install pytesseract). – Mark K May 16 '20 at 08:47