language detection

Question

I am using Tesseract for OCR, mainly on invoices. However, Tesseract requires the language to be specified before it starts processing a file.

My plan is to run OCR with a predefined default language first, then use the resulting text to check which language the document is actually written in. If it is not the default language, I would process the file again with the detected language to get a better result from Tesseract.

But how can I implement a language detection algorithm? Is there a C++ library I could use?
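To make the intended workflow concrete, here is a minimal sketch of the two-pass idea in C++. The names `OcrFn`, `DetectFn`, `OcrResult` and `ocrWithLanguageCheck` are hypothetical and not part of Tesseract or any library. The OCR call and the detector are passed in as callbacks, so this only shows the control flow and makes no assumption about either API.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical callback types: in a real program, OcrFn would wrap the
// Tesseract call and DetectFn would wrap a language identification library.
using OcrFn = std::function<std::string(const std::string& imagePath,
                                        const std::string& lang)>;
using DetectFn = std::function<std::string(const std::string& text)>;

struct OcrResult {
    std::string text;  // final recognized text
    std::string lang;  // language code used for the final pass
    int passes;        // number of OCR passes that were run (1 or 2)
};

// Run OCR with the default language, detect the language of the output,
// and run OCR again only if the detector reports a different language.
OcrResult ocrWithLanguageCheck(const std::string& imagePath,
                               const std::string& defaultLang,
                               const OcrFn& ocr,
                               const DetectFn& detect) {
    assert(ocr && detect);  // both callbacks must be set

    std::string text = ocr(imagePath, defaultLang);
    std::string detected = detect(text);

    // An empty result means "could not decide": keep the first pass.
    if (detected.empty() || detected == defaultLang) {
        return {text, defaultLang, 1};
    }
    return {ocr(imagePath, detected), detected, 2};
}
```

The second pass only pays off if the detector is robust against the noisy text produced by running OCR with the wrong language, so that is the part worth evaluating on real invoices.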

Abhishek Jain · Answer 1 · 2012-10-09T07:47:54.627

I am not sure whether this helps, since the library is written in Java, but it can detect around 50 languages from a given text with a pretty good precision level. It is open source, so if your application has to be written in C++ only, you could port the code to C++ and contribute the port back to the open source community.

Here is the link:

http://code.google.com/p/language-detection/

Note: It uses the Apache Nutch and Tika libraries for analysis.
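If you just want to see what detecting a language from text looks like in C++ before porting anything, here is a toy sketch that counts common stopwords per language. This is not how the linked library works internally. It is a deliberately simple stand-in, and the word lists, the Tesseract-style language codes and the names `StopwordTable`, `kDefaultStopwords` and `detectLanguage` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>

using StopwordTable = std::map<std::string, std::set<std::string>>;

// Toy ASCII-only stopword lists, chosen for illustration. A real detector
// would use much larger statistical models.
static const StopwordTable kDefaultStopwords = {
    {"eng", {"the", "and", "of", "to", "is", "for", "with", "total"}},
    {"deu", {"der", "die", "und", "das", "ist", "mit", "von", "nicht"}},
    {"fra", {"le", "la", "et", "les", "des", "est", "pour", "avec"}},
};

// Returns the language code with the most stopword hits, or an empty
// string when nothing matches or the top score is tied.
std::string detectLanguage(const std::string& text,
                           const StopwordTable& table = kDefaultStopwords) {
    assert(!table.empty());  // need at least one language profile

    std::map<std::string, int> hits;
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        // Keep ASCII letters only, lowercased; digits and punctuation drop out.
        std::string cleaned;
        for (unsigned char c : word) {
            if (std::isalpha(c)) cleaned += static_cast<char>(std::tolower(c));
        }
        if (cleaned.empty()) continue;
        for (const auto& entry : table) {
            if (entry.second.count(cleaned)) ++hits[entry.first];
        }
    }

    std::string best;
    int bestCount = 0;
    bool tie = false;
    for (const auto& entry : hits) {
        if (entry.second > bestCount) {
            best = entry.first;
            bestCount = entry.second;
            tie = false;
        } else if (entry.second == bestCount) {
            tie = true;
        }
    }
    return tie ? std::string() : best;
}
```

On short or badly recognized OCR output this will often return an empty string, which is one reason to prefer a proper library for production use.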

score 3 · Accepted Answer · answered Nov 18 '11 at 02:38

The paper "Natural Language Identification for OCR Applications" describes techniques for identification tasks similar to yours.

nguyenq

score 0 · Answer 3 · answered Jan 25 '18 at 17:35

You might want to read my paper "The WiLI benchmark dataset for written language identification" and try lidtk.

TL;DR: Give CLD-2 a try.

Martin Thoma
