
I'm developing a custom search engine and I need to pass each word to the appropriate language-specific stemmer.

I've recently discovered the Compact Language Detector (CLD) http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html which provides impressive language detection for a UTF-8 buffer.

While CLD is great for detecting the language of a given buffer, I also need to extract the word boundaries from the buffer and detect the language of each of these words.

Any ideas?

2 Answers


Good luck :)

Honestly, this is an advanced NLP topic and it is very hard to do reliably.

The very first thing is, you cannot detect word boundaries in many languages just like that. In ideographic languages especially (Chinese, Japanese, ...), you need a well-trained learning algorithm for tokenization.
There are some rumors that somebody has done that (see Basis Technology), but this is only useful to you if you can afford to pay the license fee.

BTW, many words are written exactly the same way in several languages, and you won't get reliable language detection on them. To make matters worse, the algorithm (usually some n-gram based detector) needs several octets of input to detect anything (rightly or wrongly).
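
To see the short-input problem concretely, here is a toy trigram scorer with made-up two-entry profiles (real detectors such as CLD use large trained tables; nothing below reflects their actual data):

```cpp
// Toy character-trigram scorer; profiles and weights are purely illustrative.
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Collect byte-level trigrams of the input.
std::vector<std::string> trigrams(const std::string& s) {
    std::vector<std::string> out;
    for (size_t i = 0; i + 3 <= s.size(); ++i)
        out.push_back(s.substr(i, 3));
    return out;
}

// Score input against a per-language trigram frequency profile.
double score(const std::string& text, const std::map<std::string, double>& profile) {
    double total = 0.0;
    for (const std::string& g : trigrams(text)) {
        auto it = profile.find(g);
        if (it != profile.end()) total += it->second;
    }
    return total;
}

int main() {
    // "die" is a frequent trigram in both German and English ("die", "died",
    // "audience"...), so a three-letter input simply produces a tie.
    std::map<std::string, double> german { {"die", 0.9}, {"der", 0.8} };
    std::map<std::string, double> english{ {"die", 0.9}, {"the", 1.0} };
    std::cout << "de: " << score("die", german)  << "\n"   // de: 0.9
              << "en: " << score("die", english) << "\n";  // en: 0.9
}
```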

As I said, good luck. If I were you, I'd rethink my strategy ;)

Paweł Dyda
  • Dyda: Do you have any experience with Lucene? I noticed Lucene has a built-in break iterator for multilingual text called CompositeBreakIterator. Is it reliable? – ManojMarathayil May 10 '12 at 04:46
  • @Manoj: Honestly, I haven't played with Apache Lucene myself. All I know about it is that you need to normalize the text you feed into it, otherwise you will get unpredictable results. Also, one of our teams raised concerns about search reliability, but I can't say whether that is valid or not - some serious research would be required. – Paweł Dyda May 10 '12 at 05:46
  • @PawełDyda I wrote a simple language detector for Devanagari scripts. The idea was to accept only `UTF8` encoded data, iterate over each character and decode it to get the code point, then match the code point against the Unicode character ranges to identify which language's range it belongs to. While iterating, I ignore joiners and non-joiners. If all the characters fall in the same language range, I report that language; iteration stops when we get a code point that is in a different range. This works well so far (a sketch of this approach appears after these comments). I was wondering, could this method be used for all the non-Devanagari languages? – Navaneeth K N Aug 11 '12 at 06:22
  • @Appu: I see two issues: 1. What if you have a mixed-script environment, that is, English words (Latin script) in a sentence otherwise written in Devanagari script? That happens all the time, I believe. 2. Some code ranges happen to be used in more than one language, e.g. the letter "ą" (a with ogonek) is used in both Polish and Lithuanian, and most Cyrillic characters are used in Russian, Belarusian, Ukrainian, Bulgarian, Macedonian, Serbian and even Mongolian (there are others as well). There is no way to match the language without a statistical language profile = n-grams. – Paweł Dyda Aug 11 '12 at 08:48
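
A minimal sketch of the script-range approach described in the comments above, substituting ICU's uscript API for hand-maintained code-point tables (the original detector reportedly used explicit Devanagari ranges). As the follow-up comment warns, this identifies a script, not a language:

```cpp
// Hypothetical sketch; uses ICU's uscript lookup rather than hand-kept ranges,
// and identifies a *script*, not a language.
#include <unicode/uscript.h>
#include <unicode/utf8.h>
#include <string>

// Returns the single script covering all code points, or USCRIPT_INVALID_CODE
// as soon as a second script shows up (mirrors "iteration stops" above).
UScriptCode detectSingleScript(const std::string& utf8) {
    UScriptCode found = USCRIPT_COMMON;  // "nothing concrete seen yet"
    int32_t i = 0;
    const int32_t len = static_cast<int32_t>(utf8.size());
    while (i < len) {
        UChar32 c;
        U8_NEXT(utf8.c_str(), i, len, c);            // decode one code point
        if (c < 0) continue;                         // skip malformed sequences
        if (c == 0x200C || c == 0x200D) continue;    // skip ZWNJ / ZWJ (joiners)
        UErrorCode status = U_ZERO_ERROR;
        UScriptCode sc = uscript_getScript(c, &status);
        if (U_FAILURE(status) || sc == USCRIPT_COMMON || sc == USCRIPT_INHERITED)
            continue;                                // punctuation, digits, marks
        if (found == USCRIPT_COMMON)
            found = sc;                              // first concrete script seen
        else if (sc != found)
            return USCRIPT_INVALID_CODE;             // mixed scripts: give up
    }
    return found;
}
```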

I've developed a language detection engine using ICU that basically does the following:

  1. Discover basic "words" using ICU BreakIterator and English (Locale::getEnglish()) rules
  2. Feed the words from #1 to my engine which in turn gives me the "true" language(s) sorted by scores

For your purposes, since your input is UTF-8, you can use the setText() overload taking a UText* (note the example linked here; it's almost exactly what you need, though you may want to use the C++ APIs), which can be set up to traverse UTF-8.
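
A minimal sketch of the word-extraction step (#1 above), assuming ICU 4.x with `utext_openUTF8` plus `BreakIterator::setText(UText*, UErrorCode&)`; with a UText over UTF-8, the reported boundaries are byte offsets into the original buffer:

```cpp
// Minimal sketch, assuming ICU 4.x headers are available; error handling trimmed.
#include <unicode/brkiter.h>
#include <unicode/utext.h>
#include <memory>
#include <string>
#include <vector>

// Split a UTF-8 buffer into word candidates using the English word rules.
std::vector<std::string> extractWords(const std::string& utf8) {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> bi(
        icu::BreakIterator::createWordInstance(icu::Locale::getEnglish(), status));

    // Wrap the UTF-8 buffer directly; no conversion to UTF-16 is needed.
    UText ut = UTEXT_INITIALIZER;
    utext_openUTF8(&ut, utf8.data(), static_cast<int64_t>(utf8.size()), &status);
    bi->setText(&ut, status);
    if (U_FAILURE(status)) return {};

    std::vector<std::string> words;
    int32_t start = bi->first();
    for (int32_t end = bi->next(); end != icu::BreakIterator::DONE;
         start = end, end = bi->next()) {
        std::string token = utf8.substr(start, end - start);
        // Skip segments that are pure whitespace/punctuation.
        if (token.find_first_not_of(" \t\r\n.,;:!?\"'") != std::string::npos)
            words.push_back(token);
    }
    utext_close(&ut);
    return words;
}
```

Each extracted token can then be handed to the language-scoring engine from step #2.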

NuSkooler
  • The problem with BreakIterator is that it does not work correctly with Asian languages, which is clearly stated in its documentation. Also, ICU's language detection reliability is a bit worse than desired (from my research, which unfortunately I cannot share). – Paweł Dyda May 09 '12 at 20:16
  • I misread your post. The answer I posted above works great as a pre-processing step for 'word' extraction to feed to a language detector (in my case I use an n-gram engine). As far as parsing boundaries for CJK goes, it's very, VERY complex :) – NuSkooler May 09 '12 at 20:24