I'm developing a custom search engine, and I need to pass each word to the appropriate language-specific stemmer.
I recently discovered Compact Language Detector (CLD) (http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html), which provides impressive language detection for a UTF-8 buffer.
While CLD is great for detecting the language of a given buffer, I need to extract the word boundaries from the buffer as well as detect the language of each of these words.
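To make the requirement concrete, here is a rough sketch of the kind of output I'm after. It assumes a naive regex tokenizer (runs of Unicode letters) and uses the Unicode script of each word's first character as a stand-in for a real per-word detector. The names `words_with_offsets`, `guess_script`, and `tag_words` are hypothetical, none of this uses CLD, and script is only a coarse proxy for language (it cannot tell English from French, for example).

```python
import re
import unicodedata

# Runs of Unicode letters: word characters minus digits and underscore.
# Assumption: this is a naive tokenizer. It splits on combining marks and
# does nothing useful for scripts written without spaces (e.g. Chinese).
WORD_RE = re.compile(r"[^\W\d_]+")


def words_with_offsets(text):
    """Yield (start, end, word) for each run of letters in text."""
    for match in WORD_RE.finditer(text):
        yield match.start(), match.end(), match.group()


def guess_script(word):
    """Hypothetical placeholder for a per-word language detector.

    Returns the first token of the Unicode name of the word's first
    character, e.g. 'LATIN' or 'CYRILLIC'. A script is not a language.
    """
    name = unicodedata.name(word[0], "")
    return name.split(" ")[0] if name else "UNKNOWN"


def tag_words(text):
    """Return a list of (word, start, end, script) tuples."""
    return [
        (word, start, end, guess_script(word))
        for start, end, word in words_with_offsets(text)
    ]


if __name__ == "__main__":
    for item in tag_words("hello привет world"):
        print(item)
```

Ideally the second element of that pipeline would return a language code per word, so each word could be routed to the matching stemmer.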
Any ideas?