I have documents with a text field. The text could be in any of dozens of languages, though most are supported by MongoDB (English, Russian, German, French, etc.). Each document also has a language field that tells MongoDB the language of its text field. How does MongoDB handle unsupported languages, like Urdu or Swahili? A post about MongoDB 2.4 suggests that indexing cannot be performed on unsupported languages. An answer to this question suggests that indexing, but not lemmatization, is performed on unsupported languages. For my case, it is fine if no lemmatization is performed.
-
you can encode the language and then store it, to have the indexing. – Mohsen Shakiba Jun 09 '15 at 05:51
-
But what if the language is something like Arabic that MongoDB does not stem (because the Snowball software it uses does not)? – ZacharyST Jun 09 '15 at 06:14
-
Actually, I did a project on Arabic a while ago and it stored Arabic just fine in the database, no encoding required, and the indexing was working fine too. Not sure about the text indexing though. – Mohsen Shakiba Jun 09 '15 at 06:24
-
Yea, I think it indexes but just doesn't lemmatize. – ZacharyST Jun 09 '15 at 21:03
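One way to get "index but don't lemmatize" explicitly is the special language value "none", which the MongoDB text search documentation describes as simple tokenization with no stop words and no stemming. Below is a minimal sketch of normalizing the language field before inserting. The names `SUPPORTED`, `text_language`, and `prepare` are hypothetical, and the language list is an assumption based on the documented Snowball languages, so check it against your server version.

```python
# Languages the text index can stem (assumed list; verify for your version).
SUPPORTED = {
    "danish", "dutch", "english", "finnish", "french", "german", "hungarian",
    "italian", "norwegian", "portuguese", "romanian", "russian", "spanish",
    "swedish", "turkish",
}


def text_language(language):
    """Return the language name if it is in SUPPORTED, otherwise "none"."""
    name = (language or "").strip().lower()
    return name if name in SUPPORTED else "none"


def prepare(doc):
    """Copy a document, replacing an unsupported or missing language with "none"."""
    out = dict(doc)
    # "language" is the default name of the per-document language override field.
    out["language"] = text_language(doc.get("language"))
    return out


docs = [
    {"text": "Привет, мир", "language": "Russian"},
    {"text": "Habari ya dunia", "language": "swahili"},
]
prepared = [prepare(d) for d in docs]
```

The prepared documents can then be inserted into a collection with a text index on `text`. Urdu or Swahili text would be tokenized and searchable, only without stemming or stop-word removal.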