
I am working on a text classification problem with a multilingual dataset. I would like to know how the languages are distributed in my dataset and which languages they are; there are probably around 8-12 of them. I am treating this language detection as part of the preprocessing: identifying the languages would let me use the appropriate stop words and see how having less data in some of the languages could affect the accuracy of the classification.

Is langid.py or plain langdetect suitable, or are there any other suggestions?
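
For context, this is roughly how I would compute the distribution with langdetect (the `docs` list is just a placeholder for my real documents):

```python
from collections import Counter
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic without a fixed seed

docs = [
    "This is an English document.",
    "Ceci est un document français.",
]
distribution = Counter(detect(doc) for doc in docs)
print(distribution)  # e.g. Counter({'en': 1, 'fr': 1})
```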

Thanks

yolo25

2 Answers

The easiest way to identify the language of a text is to keep a list of common grammatical words for each language (pretty much your stop words, in fact), take a sample of the text, and count how many of its words occur in each language-specific word list. The list with the largest overlap indicates the language of the text.
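
A minimal sketch of that idea, with toy stop-word lists standing in for real ones (full lists, e.g. NLTK's, would be a more realistic starting point):

```python
# Toy stop-word lists; real lists would be far longer.
STOP_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "it", "that"},
    "fr": {"le", "la", "et", "est", "de", "les", "un", "une"},
    "de": {"der", "die", "und", "ist", "das", "ein", "eine", "zu"},
}

def detect_language(text):
    """Return the language whose stop words occur most often in the text."""
    tokens = text.lower().split()
    counts = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOP_WORDS.items()
    }
    return max(counts, key=counts.get)

print(detect_language("die Katze ist ein Tier"))  # -> 'de'
```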

If you want to be more advanced, you can use n-grams instead of words: collect n-gram frequency profiles from texts whose language you know, and use those profiles as the classifier instead of your stop words.
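
A minimal sketch with character trigrams; the two training strings are made-up stand-ins for the much larger per-language samples you would actually profile:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Frequencies of character n-grams, with padding around the text."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Tiny made-up training texts; build real profiles from large samples.
training = {
    "en": "the quick brown fox jumps over the lazy dog near the river",
    "de": "der schnelle braune fuchs springt über den faulen hund am fluss",
}
profiles = {lang: char_ngrams(text) for lang, text in training.items()}

def detect(text):
    """Pick the language whose n-gram profile overlaps the sample most."""
    sample = char_ngrams(text)
    def overlap(lang):
        return sum(min(c, profiles[lang][g]) for g, c in sample.items())
    return max(profiles, key=overlap)

print(detect("über den faulen hund"))  # -> 'de'
```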

Oliver Mason

You could use any transformer-based model trained on multiple languages. For instance, you could use XLM-RoBERTa, a multilingual model pretrained on 100 different languages. Unlike some XLM multilingual models, it does not require language tensors to know which language is being used (which is good in your case) and should be able to handle the correct language from the input IDs alone. Besides, like any other transformer-based model, it comes with its own tokenizer, so you could skip that part of the preprocessing.

You could use the Hugging Face transformers library to load any of these models.

Check the XLM-RoBERTa documentation on Hugging Face.
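
As a minimal sketch with the transformers library (the `num_labels=4` is a made-up number of target classes, and the classification head still has to be fine-tuned on your data):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Multilingual tokenizer plus a fresh classification head on xlm-roberta-base.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=4  # hypothetical number of target classes
)

# No language flag anywhere: the same tokenizer handles any input language.
texts = [
    "This is an English sentence.",
    "Ceci est une phrase française.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits  # shape (2, 4); untrained until fine-tuned
```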

Bahri Mohamed Aziz