I am working on a text classification problem with a multilingual dataset. I would like to know which languages appear in my dataset and how they are distributed; there are probably around 8-12 languages. I am treating language detection as part of the preprocessing: identifying the languages would let me apply the appropriate stop words and see how having less data in some of the languages affects the accuracy of the classification.
Is langid.py or plain langdetect suitable for this, or are there other suggestions?
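For context, here is a minimal sketch of what I have in mind with langdetect (`texts` is just a placeholder for my list of raw document strings); I assume the same loop would work with `langid.classify`:

```python
from collections import Counter

from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect deterministic across runs


def language_distribution(texts):
    """Count the detected language of each document in `texts`."""
    counts = Counter()
    for text in texts:
        try:
            counts[detect(text)] += 1
        except LangDetectException:
            counts["unknown"] += 1  # e.g. empty or non-linguistic text
    return counts


# Example: language_distribution(texts).most_common()
# would give me the per-language document counts I'm after.
```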
Thanks