You need to modify the token regular expression. The default regex looks for runs of Unicode letter characters, possibly with punctuation inside them (for example, "don't" or "multi-word"). In Java regexes these classes are \p{L} (letters) and \p{P} (punctuation).
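South Asian scripts often include Unicode "mark" characters, such as dependent vowel signs, which are \p{M} in regex. The default pattern does not accept them, so words get cut off wherever a mark appears. Here's an example using the Hindi Wikipedia article for South Korea, first with the default regex and then with one that also allows marks: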
$ bin/mallet import-file --input hindi.txt --print-output
name: 1
target: Hindi
input: 대한민국(0)=1.0
大韩民国(1)=1.0
सबस(2)=3.0
नगर(3)=2.0
लगत(4)=1.0
एकम(5)=1.0
सकल(6)=2.0
रहव(7)=2.0
यवस(8)=1.0
ययन(9)=1.0
करन(10)=1.0
eps(11)=1.0
करत(12)=1.0
$ bin/mallet import-file --input hindi.txt --print-output --token-regex '[\p{L}\p{M}]+'
name: 1
target: Hindi
input: दक्षिण(0)=4.0
कोरिया(1)=7.0
कोरियाई(2)=4.0
대한민국(3)=1.0
देहान्(4)=1.0
मिन्गुक(5)=1.0
大韩民国(6)=1.0
हंजा(7)=2.0
पूर्वी(8)=1.0
एशिया(9)=2.0
में(10)=7.0
स्थित(11)=2.0
एक(12)=4.0
देश(13)=6.0
...
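To see why adding \p{M} matters, the effect can be reproduced outside MALLET with plain java.util.regex. This is only a sketch: the class name TokenRegexDemo and the tokenize helper are made up for illustration, and the letters-only pattern \p{L}+ is a simplified stand-in, not MALLET's actual default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of MALLET.
public class TokenRegexDemo {

    // Collect every non-overlapping match of the regex, in order.
    static List<String> tokenize(String regex, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Hindi word for "Korea", written with escapes so the source
        // file encoding does not matter: consonant, vowel sign, consonant,
        // vowel sign, consonant, vowel sign.
        String korea = "\u0915\u094B\u0930\u093F\u092F\u093E";

        // Letters only: every vowel sign (category M) ends the token,
        // so the word falls apart into three bare consonants.
        System.out.println(tokenize("\\p{L}+", korea));

        // Letters and marks: the word stays in one piece.
        System.out.println(tokenize("[\\p{L}\\p{M}]+", korea));
    }
}
```

The first call yields three single-consonant tokens and the second yields the whole word, which matches the kind of damage visible in the first MALLET run above.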
There's currently no stoplist for Hindi. A reasonable start would be to collect the words that occur at least once in more than 10% of documents and treat them as stopword candidates.
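One rough way to find those candidates is to count document frequency with the same letters-and-marks regex. The sketch below assumes one document per line in a UTF-8 text file; the class name StoplistCandidates and its candidates method are hypothetical, and the resulting list still needs a manual review before use.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of MALLET.
public class StoplistCandidates {

    // Same token pattern as the --token-regex option used above.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{M}]+");

    // Return the words that occur in more than minFraction of the documents.
    static SortedSet<String> candidates(List<String> docs, double minFraction) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Use a set so each word counts at most once per document.
            Set<String> seen = new HashSet<>();
            Matcher m = TOKEN.matcher(doc);
            while (m.find()) {
                seen.add(m.group());
            }
            for (String word : seen) {
                docFreq.merge(word, 1, Integer::sum);
            }
        }
        SortedSet<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() > minFraction * docs.size()) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StoplistCandidates <one-document-per-line.txt>");
            return;
        }
        List<String> docs = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        // 0.10 is the 10% threshold suggested above.
        for (String word : candidates(docs, 0.10)) {
            System.out.println(word);
        }
    }
}
```

The output is one word per line, so after pruning it by hand it should be usable as a stoplist file for the import step.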