I would like to use Mallet on Wikipedia articles in English, Spanish, German, French, Russian and Hindi. It runs well on the first five languages, but not on Hindi: the Hindi tokens in the output are missing their vowel signs and conjunct consonants. Does anyone have any advice?

Also, is there a library of stop-words for other languages?

Thanks

Tom Stieve

1 Answer

You need to modify the token regular expression. The default regex looks for groups of Unicode letter characters, possibly including punctuation (for example, "don't" or "multi-word"). In Java regexes these categories are written \p{L} (letters) and \p{P} (punctuation).

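South Asian scripts often include Unicode "mark" characters, which are \p{M} in regex. In Devanagari, the dependent vowel signs and the virama that joins conjunct consonants are marks rather than letters, so a pattern without \p{M} drops them and breaks words at those positions. Here's an example using the Hindi Wikipedia article for South Korea, first with the default pattern and then with marks allowed: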

$ bin/mallet import-file --input hindi.txt --print-output
name: 1
target: Hindi
input: 대한민국(0)=1.0
大韩民国(1)=1.0
सबस(2)=3.0
नगर(3)=2.0
लगत(4)=1.0
एकम(5)=1.0
सकल(6)=2.0
रहव(7)=2.0
यवस(8)=1.0
ययन(9)=1.0
करन(10)=1.0
eps(11)=1.0
करत(12)=1.0

$ bin/mallet import-file --input hindi.txt --print-output --token-regex '[\p{L}\p{M}]+'
name: 1
target: Hindi
input: दक्षिण(0)=4.0
कोरिया(1)=7.0
कोरियाई(2)=4.0
대한민국(3)=1.0
देहान्(4)=1.0
मिन्गुक(5)=1.0
大韩民国(6)=1.0
हंजा(7)=2.0
पूर्वी(8)=1.0
एशिया(9)=2.0
में(10)=7.0
स्थित(11)=2.0
एक(12)=4.0
देश(13)=6.0
...
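
To see why the marks matter, here is a minimal Python sketch that splits text by Unicode general category. It is only an illustration of the letter/mark distinction, not Mallet's tokenizer: the `tokenize` helper and its `allowed_prefixes` parameter are hypothetical names, and Python's `unicodedata` module stands in for the Java \p{L} and \p{M} classes.

```python
import unicodedata

def tokenize(text, allowed_prefixes):
    """Split text into maximal runs of characters whose Unicode general
    category starts with one of the allowed prefixes (e.g. "L" or "M")."""
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in allowed_prefixes:
            current.append(ch)
        elif current:
            # A disallowed character ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

text = "दक्षिण कोरिया"

# General category of each code point in the first word:
# letters are "Lo", the virama and vowel signs are "Mn" / "Mc".
categories = [unicodedata.category(ch) for ch in "दक्षिण"]

# Letters only: every mark acts as a token boundary, so words fall apart.
letters_only = tokenize(text, {"L"})

# Letters and marks: the two words survive intact.
letters_and_marks = tokenize(text, {"L", "M"})

print(categories)
print(letters_only)
print(letters_and_marks)
```

With letters only, each word is cut into fragments of bare consonants, which matches the vowel-less tokens in the first run above. Allowing marks keeps each word whole, as in the second run.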

There's currently no stoplist for Hindi. Looking for words that occur at least once in more than 10% of documents would be a reasonable start.
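
The document-frequency heuristic could be sketched as follows. This is a rough Python illustration under stated assumptions, not part of Mallet: `candidate_stopwords` and `min_doc_fraction` are hypothetical names, the input is assumed to be already tokenized (with marks kept, as above), and the three toy documents are made up for the demonstration.

```python
from collections import Counter

def candidate_stopwords(documents, min_doc_fraction=0.10):
    """Return the words that occur at least once in more than
    min_doc_fraction of the documents.

    documents: an iterable of token lists, one list per document.
    """
    docs = list(documents)
    doc_freq = Counter()
    for tokens in docs:
        # Use a set so each word counts at most once per document.
        doc_freq.update(set(tokens))
    threshold = min_doc_fraction * len(docs)
    return sorted(word for word, df in doc_freq.items() if df > threshold)

# Toy corpus (invented for illustration); a real run would use
# thousands of articles and the 10% default threshold.
toy_docs = [
    ["में", "एक", "देश"],
    ["में", "एशिया"],
    ["में", "एक", "कोरिया"],
]

# With only three documents, use a 50% threshold so the cut is visible.
stop_candidates = candidate_stopwords(toy_docs, min_doc_fraction=0.5)
print(stop_candidates)
```

The resulting list is only a starting point: it is worth reviewing by hand, since some frequent words may be topical for a given corpus rather than true function words.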

David Mimno
  • David, thank you very much for your answer. I have tried your suggestion, and also tried matching Devanagari specifically, following https://www.regular-expressions.info/unicode.html, but I can't get it to work. I've noticed that my command prompt does not display Hindi or Russian, although it can display German diacritics. I've tried many ways to set it to Unicode, but it still doesn't work. I'm working on Windows 7 at my university and I've asked my IT people for help. Do you have any suggestions? – Tom Stieve Feb 14 '18 at 20:25
  • David, I'm sorry, I know you must be busy, but do you have any ideas about how we can resolve this? We have tried Mallet on Windows 7 and 10, but it still doesn't work. Should we try this on Linux? – Tom Stieve Apr 30 '18 at 23:35
  • This sounds like it's likely to be an OS-specific question that would be difficult to debug remotely. If you want to show detailed output for what you saw it might help. – David Mimno May 01 '18 at 16:41