0

I want to implement a classic Markov model problem: train a Markov model on English text so it learns English letter patterns, and then use it to distinguish English text from random strings.

I decided to use hmmlearn so I don't have to write my own. However, I am confused about how to train it. It seems to require the number of components in the HMM, but what is a reasonable number for English? Also, couldn't I use a simple higher-order Markov model instead of a hidden one? Presumably the interesting property is the pattern of n-grams, not hidden states.

Superbest
  • 25,318
  • 14
  • 62
  • 134

1 Answer

0

hmmlearn is designed for unsupervised learning of HMMs, while your problem is clearly supervised: given examples of English and random strings, learn to distinguish between the two. Also, as you've correctly pointed out, the notion of hidden states is tricky to define for text data, so for your problem plain Markov models are more appropriate. You should be able to implement one in under 100 lines of Python.
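For instance, here is a rough sketch of the plain-Markov-model approach: train character-bigram transition probabilities on an English corpus, then score a candidate string by its average per-transition log-likelihood. All function names, the smoothing scheme, and the toy corpus below are illustrative choices of mine, not something prescribed by the question.

```python
import math
from collections import defaultdict

def train_bigram_model(text, alpha=1.0):
    """Fit a first-order character Markov model with add-alpha smoothing.

    Returns (logprobs, alphabet), where logprobs[a][b] is the smoothed
    log-probability of character b following character a.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    alphabet = sorted(set(text))
    logprobs = {}
    for a in alphabet:
        total = sum(counts[a].values()) + alpha * len(alphabet)
        logprobs[a] = {b: math.log((counts[a][b] + alpha) / total)
                       for b in alphabet}
    return logprobs, alphabet

def avg_log_likelihood(logprobs, s, floor=-20.0):
    """Average log-likelihood per transition; unseen pairs get a floor penalty."""
    ll, n = 0.0, 0
    for a, b in zip(s, s[1:]):
        ll += logprobs.get(a, {}).get(b, floor)
        n += 1
    return ll / max(n, 1)

# Toy training corpus -- in practice you would use a much larger English sample.
corpus = ("the quick brown fox jumps over the lazy dog. "
          "she sells sea shells by the sea shore. "
          "it was the best of times, it was the worst of times.")
logprobs, alphabet = train_bigram_model(corpus)

english_score = avg_log_likelihood(logprobs, "the best sea shore")
random_score = avg_log_likelihood(logprobs, "xq zvkj qwzx jvkq")
```

English-like strings should score noticeably higher than random ones, so a simple threshold on the average log-likelihood (calibrated on held-out examples of both classes) turns this into a classifier. Extending it to higher-order n-grams just means keying the counts on the previous n-1 characters instead of one.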

Sergei Lebedev
  • 2,659
  • 20
  • 23