I have a list of reviews; each element of the list is one review from the IMDB data set on Kaggle. There are 25,000 reviews in total, and each review has a label: +1 for positive and -1 for negative.

I want to train a Hidden Markov Model with these reviews and labels.

1- What sequence should I give to the HMM? Is it something like bag of words, or something else, such as probabilities that I need to calculate? What kind of feature extraction method is appropriate? I was told to use bag of words on the list of reviews, but after searching a little I found out that an HMM cares about order, whereas bag of words doesn't preserve the order of words in a sequence. How should I prepare this list of reviews so that I can feed it into an HMM?

2- Is there a framework for this? I know hmmlearn, and I think I should use MultinomialHMM; correct me if I'm wrong. But it is not supervised: its models do not take labels as input for training, and I get some odd errors that I don't know how to solve, because of the first question about the correct type of input to give it. seqlearn is one I found recently; is it good, or is there a better one to use?
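
To make the concern in question 1 concrete, here is a minimal sketch of the difference between an order-free bag of words and an order-preserving sequence of word indices. The helper names (`build_vocab`, `to_index_sequence`, `to_bag_of_words`) and the whitespace tokenisation are assumptions made for illustration; they are not part of hmmlearn or seqlearn.

```python
from collections import Counter


def build_vocab(reviews):
    """Map each distinct lowercase token to an integer id (hypothetical helper)."""
    vocab = {}
    for review in reviews:
        for token in review.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_index_sequence(review, vocab):
    """Order-preserving encoding: one integer per known token, in reading order."""
    return [vocab[token] for token in review.lower().split() if token in vocab]


def to_bag_of_words(review):
    """Order-free encoding: only the count of each token is kept."""
    return Counter(review.lower().split())


# Two toy reviews with the same words in a different order
reviews = ["not good but funny", "funny but not good"]
vocab = build_vocab(reviews)
seqs = [to_index_sequence(r, vocab) for r in reviews]
bags = [to_bag_of_words(r) for r in reviews]

print(seqs[0] != seqs[1])  # the index sequences differ
print(bags[0] == bags[1])  # the bags of words are identical
```

The two toy reviews produce the same bag of words but different index sequences, which is the information a sequence model would need if word order is supposed to matter.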

I would appreciate any guidance, since I have almost zero knowledge of NLP.

leo
  • HMMs are used when you need to assign one label for each item in a sequence. In sentiment analysis, you assign a single label to the whole sequence (the review), so HMMs are not very appropriate for this task. Instead, you can turn to a Naive Bayes classifier [(as in this blog post)](https://medium.com/@martinpella/naive-bayes-for-sentiment-analysis-49b37db18bf8). Both HMMs and Naive Bayes can be learned either in a supervised setting or in an unsupervised setting (you specify the number of labels, and usually use the Expectation-Maximization algorithm to learn them without supervision). – mcoav Nov 10 '18 at 12:05
  • Indeed, that is what I found out too: you give a label to each item in a sequence. But this is a project for my class and I must use an HMM; I can't use anything else. I know how HMMs work at an abstract level, but I can't map my little knowledge of HMMs to this problem. Thanks for the feedback. – leo Nov 10 '18 at 12:24

1 Answer

I was able to do it, with surprisingly good accuracy, yet I am not sure exactly what happened. I used the seqlearn framework, whose documentation is poor. I really suggest using MATLAB instead of Python for HMMs.

I used sklearn's TfidfVectorizer for feature extraction, then I did this:

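import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from seqlearn.hmm import MultinomialHMM

# TF-IDF features; norm=None keeps the raw (unnormalised) weights
vectorizer = TfidfVectorizer(norm=None)
x_train = vectorizer.fit_transform(train_review)
x_test = vectorizer.transform(test_review)

# every review (one row of X) is treated as its own sequence of length 1;
# lengths must be integers that sum to the number of rows
len_train_seq = np.ones(len(train_review), dtype=int)
len_test_seq = np.ones(len(test_review), dtype=int)

# Y holds the +1/-1 training labels, one per review
model = MultinomialHMM()
HMM_Classifier = model.fit(x_train, Y, lengths=len_train_seq)
y_predict = HMM_Classifier.predict(x_test, lengths=len_test_seq)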
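
The `lengths` argument passed to `fit` and `predict` above describes how consecutive rows of the feature matrix are grouped into sequences. The following is a minimal sketch of that grouping convention as it is understood here, not seqlearn's own code; `split_by_lengths` is a hypothetical helper and the row names are placeholders.

```python
from itertools import accumulate


def split_by_lengths(rows, lengths):
    """Split a flat list of rows into consecutive sequences of the given lengths."""
    if sum(lengths) != len(rows):
        raise ValueError("lengths must sum to the number of rows")
    ends = list(accumulate(lengths))   # exclusive end index of each sequence
    starts = [0] + ends[:-1]           # start index of each sequence
    return [rows[s:e] for s, e in zip(starts, ends)]


rows = ["review_0", "review_1", "review_2", "review_3"]

# All lengths equal to 1: every row is its own one-item sequence
one_per_review = split_by_lengths(rows, [1, 1, 1, 1])

# For comparison: the same rows grouped as a 3-item and a 1-item sequence
two_sequences = split_by_lengths(rows, [3, 1])

print(one_per_review)
print(two_sequences)
```

With every length set to 1, each review is a sequence containing a single item, so no label-to-label transition is ever observed inside a sequence. That may explain why the result behaves much like an ordinary per-document classifier rather than a model of word order.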

I would still appreciate it if someone knowledgeable about HMMs gave a more robust and clean guideline for doing sentiment analysis with an HMM.

leo