
I have a machine learning problem that I'm trying to solve. I'm using a Gaussian HMM (from hmmlearn) with 5 states, modelling extreme negative, negative, neutral, positive and extreme positive regimes in the sequence. I have set up the model in the gist below.

https://gist.github.com/stevenwong/cb539efb3f5a84c8d721378940fa6c4c

import numpy as np
import pandas as pd
from hmmlearn.hmm import GaussianHMM

# hmmlearn expects observations of shape (n_samples, n_features);
# the single-column DataFrame's .values is already 2D, so atleast_2d is a no-op here
x = pd.read_csv('data.csv')
x = np.atleast_2d(x.values)

# 5 hidden states, full covariance matrices, up to 10 EM iterations
h = GaussianHMM(n_components=5, n_iter=10, verbose=True, covariance_type="full")
h = h.fit(x)
y = h.predict(x)  # most likely (Viterbi) state sequence

The problem is that most of the estimated states converge to the middle, even though I can clearly see stretches of positive values and stretches of negative values; they all get lumped together. Any idea how I can get it to better fit the data?

[image: predicted states plotted against the observed series]
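
A quick way to quantify the lumping (a minimal diagnostic sketch, continuing from the code above; not part of the original fit) is to inspect the fitted parameters and count how many samples each state claims:

# Diagnostic sketch: which state absorbs most observations?
print(h.means_)                      # (5, 1) state means; the catch-all state sits near the middle
print(h.covars_)                     # (5, 1, 1) full covariances; a large value flags a catch-all state
print(np.bincount(y, minlength=5))   # samples assigned to each state by predict()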

EDIT 1:

Here is the transition matrix. I believe the way it's read in hmmlearn is across the rows, i.e., row i gives the probabilities of transitioning from state i to states 0, 1, 2, ... (see the quick check after the matrix).

In [3]: h.transmat_
Out[3]:
array([[ 0.19077231,  0.11117929,  0.24660208,  0.20051377,  0.25093255],
       [ 0.12289066,  0.17658589,  0.24874935,  0.24655888,  0.20521522],
       [ 0.15713787,  0.13912972,  0.25004413,  0.22287976,  0.23080852],
       [ 0.14199694,  0.15423031,  0.25024992,  0.2332739 ,  0.22024893],
       [ 0.17321093,  0.12500688,  0.24880728,  0.21205912,  0.2409158 ]])
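
As a quick check of that reading (a short sketch; hmmlearn documents transmat_[i, j] as the probability of moving from state i to state j, so each row should sum to 1):

# Sketch: confirm transmat_ is row-stochastic
print(h.transmat_.sum(axis=1))   # five values, each very close to 1.0
print(h.transmat_[2])            # transition probabilities out of state 2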

If I set all the transition probabilities to 0.2, it looks like the image below (if I average by state, the separation is worse); a sketch of that experiment follows the image.

[image: predicted states after setting all transition probabilities to 0.2]
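
For reference, a minimal sketch of that experiment (my reconstruction, not necessarily the exact code used): overwrite the learned transitions with a uniform matrix and decode again.

# Sketch: force uniform transitions, then re-run Viterbi decoding
h.transmat_ = np.full((5, 5), 0.2)
y_uniform = h.predict(x)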

swmfg

1 Answer


Apparently, your model learned a large variance for state 2. A GMM is a generative model trained with a maximum-likelihood criterion, so in some sense you got the optimal fit to the data. I can see it provides meaningful predictions in the extreme cases, so if you want it to attribute more observations to classes other than 2, I would try the following:

  1. Data preprocessing. Try using log values for your input to make the differences between them sharper.
  2. Look at your transition matrix; maybe the transition probabilities out of state 2 are too low. Try setting all probabilities equal and see what happens (a sketch of both suggestions follows this list).
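
A rough sketch of both suggestions combined (illustrative only; it assumes your input is strictly positive so the log step is valid, and it holds the transitions fixed at uniform by dropping 't' from hmmlearn's params/init_params):

# Sketch: log-transform the input (suggestion 1) and fit with transitions
# frozen at uniform (suggestion 2); EM then updates only start probs,
# means and covariances ('s', 'm', 'c').
x_log = np.log(x)  # assumes strictly positive values
h2 = GaussianHMM(n_components=5, n_iter=10, covariance_type="full",
                 init_params="smc", params="smc")
h2.transmat_ = np.full((5, 5), 0.2)  # set manually, since 't' is excluded from init
h2.fit(x_log)
y2 = h2.predict(x_log)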
Dmytro Prylipko
    Minor nitpick: the model in question is Gaussian HMM, not Gaussian Mixture Model aka GMM. – Sergei Lebedev Dec 23 '16 at 21:29
  • Thanks. This is what I've already done to the data: 1) the original data has visible cycles but is very noisy, so I've used a Kalman filter to smooth it out (parameters were chosen using the EM algorithm); 2) the data you see above is the log difference of the original time series. I've edited the post above following your suggestions. Have a good Christmas. – swmfg Dec 24 '16 at 09:51
  • If you shared the code for the states visualization, I could play with it and maybe suggest something helpful. For now I can only say that 10 iterations looks a bit low. Also, only 157 data samples... – Dmytro Prylipko Dec 24 '16 at 20:35
  • I've uploaded an Excel file to the gist; all I've done is chart the two time series. hmmlearn converged after 3 iterations, so it doesn't look like a convergence problem: https://gist.github.com/stevenwong/cb539efb3f5a84c8d721378940fa6c4c#file-data-xlsx – swmfg Dec 28 '16 at 09:21