Missing data in hmmlearn from scikit-learn

Question

I'm running a simple HMM using the hmmlearn module (formerly part of scikit-learn). It works for fully observed data, but it fails when I pass it observations with missing data. Small example:

import numpy as np
import hmmlearn
import hmmlearn.hmm as hmm

transmat = np.array([[0.9, 0.1],
                     [0.1, 0.9]])
emitmat = np.array([[0.5, 0.5],
                    [0.9, 0.1]])

# this does not work: cannot have missing data
obs = np.array([0, 1] * 5 + [np.nan] * 5)

# this works
#obs = np.array([0, 1] * 5 + [1] * 5)

startprob = np.array([0.5, 0.5])
h = hmm.MultinomialHMM(n_components=2,
                       startprob=startprob,
                       transmat=transmat)
h.emissionprob_ = emitmat
print(obs, type(obs))
posteriors = h.predict_proba(obs)
print(posteriors)

If obs is fully observed (every element is 0 or 1), it works, but I would like to get estimates for the unobserved data points. I tried encoding these as np.nan or None, but neither works. It raises IndexError: arrays used as indices must be of integer (or boolean) type (in hmm.py, line 430, in _compute_log_likelihood).

How can this be done in hmmlearn?

Sergei Lebedev · Accepted Answer · 2016-01-27T14:13:22.760


Currently there's no way of doing missing data imputation using hmmlearn.

As an ad hoc approach, you can partition the observation sequence into fully observed subsequences and then, for each gap that follows a subsequence, either pick the most likely next state and observation or simulate them randomly from the transition and emission probabilities. Note that this strategy can lead to inconsistencies at the subsequence boundaries.
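A rough sketch of that ad hoc approach, using plain NumPy rather than hmmlearn itself. The helper names (split_observed_runs, filter_last_state, fill_gap) are made up for illustration and are not part of any library. The forward filter stands in for the last row you would get from predict_proba on a fully observed run, and the gap is filled with the "most likely observation" variant, not the random one.

```python
import numpy as np

def split_observed_runs(obs):
    """Return (start, stop) pairs for each maximal run of non-NaN values."""
    observed = ~np.isnan(np.asarray(obs, dtype=float))
    runs, start = [], None
    for i, flag in enumerate(observed):
        if flag and start is None:
            start = i                      # a new observed run begins
        elif not flag and start is not None:
            runs.append((start, i))        # the run ended at index i (exclusive)
            start = None
    if start is not None:
        runs.append((start, len(observed)))
    return runs

def filter_last_state(obs_run, startprob, transmat, emitmat):
    """Forward pass: P(state at the end of the run | symbols in the run)."""
    symbols = np.asarray(obs_run).astype(int)   # indices must be integers
    alpha = np.asarray(startprob, dtype=float) * emitmat[:, symbols[0]]
    alpha /= alpha.sum()
    for s in symbols[1:]:
        alpha = alpha.dot(transmat) * emitmat[:, s]
        alpha /= alpha.sum()
    return alpha

def fill_gap(state_probs, transmat, emitmat, n_missing):
    """Push the state distribution through the gap, one step at a time,
    and pick the most probable symbol at each missing position."""
    probs = np.asarray(state_probs, dtype=float)
    filled = []
    for _ in range(n_missing):
        probs = probs.dot(transmat)            # predicted state distribution
        symbol_probs = probs.dot(emitmat)      # marginal over symbols
        filled.append(int(np.argmax(symbol_probs)))
    return filled, probs

# Matrices and observations from the question.
startprob = np.array([0.5, 0.5])
transmat = np.array([[0.9, 0.1],
                     [0.1, 0.9]])
emitmat = np.array([[0.5, 0.5],
                    [0.9, 0.1]])
obs = np.array([0, 1] * 5 + [np.nan] * 5)

runs = split_observed_runs(obs)
start, stop = runs[0]
state_probs = filter_last_state(obs[start:stop], startprob, transmat, emitmat)
filled, end_probs = fill_gap(state_probs, transmat, emitmat, len(obs) - stop)
print(runs, filled, end_probs)
```

With these emission probabilities, symbol 0 is at least as likely as symbol 1 in both states, so the argmax fill picks 0 at every missing position; drawing from symbol_probs instead would give the randomized variant. If a second observed run followed the gap, this sketch would restart it from startprob, which is exactly the kind of boundary inconsistency mentioned above.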

edited Jan 27 '16 at 14:13

answered Jan 27 '16 at 14:08


Adding inference of the next most likely sequence would be really helpful. Are there plans to add it? – mvd Jan 27 '16 at 21:05
At the moment we're focused on filling the missing parts for the 0.2.0 release. Feel free to open an [issue](https://github.com/hmmlearn/hmmlearn). – Sergei Lebedev Jan 28 '16 at 21:54
