
In most cases, the Baum-Welch algorithm is used to train a Hidden Markov model.

In many papers, however, it is argued that the BW algorithm only optimizes until it gets stuck in a local optimum.

Does there exist an exact algorithm that actually succeeds in finding the global optimum (other than enumerating nearly all possible models and evaluating them)?

Of course, for most applications BW will work fine. We are, however, interested in finding lower bounds on the amount of information lost when reducing the number of states. Therefore we always need to generate the best model possible.

We are thus looking for an exact algorithm for this NP-hard problem that enumerates only over a (potentially) exponential number of extreme points, and not over a discretized grid of floating-point values for each probability in the model.

Willem Van Onsem
  • try `Viterbi algorithm` – Khaled.K May 19 '14 at 07:55
  • Maybe you find the answers to [this question](http://mathoverflow.net/questions/32533/is-all-non-convex-optimization-heuristic) interesting. – ziggystar May 19 '14 at 08:09
  • 1
    @KhaledAKhunaifer: As far as I know the Viterbi algorithm determines exactly the most likely *hidden state sequence* given an observation sequence and a trained Hidden Markov model? – Willem Van Onsem May 19 '14 at 08:41
  • @CommuSoft It's an optimal dynamic programming algorithm; it uses n-grams to approximate the next state based on the n previous states in the sequence as a pattern – Khaled.K May 19 '14 at 09:26
  • But since the states are "hidden", one never knows which state one was in. I'm looking for an algorithm that, given one (or more) sequences of *observations* (not states), can *train* a Hidden Markov model such that the error on the given *observations* is globally minimized. – Willem Van Onsem May 19 '14 at 18:00

2 Answers


A quick search finds, in http://www.cs.tau.ac.il/~rshamir/algmb/98/scribe/html/lec06/node6.html, "In this case, the problem of finding the optimal set of parameters $\Theta^{\ast}$ is known to be NP-complete. The Baum-Welch algorithm [2], which is a special case of the EM technique (Expectation and Maximization), can be used for heuristically finding a solution to the problem." Therefore I suggest that an EM variant guaranteed to find the global optimum in polynomial time would prove P=NP; no such variant is known and in fact it probably does not exist.

This problem almost certainly is not convex, because there will in fact be multiple globally optimal solutions with the same score: given any proposed solution, which typically gives a probability distribution for observations given the underlying state, you can, for instance, rename hidden state 0 as hidden state 1 and vice versa, adjusting the probability distributions so that the observed behaviour generated by the two different solutions is identical. Thus if there are N hidden states there are at least N! local optima produced by permuting the hidden states amongst themselves, as the sketch below illustrates.
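
A minimal sketch of that label-switching symmetry in plain NumPy (the parameter values are made up purely for illustration): permuting the hidden states gives a different parameter vector with exactly the same likelihood, so the likelihood surface cannot have a single isolated global optimum.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Likelihood P(obs | pi, A, B) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

pi = np.array([0.6, 0.4])                 # initial state distribution
A  = np.array([[0.7, 0.3],                # transition matrix
               [0.2, 0.8]])
B  = np.array([[0.9, 0.1],                # emission matrix (2 symbols)
               [0.3, 0.7]])
obs = [0, 1, 1, 0, 1]

perm = [1, 0]                             # swap hidden states 0 and 1
pi_p = pi[perm]
A_p  = A[np.ix_(perm, perm)]
B_p  = B[perm, :]

# two distinct parameter settings, identical likelihood
print(forward_likelihood(pi, A, B, obs))
print(forward_likelihood(pi_p, A_p, B_p, obs))
```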

On another application of EM, https://www.math.ias.edu/csdm/files/11-12/amoitra_disentangling_gaussians.pdf provides an algorithm for finding a globally optimal Gaussian mixture model. It observes that the EM algorithm is often used for this problem, but points out that it is not guaranteed to find the global optimum, and it does not reference as related work any version of EM which might (it also notes that the EM algorithm is slow).

If you are trying to do some sort of likelihood ratio test between, e.g., a 4-state model and a 5-state model, it would clearly be embarrassing if, due to local optima, the fitted 5-state model had a lower likelihood than the 4-state model. One way to avoid this, or to recover from it, is to start the 5-state EM from a starting point very close to the best 4-state model found. For instance, you could create a 5th state with probability epsilon and with an output distribution that is an average of the four existing output distributions, keeping the 4-state distributions as the other four distributions in the new 5-state model, and multiplying in a factor of (1 - epsilon) where needed so that everything still sums to one (a rough sketch follows below).
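
A rough sketch of that warm start, assuming a discrete-emission HMM with parameters (pi, A, B): grow the best 4-state model into a 5-state starting point by adding a low-probability fifth state whose output distribution is the average of the existing ones. Exactly where the epsilon mass goes is a design choice; this is just one plausible way.

```python
import numpy as np

def grow_model(pi, A, B, eps=1e-3):
    n = pi.shape[0]                       # current number of states (4 here)
    avg_out = B.mean(axis=0)              # averaged output distribution

    pi5 = np.append(pi * (1 - eps), eps)  # fifth state starts with mass eps

    A5 = np.zeros((n + 1, n + 1))
    A5[:n, :n] = A * (1 - eps)            # old transitions, rescaled
    A5[:n, n] = eps                       # small chance of entering state 5
    A5[n, :] = np.append(pi * (1 - eps), eps)  # leave state 5 like the old prior

    B5 = np.vstack([B, avg_out])          # old emissions plus the averaged row
    return pi5, A5, B5

# toy 4-state parameters purely for illustration
pi4 = np.full(4, 0.25)
A4 = np.full((4, 4), 0.25)
B4 = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1],
               [0.2, 0.2, 0.6],
               [0.4, 0.3, 0.3]])
pi5, A5, B5 = grow_model(pi4, A4, B4)
print(A5.sum(axis=1))   # each row still sums to 1

# pi5, A5, B5 would then seed Baum-Welch for the 5-state fit, so its likelihood
# starts essentially at (and can only improve on) the best 4-state likelihood.
```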

mcdowella
  • True, we expected this. We are however looking for an exact algorithm for this NP-hard problem that enumerates only over a (potentially) exponential number of *extreme points*, and not over a discretized grid of floating-point values for each probability in the model. – Willem Van Onsem May 19 '14 at 04:51
  • @CommuSoft The problem is that there doesn't seem to be any reason to believe that the problem solved by Baum-Welch is convex. If you're desperate, there are solvers for nonconvex problems, but I would be greatly surprised if they could handle nontrivial instances. – David Eisenstat May 19 '14 at 14:46
  • I would be very surprised if the problem is convex. I have added comments on this, and some more info, in as an edit at the end of my answer. – mcdowella May 19 '14 at 17:28

I think if you really want this, you can, given a local optimum, define a domain of convergence. Under some reasonably weak conditions, you can then quickly show either that the whole space is in the domain of convergence, or that there is a second local minimum.

E.g., suppose I have two independent variables (x, y) and one dependent variable z, and suppose that, given a local minimum z_1 = (x_1, y_1) and a pair of starting points p_1 = (x_2, y_1) and p_2 = (x_1, y_2) which both converge to z_1, I might be able to prove that the whole triangle z_1, p_1, p_2 lies in the domain of convergence.

Of course, this is not an approach which works in general, but you can solve a subclass of problems efficiently. E.g., some problems have no domain of convergence in this sense: it is possible to have a problem where a point converges to a different solution than all the points in its neighbourhood. But lots of problems have some reasonable smoothness in their convergence to a solution, and there you can do OK. A multi-restart sketch of this idea follows below.
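
One cheap way to probe the basins of attraction, as a sketch rather than a proof: run Baum-Welch from many random starting points and compare the likelihoods they converge to. If they all agree up to numerical noise, the sampled region plausibly lies in one domain of convergence; if they differ, a second local optimum demonstrably exists. This is a plain-NumPy toy (discrete emissions, one short sequence, no scaling of the forward/backward variables), with all data and sizes made up for illustration.

```python
import numpy as np

def baum_welch_loglik(obs, n_states, n_symbols, rng, n_iter=200):
    """Run Baum-Welch from a random start; return the final log-likelihood."""
    pi = rng.dirichlet(np.ones(n_states))
    A = rng.dirichlet(np.ones(n_states), size=n_states)
    B = rng.dirichlet(np.ones(n_symbols), size=n_states)
    T = len(obs)
    for _ in range(n_iter):
        # forward / backward passes (unscaled; fine for short sequences)
        alpha = np.zeros((T, n_states))
        beta = np.zeros((T, n_states))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[-1].sum()
        gamma = alpha * beta / likelihood              # state posteriors
        xi = np.zeros((n_states, n_states))            # expected transition counts
        for t in range(T - 1):
            xi += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / likelihood
        # re-estimation
        pi = gamma[0]
        A = xi / gamma[:-1].sum(axis=0)[:, None]
        for k in range(n_symbols):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return np.log(likelihood)

obs = np.array([0, 1, 1, 0, 2, 2, 1, 0, 0, 2, 1, 1])   # toy observation sequence
rng = np.random.default_rng(0)
scores = [baum_welch_loglik(obs, n_states=2, n_symbols=3, rng=rng) for _ in range(20)]
print(sorted(scores))   # distinct clusters of values would indicate multiple local optima
```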

phil_20686