So I am trying to implement the Baum-Welch algorithm to do part-of-speech tagging for practice. However, I am confused about using a hidden Markov model vs. a plain Markov model, since it seems that you lose context moving from state to state: the output of the last state isn't taken into account when moving to the next state. Is it just to save memory?
edit: added an example for clarity
For example, if two states A and B each output a 1 or a 2, there are 4 state transitions and 2 observation possibilities for each state, which can be combined into 8 transitions if you mix each pair of incoming transitions with its state's observation probabilities. But my hang-up is: why not instead train a machine with four states {(A,1),(B,1),(A,2),(B,2)} and 16 transitions? I am quite new to NLP, so I'm wondering if I am unaware of some algorithmic redundancy that's hard to see without heavier math.
It seems that one loses the information of what the transitions will be depending on whether the last A emitted a 1 vs. a 2. But I'm wondering if the training algorithm might not need that information.
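To make the comparison concrete, here is a rough sketch (NumPy, with made-up probabilities, just to show the parameter counts) of the two parameterizations I have in mind:

```python
# Rough sketch of the two parameterizations I'm comparing (probabilities are made up).
import numpy as np

# --- HMM: 2 hidden states {A, B}, each emitting symbol 1 or 2 ---
trans = np.array([[0.7, 0.3],    # P(next state | current state): 2x2 = 4 numbers
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],     # P(symbol | state): 2x2 = 4 numbers
                 [0.2, 0.8]])

# "Mixing" them gives P(next state, next symbol | current state):
# 2 rows x 4 columns = 8 entries, but each row is forced to ignore
# which symbol the current state just emitted.
mixed = np.einsum('ij,jk->ijk', trans, emit).reshape(2, 4)
print(mixed)                     # each row sums to 1

# --- Plain Markov chain over product states {(A,1),(B,1),(A,2),(B,2)} ---
# 4x4 = 16 free transition numbers; each row CAN depend on the last symbol.
product_trans = np.full((4, 4), 0.25)
print(product_trans)
```

The factored form only needs 8 numbers but forces the next transition to be independent of the last emitted symbol; the product-state chain needs 16 and keeps that dependence, which is exactly the information I'm asking about.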
https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm
Thanks for the information.