So I am trying to implement the Baum-Welch algorithm to do part-of-speech tagging for practice. However, I am confused about using a hidden Markov model vs. a plain Markov model, since it seems that you lose context moving from state to state: the output of the last state isn't taken into account when moving to the next state. Is it just to save memory?
edit: added an example for clarity
For example, if two states A and B each output a 1 or a 2, there are 4 state transitions and 2 observation possibilities for each state, which can be combined into 8 transitions if you mix each pair of incoming transitions with its state's observation probabilities. But my hang-up is: why not instead train a machine with four states {(A,1),(B,1),(A,2),(B,2)} and 16 transitions? I am quite new to NLP, so I'm wondering if I am unaware of some algorithmic redundancy that's hard to see without heavier math.
It seems that one loses the information of what the transitions will be depending on whether the last A emitted a 1 vs. a 2. But I'm wondering if the training algorithm might not need that information.
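To make the comparison concrete, here is a rough sketch (NumPy, with made-up probabilities, just to show the parameter counts) of the two parameterizations I have in mind:

```python
# Rough sketch of the two parameterizations I'm comparing (probabilities are made up).
import numpy as np

# --- HMM: 2 hidden states {A, B}, each emitting symbol 1 or 2 ---
trans = np.array([[0.7, 0.3],    # P(next state | current state): 2x2 = 4 numbers
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],     # P(symbol | state): 2x2 = 4 numbers
                 [0.2, 0.8]])

# "Mixing" them gives P(next state, next symbol | current state):
# 2 rows x 4 columns = 8 entries, but each row is forced to ignore
# which symbol the current state just emitted.
mixed = np.einsum('ij,jk->ijk', trans, emit).reshape(2, 4)
print(mixed)                     # each row sums to 1

# --- Plain Markov chain over product states {(A,1),(B,1),(A,2),(B,2)} ---
# 4x4 = 16 free transition numbers; each row CAN depend on the last symbol.
product_trans = np.full((4, 4), 0.25)
print(product_trans)
```

The factored form only needs 8 numbers but forces the next transition to be independent of the last emitted symbol; the product-state chain needs 16 and keeps that dependence, which is exactly the information I'm asking about.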
https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm
Thanks for the information.