This question concerns homogeneous discrete HMMs.
In a regular HMM, the probability of the current state depends only on the previous state, that is, Pr(S_t|S_1,S_2,...,S_(t-1)) = Pr(S_t|S_(t-1)), and the probability of an output observation depends only on the current state, that is, Pr(O_t|O_1,...,O_(t-1),S_1,...,S_t) = Pr(O_t|S_t). We can then use the Forward-Backward (Baum-Welch) algorithm to estimate the transition and emission probabilities.
My question is about the case where the current observation depends on both the current state and the previous observation, that is, Pr(O_t|O_1,...,O_(t-1),S_1,...,S_t) = Pr(O_t|O_(t-1),S_t). How can such a model be trained? I was thinking of using the same Baum-Welch algorithm, but instead of M emission probabilities per state (one for each of the M possible outputs), there would be MxM emission probabilities. That is, the emission probabilities for each state would form a 2D square matrix in which, for example, rows index the observation at the previous time step and columns index the observation at the current time step.
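To make the idea concrete, here is a minimal sketch of the forward pass I have in mind, written with NumPy. The names `forward_loglik`, `pi`, `A`, `B`, and `b0` are my own placeholders, not from any library. One assumption I had to add: the first observation has no predecessor, so it gets its own ordinary emission table `b0`. I am not claiming this is the correct or standard formulation, only what the proposal would look like in code.

```python
import numpy as np

def forward_loglik(obs, pi, A, B, b0):
    """Scaled forward pass where the emission depends on the previous observation.

    obs : sequence of ints in [0, M)
    pi  : (N,)      initial state distribution
    A   : (N, N)    A[i, j]    = Pr(S_t = j | S_(t-1) = i)
    B   : (N, M, M) B[j, k, l] = Pr(O_t = l | O_(t-1) = k, S_t = j)
    b0  : (N, M)    b0[j, l]   = Pr(O_1 = l | S_1 = j)  (assumed separate table)
    """
    # Initialisation uses b0 because there is no previous observation yet.
    alpha = pi * b0[:, obs[0]]
    c = alpha.sum()
    loglik = np.log(c)
    alpha = alpha / c
    for t in range(1, len(obs)):
        # Same recursion as the usual forward algorithm, except that the
        # emission term is looked up by (previous observation, current observation).
        alpha = (alpha @ A) * B[:, obs[t - 1], obs[t]]
        c = alpha.sum()          # scaling factor, avoids numerical underflow
        loglik += np.log(c)
        alpha = alpha / c
    return loglik
```

The only change from the standard recursion is the emission lookup `B[:, obs[t - 1], obs[t]]`; the backward pass would presumably change in the same way, and each row `B[j, k, :]` would need to sum to 1.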
Is this valid? Are there other approaches, or papers that address this problem? I searched for papers studying such a case but unfortunately did not find any.