Are the feature vectors for tokens generated incrementally, as the labels for previous tokens are obtained?
No, a CRF optimizes the loss jointly; there is no left-to-right processing as in MEMM, where you predict a label and then use it for the next prediction. A CRF takes into account all possible previous labels and finds the most likely sequence as a whole.
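To make "finds the most likely sequence as a whole" concrete, here is a minimal sketch of joint decoding on a toy two-token, two-label problem. All labels, scores, and weights below are made up for illustration; a real CRF learns them and uses Viterbi instead of brute-force enumeration, but the argmax is the same: over whole sequences, not token by token.

```python
import itertools

# Hypothetical label set and scores, purely for illustration.
LABELS = ["O", "NAME"]

# state_scores[t][label]: how well `label` fits token t (from state features)
state_scores = [
    {"O": 0.1, "NAME": 2.0},   # token 0 looks like a name
    {"O": 1.5, "NAME": 0.2},   # token 1 looks like a non-name
]

# transition_scores[(prev, cur)]: weights for label-to-label transitions
transition_scores = {
    ("O", "O"): 0.5, ("O", "NAME"): 0.0,
    ("NAME", "O"): 0.3, ("NAME", "NAME"): -0.5,
}

def sequence_score(seq):
    """Total score of one complete label sequence: state + transition terms."""
    score = sum(state_scores[t][label] for t, label in enumerate(seq))
    score += sum(transition_scores[(a, b)] for a, b in zip(seq, seq[1:]))
    return score

# CRF-style decoding: pick the best *whole* sequence, so every choice of
# previous label is considered, not just a single greedy prediction.
best = max(itertools.product(LABELS, repeat=len(state_scores)),
           key=sequence_score)
print(best)  # → ('NAME', 'O')
```

A greedy left-to-right decoder (MEMM-style) would commit to a label at each step; here nothing is committed until all sequences have been compared.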
Have I misunderstood that CRF allows using the predicted previous label as a feature for the next token?
CRF allows using previous labels as features; most likely this already happens automatically in your case. I don't have experience with Mallet, but most out-of-the-box linear-chain CRF packages provide two kinds of features:
- "state features". These are the per-token features the user defines; they can use any information from the input sequence (e.g. the current and previous token, the last 3 letters of the current token, etc.). Each state feature is usually conditioned on the current output label.
- "transition features". In the most common first-order linear-chain CRF, this is the current label conditioned on the previous label. These features are usually generated automatically, for all possible label pairs.
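The two kinds can be sketched like this (the dict-per-token style below follows the crfsuite convention; the function name and feature keys are just illustrative, not any particular package's API):

```python
def token2features(tokens, i):
    """State features for token i: may look at any part of the input,
    but only at the input, never at predicted labels."""
    word = tokens[i]
    features = {
        "word.lower": word.lower(),
        "suffix3": word[-3:],          # last 3 letters of the current token
        "is_title": word.istitle(),
    }
    if i > 0:
        # previous *token* (input), not previous *label* (output)
        features["prev_word.lower"] = tokens[i - 1].lower()
    return features

# Transition features: one per (previous label, current label) pair,
# typically generated automatically by the package for all label pairs.
labels = ["O", "NAME"]
transition_features = [(prev, cur) for prev in labels for cur in labels]

tokens = ["John", "runs"]
print(token2features(tokens, 1))
print(transition_features)
```

The point of the split: you write state features by hand, while the dependence on the previous label comes in through the automatically generated transition features, so you usually don't need to add the previous label as a feature yourself.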
Sometimes you can also condition the transition features on user-defined features which depend on the current token. It seems this is what you're looking for, but I'm not sure. Some packages implement this (e.g. wapiti), some don't (e.g. crfsuite). Some packages allow you to define arbitrary CRFs with arbitrary features (e.g. pystruct, factorie, GRMM(?)). Sorry, I don't have experience with Mallet, so this isn't really a full answer :)