I have a sequence tagging model that predicts a tag for every word in an input sequence (essentially named entity recognition). Model structure: Embeddings layer → BiLSTM → CRF
So essentially the BiLSTM learns non-linear combinations of features based on the token embeddings and uses these to output the unnormalized scores for every possible tag at every timestep. The CRF classifier then learns how to choose the best tag sequence given this information.
My CRF is an instance of the keras_contrib CRF layer, which implements a linear-chain CRF (as does tensorflow.contrib.crf). Thus it models transition scores between adjacent tags, but it can't capture longer-range dependencies across the whole tag sequence (which a more general CRF could).
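For reference, here's roughly how I set it up; vocab_size, embedding_dim, hidden_units and num_tags are just placeholders, not my actual values:

```python
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM
from keras_contrib.layers import CRF

vocab_size, embedding_dim, hidden_units, num_tags = 10000, 100, 128, 9  # placeholders

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, mask_zero=True))
model.add(Bidirectional(LSTM(hidden_units, return_sequences=True)))
crf = CRF(num_tags, activation='linear')  # 'linear' is the default activation
model.add(crf)
model.compile(optimizer='adam', loss=crf.loss_function, metrics=[crf.accuracy])
```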
The default activation function is 'linear'. My question is, why is it linear, and what difference would other activations make?
I.e., is it linear because its decisions essentially reduce to predicting the likelihood of tag yt given tag yt-1 (which could possibly be framed as a linear regression problem)? Or is it linear for some other reason, e.g. to give the user the flexibility to apply the CRF wherever they like and choose the most appropriate activation function themselves?
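To make what I mean concrete, here's a toy sketch (my own code, not keras_contrib's) of where I understand the activation to sit in a linear-chain CRF's scoring: it gets applied to the per-timestep emission scores coming out of the BiLSTM, while the learned transition matrix adds the pairwise scores:

```python
import numpy as np

def sequence_score(unary_scores, transitions, tags, activation=lambda x: x):
    # unary_scores: (timesteps, num_tags) raw scores from the BiLSTM projection
    # transitions:  (num_tags, num_tags) learned transition scores
    # tags:         gold tag indices, one per timestep
    emissions = activation(unary_scores)  # 'linear' activation = identity, i.e. raw scores
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score  # start/end boundary scores omitted for simplicity

# e.g. 4 timesteps, 3 tags
scores = np.random.randn(4, 3)
trans = np.random.randn(3, 3)
print(sequence_score(scores, trans, [0, 2, 1, 1]))
```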
For my problem, should I actually be using a softmax activation? I already have a separate model with a similar but different structure: Embeddings → BiLSTM → Dense with softmax. So if I were to use a softmax activation in the linear-chain CRF (i.e. in the Embeddings layer → BiLSTM → CRF model I mentioned at the start of this post), it sounds like it would be nearly identical to that separate model, except for being able to use transition probabilities from yt-1 to yt.
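For comparison, that separate softmax model looks roughly like this (same placeholder sizes as above):

```python
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

vocab_size, embedding_dim, hidden_units, num_tags = 10000, 100, 128, 9  # placeholders

softmax_model = Sequential()
softmax_model.add(Embedding(vocab_size, embedding_dim, mask_zero=True))
softmax_model.add(Bidirectional(LSTM(hidden_units, return_sequences=True)))
softmax_model.add(TimeDistributed(Dense(num_tags, activation='softmax')))
softmax_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```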