
I have a sequence tagging model that predicts a tag for every word in an input sequence (essentially named entity recognition). Model structure: Embeddings layer → BiLSTM → CRF

So essentially the BiLSTM learns non-linear combinations of features based on the token embeddings and uses these to output the unnormalized scores for every possible tag at every timestep. The CRF classifier then learns how to choose the best tag sequence given this information.

My CRF is an instance of the keras_contrib crf, which implements a linear chain CRF (as does tensorflow.contrib.crf). Thus it considers tag transition probabilities from one tag to the next but doesn't maximize the global tag sequence (which a general CRF would).
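For reference, the model is built roughly like the sketch below (standard keras_contrib CRF usage as far as I can tell; all sizes are placeholders rather than my actual values):

    # Rough sketch of the Embeddings -> BiLSTM -> CRF model.
    # All hyperparameter values below are placeholders, not my real ones.
    from keras.models import Sequential
    from keras.layers import Embedding, Bidirectional, LSTM
    from keras_contrib.layers import CRF

    VOCAB_SIZE = 10000   # placeholder vocabulary size
    EMBED_DIM = 100      # placeholder embedding dimension
    NUM_TAGS = 9         # placeholder tag-set size
    MAX_LEN = 75         # placeholder padded sequence length

    model = Sequential()
    model.add(Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN, mask_zero=True))
    model.add(Bidirectional(LSTM(64, return_sequences=True)))
    crf = CRF(NUM_TAGS)  # activation defaults to 'linear'
    model.add(crf)
    model.compile(optimizer='adam', loss=crf.loss_function, metrics=[crf.accuracy])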

The default activation function is 'linear'. My question is, why is it linear, and what difference would other activations make?

I.e., is it linear because its decisions are essentially reduced to predicting the likelihood of tag y_t given tag y_{t-1} (which could possibly be framed as a linear regression problem)? Or is it linear for some other reason, e.g. to give the user flexibility to apply the CRF wherever they like and choose the most appropriate activation function themselves?

For my problem, should I actually be using softmax activation? I already have a separate model with a similar but different structure: Embeddings → BiLSTM → Dense with softmax. So if I were to use softmax activation in the linear-chain CRF (i.e. in the Embeddings layer → BiLSTM → CRF model I mentioned at the start of this post), it sounds like it would be nearly identical to that separate model, except for being able to use transition probabilities from y_{t-1} to y_t.
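For comparison, that separate model looks roughly like this (same placeholder sizes as in the sketch above):

    # Sketch of the separate softmax model, with the same placeholder sizes.
    from keras.models import Sequential
    from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

    VOCAB_SIZE, EMBED_DIM, NUM_TAGS, MAX_LEN = 10000, 100, 9, 75  # placeholders

    model = Sequential()
    model.add(Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN, mask_zero=True))
    model.add(Bidirectional(LSTM(64, return_sequences=True)))
    model.add(TimeDistributed(Dense(NUM_TAGS, activation='softmax')))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])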

KMunro

1 Answer


When using Embeddings → BiLSTM → Dense + softmax, you implicitly assume that the likelihoods of the tags are conditionally independent given the RNN states. This can lead to the label bias problem. The distribution over the tags always needs to sum up to one, so there is no way to express that the model is uncertain about a particular tag: it makes an independent prediction at every position.

In a CRF, this gets fixed by the transition scores that the CRF learns in addition to scoring the hidden states. The score for a tag can be an arbitrary real number. If the model is uncertain about a tag, all of its scores can be low (because they do not have to sum up to one), and the predictions for neighboring tags can help in choosing which tag to pick via the transition scores. The likelihood of the tags is not factorized over the sequence but computed for the entire sequence of tags using a dynamic programming algorithm.
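To make that concrete, here is a toy sketch (made-up numbers, not the keras_contrib internals) of how a linear-chain CRF scores one candidate tag sequence:

    # Toy illustration. The unnormalized score of a whole tag sequence is the sum
    # of per-position emission scores from the BiLSTM plus the learned transition
    # scores between consecutive tags.
    import numpy as np

    emissions = np.array([[ 2.1, -0.3,  0.4],    # BiLSTM scores for 3 tags
                          [ 0.2,  1.5, -1.0],    # at each of 4 timesteps
                          [-0.5,  0.1,  2.2],
                          [ 1.8, -0.2,  0.3]])
    transitions = np.array([[ 0.5, -1.0,  0.2],  # learned tag-to-tag
                            [ 0.1,  0.3, -0.8],  # transition scores
                            [-0.4,  0.6,  0.9]])
    tags = [0, 1, 2, 0]                          # one candidate tag sequence

    score = sum(emissions[t, tags[t]] for t in range(len(tags)))
    score += sum(transitions[tags[t - 1], tags[t]] for t in range(1, len(tags)))

    # The CRF normalizes exp(score) over all possible tag sequences (computed
    # with the forward algorithm), not over the tags at each position separately.
    print(score)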

If you used an activation function with a limited range, it would limit what scores can be assigned to the tags and might make the CRF less effective. If you think you need a non-linearity after the RNN, you can add one dense layer with an activation of your choice and then do the linear projection.
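As a rough sketch of that suggestion (placeholder sizes, assuming the keras_contrib CRF with its default linear activation):

    # An extra Dense layer with a non-linearity between the BiLSTM and the CRF,
    # while the CRF itself keeps its default linear activation. Sizes are placeholders.
    from keras.models import Sequential
    from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense
    from keras_contrib.layers import CRF

    VOCAB_SIZE, EMBED_DIM, NUM_TAGS, MAX_LEN = 10000, 100, 9, 75  # placeholders

    model = Sequential()
    model.add(Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN, mask_zero=True))
    model.add(Bidirectional(LSTM(64, return_sequences=True)))
    model.add(TimeDistributed(Dense(32, activation='tanh')))  # optional non-linearity
    crf = CRF(NUM_TAGS)   # linear projection to tag scores happens inside the CRF layer
    model.add(crf)
    model.compile(optimizer='adam', loss=crf.loss_function, metrics=[crf.accuracy])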

Jindřich
  • Thanks @Jindřich! "The likelihood of the tags is not factorized over the sequence but computed for the entire sequence of tags" - is that still true for linear chain CRFs, since they only consider one transition? I thought no. "If you think you (need) a non-linearity after the RNN, you can add one dense layer with activation of your choice and then do the linear projection." So should a CRF (as final classifier) always have a linear transformation? If so, the crf's default activation of 'linear' makes sense. Any resources/tips on if/why an extra dense layer is a good idea? – KMunro Oct 15 '19 at 12:52
  • It holds even for linear-chain CRF. The original CRF paper is only about the linear-chain CRF; see eq. 1 here: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers – Jindřich Oct 15 '19 at 15:03
  • I would probably first try without the extra layer. You can check some papers on named-entity recognition with LSTM-CRF to see whether they use an extra layer or not. – Jindřich Oct 15 '19 at 15:04