
I already fine-tuned a BERT model (with the Hugging Face library) for a classification task to predict a post category with two classes (1 and 0, for example). But I would need to retrieve the "relevant tokens" for the documents that are predicted as category 1 (for example). I know that I can use the traditional TF-IDF approach once I have all the posts labeled as 1 (for example) by my BERT model. But I have the following question: is it possible to do the same task with the architecture of the fine-tuned BERT model? I mean, access the last layer of the encoder (the prediction layer) and, with the attention mechanism, get the "relevant" tokens that make the prediction be 1 (for example)? Is it possible to do that? Does someone know a tutorial or something similar?

  • You might want to get the self-attention matrices and find out what tokens contributed the most to the `CLS` token – Jindřich Mar 30 '21 at 08:20
  • Thank you for the response. I need to do that for the 12 heads of the encoder and find out what tokens contributed the most to [CLS] in each one. – Nicolas Montes Mar 31 '21 at 12:49
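A minimal sketch of the attention-based idea from the comments, assuming a fine-tuned Hugging Face `BertForSequenceClassification` checkpoint (the path below is a placeholder, not from the question): call the model with `output_attentions=True`, take the last layer's attention row for the `[CLS]` position, and average over the 12 heads to rank tokens.

```python
import torch

def cls_attention_scores(attentions, head_average=True):
    """Given the tuple of per-layer attention tensors returned by a
    Hugging Face model called with output_attentions=True, return the
    attention that [CLS] (position 0) pays to each token in the last
    encoder layer.

    attentions: tuple of tensors, each (batch, heads, seq_len, seq_len)
    Returns (batch, seq_len) if head_average, else (batch, heads, seq_len).
    """
    last_layer = attentions[-1]          # (batch, heads, seq_len, seq_len)
    cls_row = last_layer[:, :, 0, :]     # attention FROM [CLS] to every token
    return cls_row.mean(dim=1) if head_average else cls_row

# Hypothetical usage (checkpoint path is an assumption):
# from transformers import BertTokenizerFast, BertForSequenceClassification
# tok = BertTokenizerFast.from_pretrained("path/to/finetuned-bert")
# model = BertForSequenceClassification.from_pretrained(
#     "path/to/finetuned-bert", output_attentions=True)
# enc = tok("example post text", return_tensors="pt")
# with torch.no_grad():
#     out = model(**enc)
# scores = cls_attention_scores(out.attentions)[0]
# top = scores.topk(5).indices
# print(tok.convert_ids_to_tokens(enc["input_ids"][0][top]))
```

With `head_average=False` you get one score vector per head, which matches the per-head inspection mentioned in the comment above. Note that raw attention weights are only a rough proxy for token importance.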

1 Answer


With transformer models, you can perform some explainability analysis, which is probably what you want. I would recommend looking at the transformers section of the SHAP documentation. You just have to wrap your model in a SHAP explainer, like this:

import shap

explainer = shap.Explainer(model)
shap_values = explainer(texts)  # texts: the raw posts you want explained

There is another option if you have labels on which tokens are relevant, namely training a token classification model. But that would require retraining and labels for each token.
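If you did go the token-classification route, one practical detail is aligning per-word relevance labels with BERT's subword tokens. A small sketch of that alignment, assuming a fast tokenizer's `word_ids()` output (the function name and `-100` ignore index follow the usual Hugging Face convention):

```python
def align_labels_to_tokens(word_ids, word_labels, ignore_index=-100):
    """Map per-word relevance labels (1 = relevant, 0 = not) onto the
    subword tokens produced by a fast tokenizer.

    word_ids: output of tokenizer(...).word_ids(); None marks special
    tokens such as [CLS] and [SEP], which get ignore_index so the loss
    skips them. Every subword inherits its word's label.
    """
    return [ignore_index if wid is None else word_labels[wid]
            for wid in word_ids]

# e.g. "[CLS] rele ##vant word [SEP]" with word labels [1, 0]:
# align_labels_to_tokens([None, 0, 0, 1, None], [1, 0])
# -> [-100, 1, 1, 0, -100]
```

These aligned labels can then be fed to something like `BertForTokenClassification`, but as the answer notes, you would still need token-level annotations and a retraining run.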

Pieter Delobelle