I have recently read about BERT and want to use BertForMaskedLM for the fill-mask task. I know about the BERT architecture. Also, as far as I know, BertForMaskedLM is built from BERT with a language modeling head on top, but I have no idea what a language modeling head means here. Can anyone give me a brief explanation?
2 Answers
BertForMaskedLM, as you have correctly understood, uses a Language Modeling (LM) head.
Generally, as well as in this case, the LM head is a linear layer with an input dimension equal to the hidden state size (768 for BERT-base) and an output dimension equal to the vocabulary size. Thus, it maps the hidden state output of the BERT model to a score for each token in the vocabulary. The loss is calculated from the score obtained for a given token with respect to the target token.
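To make that concrete, here is a minimal sketch of such a head in PyTorch, assuming BERT-base dimensions (hidden size 768, vocabulary size 30,522). Note this is a simplification: the actual Hugging Face head also applies a dense + GELU + LayerNorm transform before the final projection, and the linear layer's outputs are raw logits fed to a cross-entropy loss.

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 768, 30522  # BERT-base values (assumed)

# The LM head: one linear layer from hidden states to vocabulary scores.
lm_head = nn.Linear(hidden_size, vocab_size)

# Dummy hidden states for a batch of 1 sequence with 10 tokens.
hidden_states = torch.randn(1, 10, hidden_size)
logits = lm_head(hidden_states)  # shape: (1, 10, vocab_size)

# MLM loss: cross-entropy between predicted scores and target token ids.
targets = torch.randint(0, vocab_size, (1, 10))
loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
print(logits.shape, loss.item())
```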

- Given the vocab size is 30,000, is the linear layer here a linear transformation like Ax + b, with x of shape 768 × 1 and A of shape 30,000 × 768? Is there any activation function here? – Đặng Huy Apr 14 '21 at 19:32
- Yes, you have got it right. The Hugging Face source code shows no activation function after the linear layer. – Ashwin Geet D'Sa Apr 14 '21 at 19:39
- Thank you very much. By the way, do you know of any model developed from BERT that is used for the task (*) of filling in blanks in a passage? I know BERT is pre-trained on two tasks: masked LM (**) and NSP. Are task ** and task * the same? Or is there a way (any project, paper, ...) to fit task *? (By 'passage' I mean a possibly long passage with several blanks to fill in.) Thanks in advance. – Đặng Huy Apr 14 '21 at 19:50
- The BERT model is good enough if you just have to fill in a single word (masking a single token). However, if you want multiple words to be filled in at a single masked position, you can take a look at BART, which is trained on a text-infilling objective. – Ashwin Geet D'Sa Apr 14 '21 at 19:52
- https://huggingface.co/transformers/model_doc/bart.html#mask-filling – Ashwin Geet D'Sa Apr 14 '21 at 19:53
- I mean I need to fill in several masked positions in a long passage (and assume that each masked position fits a single word). It is like: "I want to [MASK] at an expensive restaurant. The [MASK] there is often delicious." – Đặng Huy Apr 14 '21 at 20:06
- BART should be suitable for this task. – Ashwin Geet D'Sa Apr 14 '21 at 20:06
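Following up on the thread above: for the single-word-per-blank case, BERT itself can fill several blanks in one pass, since each [MASK] position is predicted independently from the same forward pass. A minimal sketch using the Hugging Face transformers API (the bert-base-uncased checkpoint is assumed; the exact predicted tokens will depend on the checkpoint):

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = ("I want to [MASK] at an expensive restaurant. "
        "The [MASK] there is often delicious.")
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Find every [MASK] position; each is predicted independently.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
for pos in mask_positions:
    predicted_id = logits[0, pos].argmax(-1).item()
    print(pos.item(), tokenizer.decode([predicted_id]))
```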
In addition to @Ashwin Geet D'Sa's answer, here is Hugging Face's definition of the LM head:
The model head refers to the last layer of a neural network that accepts the raw hidden states and projects them onto a different dimension.
You can find Hugging Face's definitions of other terms on this page: https://huggingface.co/docs/transformers/glossary
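For this particular model, the head is exposed as a separate submodule, so you can inspect it directly. A quick sketch (assuming the standard bert-base-uncased checkpoint; in the Hugging Face BERT implementation the MLM head lives under the .cls attribute):

```python
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# The body: the BERT encoder that produces the raw hidden states.
print(type(model.bert).__name__)  # BertModel

# The head: projects hidden states (768) onto the vocabulary.
# Here it is a dense + GELU + LayerNorm transform followed by the
# final Linear decoder layer.
print(model.cls)
```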

- Please edit your answer to include the relevant parts from the linked HuggingFace page -- that way, if the link ever breaks, people are still able to use your answer as a potential source of help. – Kyle F Hartzenberg Feb 16 '23 at 00:08
- That's slightly better, but if that webpage ceases to exist in the future, people will no longer be able to click on the link to get the information. Instead, best practice is to quote the relevant section(s) in your answer on Stack Overflow. – Kyle F Hartzenberg Feb 18 '23 at 07:16