I'm attempting to create an adversarially debiased BERT masked language model using `AdversarialBiasMitigator` alongside the AllenNLP pretrained MLM (from here: https://storage.googleapis.com/allennlp-public-models/bert-masked-lm-2020-10-07.tar.gz). The training data I am using is a variation of the WinoBias dataset, edited to work for masked language modelling. The data is a pandas DataFrame whose first column contains the sentences (which already include [CLS], [SEP], and [MASK] tokens) and whose second column contains the target (a gendered pronoun). I have edited masked_language_model_reader.py to read in my DataFrame, and I have edited the adversarial_bias_mitigator config file. The remaining files (adversarial_bias_mitigator.py and masked_language_model.py) I have kept the same, so I think the source of the error must be in either the config or the MLM dataset reader I have created.
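For concreteness, here is a toy sketch of the data layout (the rows below are illustrative placeholders, not my actual WinoBias edits):

```python
import pandas as pd

# Illustrative placeholder rows: column 0 holds the pre-masked sentences
# (already containing [CLS]/[SEP]/[MASK]), column 1 the pronoun target.
data = pd.DataFrame({
    "sentence": [
        "[CLS] The developer argued with the designer because [MASK] did not like the design . [SEP]",
        "[CLS] The nurse examined the farmer because [MASK] was thorough . [SEP]",
    ],
    "target": ["he", "she"],
})
data.to_csv("winobias_mlm.csv", index=False)
```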
The main changes I have made in the dataset reader are switching the tokenizer to `PretrainedTransformerTokenizer` and editing the `_read()` method to the following:
```python
@overrides
def _read(self, file_path: str):
    import pandas as pd

    data = pd.read_csv(file_path)
    # Column 0 holds the pre-masked sentences, column 1 holds the pronoun targets.
    sentences = data.iloc[:, 0].tolist()
    targets = data.iloc[:, 1].tolist()
    for sentence, target in zip(sentences, targets):
        tokens = self._tokenizer.tokenize(sentence)
        yield self.text_to_instance(sentence, tokens, [str(target)])
```
The rest I have kept virtually the same as the original masked_language_model_reader.py (https://github.com/allenai/allennlp-models/blob/aed4876f04a73c7effddf41b3164e1fb6fb6c275/allennlp_models/lm/masked_language_model_reader.py). I know the above isn't very pythonic, but it is the simplest approach I could think of, and my dataset isn't large (only 1000 sentences), so I don't think computing time is a concern.
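As a quick sanity check that the reader yields instances on its own (the class name below is a placeholder for my edited reader):

```python
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

# "MyMaskedLMReader" is a placeholder name for my edited reader class.
reader = MyMaskedLMReader(
    tokenizer=PretrainedTransformerTokenizer("bert-base-uncased")
)
for instance in reader.read("winobias_mlm.csv"):
    print(instance.fields.keys())  # these field names become the model's forward kwargs
    break
```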
When I run training from the CLI, the error below appears:
```
2021-10-02 10:52:20,351 - INFO - allennlp.training.gradient_descent_trainer - Training
loading instances: 1049it [00:00, 1764.63it/s]
0it [00:00, ?it/s]
2021-10-02 10:52:20,959 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
  File "/usr/local/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.7/dist-packages/allennlp/__main__.py", line 46, in run
    main(prog="allennlp")
  File "/usr/local/lib/python3.7/dist-packages/allennlp/commands/__init__.py", line 122, in main
    args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/allennlp/commands/train.py", line 121, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/usr/local/lib/python3.7/dist-packages/allennlp/commands/train.py", line 187, in train_model_from_file
    return_model=return_model,
  File "/usr/local/lib/python3.7/dist-packages/allennlp/commands/train.py", line 260, in train_model
    file_friendly_logging=file_friendly_logging,
  File "/usr/local/lib/python3.7/dist-packages/allennlp/commands/train.py", line 504, in _train_worker
    metrics = train_loop.run()
  File "/usr/local/lib/python3.7/dist-packages/allennlp/commands/train.py", line 577, in run
    return self.trainer.train()
  File "/usr/local/lib/python3.7/dist-packages/allennlp/training/gradient_descent_trainer.py", line 750, in train
    metrics, epoch = self._try_train()
  File "/usr/local/lib/python3.7/dist-packages/allennlp/training/gradient_descent_trainer.py", line 773, in _try_train
    train_metrics = self._train_epoch(epoch)
  File "/usr/local/lib/python3.7/dist-packages/allennlp/training/gradient_descent_trainer.py", line 490, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/usr/local/lib/python3.7/dist-packages/allennlp/training/gradient_descent_trainer.py", line 383, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/allennlp/fairness/adversarial_bias_mitigator.py", line 121, in forward
    predictor_output_dict = self.predictor.forward(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/allennlp_models/lm/models/masked_language_model.py", line 110, in forward
    embeddings = self._text_field_embedder(tokens)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 103, in forward
    token_vectors = embedder(**tensors, **forward_params_values)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
    result = forward_call(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'tokens'
```
I can't work out what the problem is. I don't understand why being passed 'tokens' would be a problem. I wonder whether the data isn't being correctly formatted into instances by my reader, but again, I can't see an obvious difference between my method and the original script. To try to fix the problem, I have also added this to the config:
"token_indexers": {
"bert": {
"type": "single_id"
}
as well as:
"sorting_keys":["tokens"]
I'm not sure whether either of these changes is related to the problem, or whether they are helping or making it worse!
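For reference, here is roughly where those fragments sit in my config (abbreviated; the reader's registered `type` and the `model_name` below are illustrative stand-ins rather than my exact values):

```jsonnet
// Abbreviated sketch of the relevant config sections; "masked_language_modeling"
// and "bert-base-uncased" are illustrative stand-ins for my actual values.
"dataset_reader": {
    "type": "masked_language_modeling",
    "tokenizer": {
        "type": "pretrained_transformer",
        "model_name": "bert-base-uncased"
    },
    "token_indexers": {
        "bert": {
            "type": "single_id"
        }
    }
},
"data_loader": {
    "batch_sampler": {
        "type": "bucket",
        "sorting_keys": ["tokens"]
    }
}
```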
Thanks for any help.