The compute_metrics function can be passed into the Trainer so that it validates on the metrics you need, e.g.
from transformers import Trainer
trainer = Trainer(
model=model,
args=args,
train_dataset=train_dataset,
eval_dataset=validation_dataset,
tokenizer=tokenizer,
compute_metrics=compute_metrics
)
trainer.train()
I'm not sure if it works out of the box with the code that processes the train_dataset and validation_dataset in the course code https://huggingface.co/course/chapter7, but this one shows how the Trainer + compute_metrics work together: https://huggingface.co/course/chapter3/3
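For reference, the compute_metrics in that chapter3 example looks roughly like this (a sketch of the course's glue/mrpc text-classification setup; the exact course code may differ slightly):

import evaluate
import numpy as np

metric = evaluate.load("glue", "mrpc")  # reports accuracy and F1 for MRPC

def compute_metrics(eval_preds):
    # eval_preds unpacks into (logits, labels)
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)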
With those disclaimers out of the way, here goes...
Firstly, let's take a look at what the evaluate library is/does.
From https://huggingface.co/spaces/evaluate-metric/squad
from evaluate import load
squad_metric = load("squad")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
print(results)
[out]:
{'exact_match': 100.0, 'f1': 100.0}
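The run_qa.py script below switches to "squad_v2" when negative (unanswerable) examples are involved; that variant additionally expects a no_answer_probability per prediction (sketch adapted from https://huggingface.co/spaces/evaluate-metric/squad_v2):

from evaluate import load

squad_v2_metric = load("squad_v2")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22', 'no_answer_probability': 0.0}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_v2_metric.compute(predictions=predictions, references=references)
print(results)  # the result dict uses 'exact' and 'f1' (among other keys) rather than 'exact_match'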
Next, we take a look at what the compute_metrics argument in the Trainer expects.
From Line 600 https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py
metric = evaluate.load("squad_v2" if data_args.version_2_with_negative else "squad")
def compute_metrics(p: EvalPrediction):
    return metric.compute(predictions=p.predictions, references=p.label_ids)
# Initialize our Trainer
trainer = QuestionAnsweringTrainer(
model=model,
args=training_args,
train_dataset=train_dataset if training_args.do_train else None,
eval_dataset=eval_dataset if training_args.do_eval else None,
eval_examples=eval_examples if training_args.do_eval else None,
tokenizer=tokenizer,
data_collator=data_collator,
post_process_function=post_processing_function,
compute_metrics=compute_metrics,
)
The compute_metrics argument in the QuestionAnsweringTrainer expects a function that:
- [in]: takes an EvalPrediction object as input
- [out]: returns a dict of key-value pairs, where each key is the name of an output metric (a string) and each value is expected to be a floating point number (see the sketch below)
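In other words, any function with this shape will do. Here's a minimal sketch with a made-up accuracy metric rather than the real SQuAD one (EvalPrediction itself is explained just below):

import numpy as np
from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction):
    # [in]: an EvalPrediction; [out]: a dict mapping metric names (str) to floats
    preds = np.argmax(p.predictions, axis=-1)
    return {"accuracy": float((preds == p.label_ids).mean())}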
Un momento! (Wait a minute!) What are these QuestionAnsweringTrainer and EvalPrediction objects?
Q: Why are you not using the normal Trainer object?
A: The QuestionAnsweringTrainer is a specific subclass of the Trainer object that is used for the QA task. If you're going to train a model to evaluate on the SQuAD dataset, then the QuestionAnsweringTrainer is the most appropriate Trainer object to use.
[Suggestion]: The Hugging Face devs and dev advocates should probably add some notes on the QuestionAnsweringTrainer object to https://huggingface.co/course/chapter7/7?fw=pt
Q: What is this EvalPrediction object then?
A: Officially, I guess it's this: https://discuss.huggingface.co/t/what-does-evalprediction-predictions-contain-exactly/1691/5
If we look at the docs (https://huggingface.co/docs/transformers/internal/trainer_utils) and the code, it looks like the object is a custom container class that holds the (i) predictions, (ii) label_ids and (iii) inputs arrays (np.ndarray). These are what the model's inference step needs to return in order for compute_metrics to work as expected.
# imports needed if you want to run this excerpt standalone
from typing import Optional, Tuple, Union

import numpy as np


class EvalPrediction:
    """
    Evaluation output (always contains labels), to be used to compute metrics.

    Parameters:
        predictions (`np.ndarray`): Predictions of the model.
        label_ids (`np.ndarray`): Targets to be matched.
        inputs (`np.ndarray`, *optional*)
    """

    def __init__(
        self,
        predictions: Union[np.ndarray, Tuple[np.ndarray]],
        label_ids: Union[np.ndarray, Tuple[np.ndarray]],
        inputs: Optional[Union[np.ndarray, Tuple[np.ndarray]]] = None,
    ):
        self.predictions = predictions
        self.label_ids = label_ids
        self.inputs = inputs

    def __iter__(self):
        if self.inputs is not None:
            return iter((self.predictions, self.label_ids, self.inputs))
        else:
            return iter((self.predictions, self.label_ids))

    def __getitem__(self, idx):
        if idx == 0:
            return self.predictions
        elif idx == 1:
            return self.label_ids
        elif idx == 2:
            return self.inputs
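So it is essentially a small tuple-like container; a quick illustration with toy arrays (not real model outputs):

import numpy as np
from transformers import EvalPrediction

p = EvalPrediction(predictions=np.array([[0.2, 0.8]]), label_ids=np.array([1]))

preds, labels = p              # __iter__ yields (predictions, label_ids) when inputs is None
print(p[0] is p.predictions)   # True, via __getitem__
print(p[1] is p.label_ids)     # True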
Hey, you still haven't answered the question of how I can plug evaluate.load('squad') directly into the compute_metrics argument!
Yes, for now you can't use it directly, but it's a simple wrapper.
Step 1: Make sure the model you want to use outputs the required EvalPrediction object that contains predictions and label_ids.
If you're using most of the models supported for QA in Hugging Face's transformers library, they should already output the expected EvalPrediction. Otherwise, take a look at the models supported by https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering
Step 2: Since the model inference outputs an EvalPrediction but compute_metrics expects a dictionary output, you have to wrap the metric loaded with evaluate.load in a function.
E.g.
metric = evaluate.load("squad_v2" if data_args.version_2_with_negative else "squad")
def compute_metrics(p: EvalPrediction):
    return metric.compute(predictions=p.predictions, references=p.label_ids)
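To sanity-check that wrapper outside the Trainer, you can feed it SQuAD-formatted predictions/references yourself; inside the QuestionAnsweringTrainer, it's the post_processing_function that turns raw model outputs into this format before compute_metrics is called. A sketch:

import evaluate
from transformers import EvalPrediction

metric = evaluate.load("squad")

def compute_metrics(p: EvalPrediction):
    return metric.compute(predictions=p.predictions, references=p.label_ids)

# EvalPrediction does no type checking, so SQuAD-format lists of dicts pass straight through,
# which is what run_qa.py's post_processing_function builds.
p = EvalPrediction(
    predictions=[{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}],
    label_ids=[{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}],
)
print(compute_metrics(p))  # {'exact_match': 100.0, 'f1': 100.0}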
Q: Do we really always need to write that wrapper function?
A: For now, yes. It is by design not directly integrated with the outputs of the evaluate metrics, to give the different metrics' developers freedom to define how they want their inputs/outputs to look. But there might be hope of making compute_metrics more integrated with evaluate metrics if someone picks this feature request up! https://discuss.huggingface.co/t/feature-request-adding-default-compute-metrics-to-popular-evaluate-metrics/33909/3