
I am evaluating a model on the NarrativeQA story task, and the metrics reported for it are ROUGE, BLEU-1/4, and METEOR. What is the standard practice for evaluating on this dataset? Do I average the ROUGE score across documents or per question?

from ignite.metrics import RougeL

evaluator = RougeL(multiref="best", alpha=0.5)
evaluator.update(([predicted_response], [references]))  # one candidate, its list of references

I'm using this right now and updating after every question; the metric is imported from PyTorch-Ignite.
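
For reference, here is a minimal sketch of how I'm accumulating the score over the whole evaluation set (the `predictions` and `references_per_question` lists and the whitespace `split()` tokenization are just placeholders for my actual pipeline):

from ignite.metrics import RougeL

# hypothetical data: one predicted answer string per question,
# and a list of reference answer strings per question
predictions = ["the prince hides the ring", "she sails to the island"]
references_per_question = [
    ["the prince hides the ring in the garden", "he hides the ring"],
    ["she sails to the island", "she travels there by boat"],
]

evaluator = RougeL(multiref="best", alpha=0.5)
for pred, refs in zip(predictions, references_per_question):
    # ignite expects token sequences: a list of candidates and,
    # for each candidate, a list of tokenized references
    evaluator.update(([pred.split()], [[r.split() for r in refs]]))

# compute() returns precision/recall/F scores averaged over all updated questions
print(evaluator.compute())

So the score I end up with is an average over questions, not over documents, which is what prompted my question above.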
