I am evaluating a model on the NarrativeQA story task, and the metrics reported for it are ROUGE, BLEU-1/4, and METEOR. What is the standard practice for evaluating on this dataset? Do I average the ROUGE score across the documents or per question?
from ignite.metrics import RougeL

# ROUGE-L F-score (alpha=0.5), taking the best score over multiple references
evaluator = RougeL(multiref="best", alpha=0.5)
evaluator.update(([predicted_response], [references]))
I'm using this right now and updating after every question; the metric is imported from PyTorch-Ignite.
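Concretely, the loop I mean looks roughly like the sketch below. The qa_pairs data and the whitespace split() tokenization are just placeholders for my actual answers and tokenizer, and as far as I understand Ignite's Rouge metrics, compute() returns the per-question scores averaged over all update() calls, i.e. a macro-average over questions rather than over documents.

from ignite.metrics import RougeL

# Placeholder data: one model answer plus the reference answers for each question.
qa_pairs = [
    ("mark antony", ["Mark Antony", "Antony"]),
    ("in rome", ["In Rome", "Rome"]),
]

evaluator = RougeL(multiref="best", alpha=0.5)

for predicted_response, reference_answers in qa_pairs:
    # Ignite's Rouge metrics expect token sequences, not raw strings.
    candidate = predicted_response.split()
    references = [ref.split() for ref in reference_answers]
    evaluator.update(([candidate], [references]))

# Averaged over every update() call, so effectively per question.
scores = evaluator.compute()
print(scores)

Is this per-question accumulation the right way to do it, or should I instead compute a score per document (story) and then average those document-level scores?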