
I am trying to run several summarization metrics (ROUGE, METEOR, BLEU, CIDEr) on the TAC2010 dataset. I used a Python package called nlg-eval (https://github.com/Maluuba/nlg-eval) to do this. I tried both APIs listed on the GitHub page: the functional API for the entire corpus

from nlgeval import compute_metrics
metrics_dict = compute_metrics(hypothesis='examples/hyp.txt',
                               references=['examples/ref1.txt', 'examples/ref2.txt'])
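
For concreteness, here is a rough sketch of how the line-aligned input files for compute_metrics could be prepared from TAC2010 (the actual TAC2010 loading is omitted; system_summaries and reference_sets below are made-up placeholders, and nlg-eval matches hypotheses to references by line number):

from nlgeval import compute_metrics

# Made-up placeholder data: in practice these come from the TAC2010 files.
system_summaries = [
    "First system summary. It has two sentences.",
    "Second system summary.",
]
reference_sets = [  # four human reference summaries per topic
    ["Topic 1, ref A.", "Topic 1, ref B.", "Topic 1, ref C.", "Topic 1, ref D."],
    ["Topic 2, ref A.", "Topic 2, ref B.", "Topic 2, ref C.", "Topic 2, ref D."],
]

# One summary per line; internal newlines are flattened so each summary stays on one line.
with open('hyp.txt', 'w') as f:
    for summ in system_summaries:
        f.write(summ.replace('\n', ' ') + '\n')

for i in range(4):
    with open(f'ref{i + 1}.txt', 'w') as f:
        for refs in reference_sets:
            f.write(refs[i].replace('\n', ' ') + '\n')

metrics_dict = compute_metrics(hypothesis='hyp.txt',
                               references=[f'ref{i + 1}.txt' for i in range(4)])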

the functional API for a single sentence (I stacked all the sentences into one sentence for each hypothesis summary and for each of its four references)

from nlgeval import compute_individual_metrics
metrics_dict = compute_individual_metrics(references, hypothesis)
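
Similarly, a minimal sketch of the per-summary call with everything joined into one string (dummy strings here; TAC2010 provides four reference summaries per topic):

from nlgeval import compute_individual_metrics

# Made-up example: join the sentences of one system summary into a single string,
# and do the same for each of its four reference summaries.
hypothesis = " ".join(["First sentence of the system summary.",
                       "Second sentence of the system summary."])
references = [" ".join(["First sentence of reference 1.",
                        "Second sentence of reference 1."]),
              "Reference 2 joined the same way.",
              "Reference 3 joined the same way.",
              "Reference 4 joined the same way."]

metrics_dict = compute_individual_metrics(references, hypothesis)
print(metrics_dict)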

However, the ROUGE-L score I got from nlg-eval does not align with the official ROUGE-L scores reported for the dataset.

So my questions are:

  1. What is the right way to calculate ROUGE for a multi-sentence summary?
  2. How do I get nlg-eval to work on TAC2010?
  • There is no difference between ROUGE-L for a single sentence and multiple sentences. I guess the difference could come from different tokenization in different implementations of the metric. – Jindřich Aug 12 '21 at 08:39
  • How about the other metrics: is stacking all sentences in a summary into one sentence the right thing to do when evaluating? – NKWBTB Aug 12 '21 at 21:37

0 Answers