I am trying to run several summarization metrics (ROUGE, METEOR, BLEU, CIDEr) on the TAC2010 dataset, using a Python package called nlg-eval (https://github.com/Maluuba/nlg-eval). I tried both APIs listed in the GitHub README.

Functional API, for the entire corpus:
    from nlgeval import compute_metrics
    metrics_dict = compute_metrics(hypothesis='examples/hyp.txt',
                                   references=['examples/ref1.txt', 'examples/ref2.txt'])
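As I understand it, each line of hyp.txt is one hypothesis and the reference files are parallel to it line by line, so for TAC2010 that means one line per topic in every file. Here is a minimal sketch of how I build these files; the systems/refs dictionaries are toy placeholders for however the TAC2010 data is actually loaded:

    # Toy stand-ins for the parsed TAC2010 data (hypothetical structure):
    # topic ID -> system summary as a list of sentences
    systems = {'D1001A': ['First system sentence.', 'Second system sentence.']}
    # topic ID -> four reference summaries, each a list of sentences
    refs = {'D1001A': [['Ref A sentence.'], ['Ref B sentence.'],
                       ['Ref C sentence.'], ['Ref D sentence.']]}

    topics = sorted(systems)
    with open('hyp.txt', 'w') as f:
        for t in topics:
            f.write(' '.join(systems[t]) + '\n')      # one full summary per line
    for i in range(4):
        with open('ref%d.txt' % (i + 1), 'w') as f:
            for t in topics:
                f.write(' '.join(refs[t][i]) + '\n')  # i-th reference, same line order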
Functional API, for a single example (I stacked all the sentences of a hypothesis summary into one string, and did the same for each of its four reference summaries):
    from nlgeval import compute_individual_metrics
    # references: a list of reference strings; hypothesis: a single string
    metrics_dict = compute_individual_metrics(references, hypothesis)
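Concretely, per topic it looks like this (the sentence lists are toy placeholders):

    from nlgeval import compute_individual_metrics

    hyp_sentences = ['The system wrote this.', 'It also wrote this.']
    ref_summaries = [['First reference sentence.'],
                     ['Second reference.'],
                     ['Third reference.'],
                     ['Fourth reference.']]

    hypothesis = ' '.join(hyp_sentences)               # stack sentences into one string
    references = [' '.join(r) for r in ref_summaries]  # one string per reference
    metrics_dict = compute_individual_metrics(references, hypothesis)
    print(metrics_dict['ROUGE_L'])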
However, the ROUGE-L score I get from nlg-eval does not align with the official ROUGE-L score reported for the dataset.
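As far as I can tell, nlg-eval's ROUGE-L comes from the MS COCO caption evaluation code, which scores each text as a single token sequence, whereas the official TAC numbers come from the ROUGE-1.5.5 Perl script, which computes summary-level ROUGE-L over sentence splits (with stemming). For comparison, here is a sketch using Google's rouge-score package, whose rougeLsum metric is, I believe, meant to reproduce that summary-level behaviour; the sentences are again toy placeholders:

    # pip install rouge-score
    from rouge_score import rouge_scorer

    hyp_sentences = ['The system wrote this.', 'It also wrote this.']
    ref_sentences = ['First reference sentence.', 'Another reference sentence.']

    # 'rougeLsum' = summary-level ROUGE-L; it expects newline-separated sentences
    scorer = rouge_scorer.RougeScorer(['rougeLsum'], use_stemmer=True)
    scores = scorer.score('\n'.join(ref_sentences), '\n'.join(hyp_sentences))
    print(scores['rougeLsum'].fmeasure)
    # note: score() takes a single reference; aggregating over the four TAC
    # references (e.g. taking the max) is left to the caller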
So my questions are:

- What is the right way to calculate ROUGE for a multi-sentence summary?
- How can I get nlg-eval to work on TAC2010?