I am trying to run several summarization metrics (ROUGE, METEOR, BLEU, CIDEr) on the TAC2010 dataset, using a Python package called nlg-eval (https://github.com/Maluuba/nlg-eval). I tried both APIs listed in the GitHub README.

Functional API, for the entire corpus:
    from nlgeval import compute_metrics
    metrics_dict = compute_metrics(hypothesis='examples/hyp.txt',
                                   references=['examples/ref1.txt', 'examples/ref2.txt'])
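As I understand it, each line of hyp.txt is one hypothesis and the reference files are parallel to it line by line, so for TAC2010 that means one line per topic in every file. Here is a minimal sketch of how I build these files; the systems/refs dictionaries are toy placeholders for however the TAC2010 data is actually loaded:

    # Toy stand-ins for the parsed TAC2010 data (hypothetical structure):
    # topic ID -> system summary as a list of sentences
    systems = {'D1001A': ['First system sentence.', 'Second system sentence.']}
    # topic ID -> four reference summaries, each a list of sentences
    refs = {'D1001A': [['Ref A sentence.'], ['Ref B sentence.'],
                       ['Ref C sentence.'], ['Ref D sentence.']]}

    topics = sorted(systems)
    with open('hyp.txt', 'w') as f:
        for t in topics:
            f.write(' '.join(systems[t]) + '\n')      # one full summary per line
    for i in range(4):
        with open('ref%d.txt' % (i + 1), 'w') as f:
            for t in topics:
                f.write(' '.join(refs[t][i]) + '\n')  # i-th reference, same line order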
Functional API, for a single example (I stacked all the sentences of a hypothesis summary into one string, and did the same for each of its four reference summaries):
    from nlgeval import compute_individual_metrics
    # references: a list of reference strings; hypothesis: a single string
    metrics_dict = compute_individual_metrics(references, hypothesis)
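Concretely, per topic it looks like this (the sentence lists are toy placeholders):

    from nlgeval import compute_individual_metrics

    hyp_sentences = ['The system wrote this.', 'It also wrote this.']
    ref_summaries = [['First reference sentence.'],
                     ['Second reference.'],
                     ['Third reference.'],
                     ['Fourth reference.']]

    hypothesis = ' '.join(hyp_sentences)               # stack sentences into one string
    references = [' '.join(r) for r in ref_summaries]  # one string per reference
    metrics_dict = compute_individual_metrics(references, hypothesis)
    print(metrics_dict['ROUGE_L'])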
However, the ROUGE-L score I get from nlg-eval does not align with the official ROUGE-L score reported for the dataset.
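As far as I can tell, nlg-eval's ROUGE-L comes from the MS COCO caption evaluation code, which scores each text as a single token sequence, whereas the official TAC numbers come from the ROUGE-1.5.5 Perl script, which computes summary-level ROUGE-L over sentence splits (with stemming). For comparison, here is a sketch using Google's rouge-score package, whose rougeLsum metric is, I believe, meant to reproduce that summary-level behaviour; the sentences are again toy placeholders:

    # pip install rouge-score
    from rouge_score import rouge_scorer

    hyp_sentences = ['The system wrote this.', 'It also wrote this.']
    ref_sentences = ['First reference sentence.', 'Another reference sentence.']

    # 'rougeLsum' = summary-level ROUGE-L; it expects newline-separated sentences
    scorer = rouge_scorer.RougeScorer(['rougeLsum'], use_stemmer=True)
    scores = scorer.score('\n'.join(ref_sentences), '\n'.join(hyp_sentences))
    print(scores['rougeLsum'].fmeasure)
    # note: score() takes a single reference; aggregating over the four TAC
    # references (e.g. taking the max) is left to the caller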
So my questions are:

- What is the right way to calculate ROUGE for a multi-sentence summary?
- How can I get nlg-eval to work on TAC2010?