I am using a classification technique for multi document extractive text summarization. I have calculated f-measure, recall, precision and accuracy. What will be the ideal metric for my purpose here to evaluate the summaries generated by this method?
- This question can be made more appropriate for a stack exchange site by giving some more context about the application problem you are working on, the nature of your data, etc. Outside of this context, people can only give general heuristic advice which you might find in a textbook or searching Google, and these sites are not meant for that kind of open-ended advice. Additionally, this question is more about statistical implications of a metric choice, which is not on-topic for the specific programming-related nature of this site. Better to try stats.stackexchange.com. – ely Jan 26 '15 at 15:06
- I would not recommend migrating as it is. The question is very broad. I think it needs clarification first. What do you mean with `metric` in this context? Are you asking which of the scores `f-measure`, `recall`, `precision` and `accuracy` you should use to evaluate your predictions? – cel Jan 26 '15 at 15:14
- Sorry for the open-ended question. To clarify, I want to know what values I can calculate in order to get an idea about the quality of my summary. Some people use [ROUGE](http://www.berouge.com) for this. Are there any more such generic metrics that can be calculated to compare the quality of summaries generated using different methods? – Explorer Jan 26 '15 at 19:04
1 Answer
ROUGE calculates Recall, Precision and F-measure for a variety of metrics: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S. Here is the paper for ROUGE.
ROUGE-N counts the n-grams of the candidate summary that also appear in the reference summary; for recall this overlap is divided by the total number of n-grams in the reference, and for precision by the total number in the candidate.
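A minimal sketch of that computation, using plain n-gram counting (this is an illustration, not the official ROUGE toolkit, which adds options such as stemming and stopword removal):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=2):
    """Toy ROUGE-N: recall, precision and F-measure over n-gram overlap."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # matching n-grams, counts clipped
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f
```

For example, `rouge_n("the cat sat on the mat", "the cat was on the mat", n=2)` finds 3 matching bigrams out of 5 on each side, giving recall = precision = F = 0.6.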
ROUGE-L looks at the longest common subsequence of two texts; a subsequence can contain gaps, so 1,3,5 is a subsequence of 1,2,3,4,5.
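The longest common subsequence can be computed with standard dynamic programming; a toy ROUGE-L sketch on top of it might look like this (again just an illustration of the idea, not the official implementation):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """Toy ROUGE-L: LCS-based recall, precision and F-measure."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    recall = lcs / max(len(ref), 1)
    precision = lcs / max(len(cand), 1)
    f = 2 * recall * precision / (recall + precision) if lcs else 0.0
    return recall, precision, f
```

Note that `lcs_len([1, 3, 5], [1, 2, 3, 4, 5])` is 3, matching the gap example above.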
ROUGE-W also uses the longest common subsequence as a score but gives a higher weight to subsequences with fewer gaps.
ROUGE-S uses skip-bigrams: a skip-bigram is any pair of words that appear in sentence order, i.e. they do not have to be consecutive.
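Since a skip-bigram is just an ordered pair of words, the counting can be sketched with `itertools.combinations` (a toy version with unlimited gap; the real ROUGE-S also supports a maximum skip distance):

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens):
    """Multiset of all ordered word pairs (skip-bigrams, unlimited gap)."""
    return Counter(combinations(tokens, 2))

def rouge_s(candidate, reference):
    """Toy ROUGE-S: recall, precision and F-measure over skip-bigram overlap."""
    cand, ref = skip_bigrams(candidate.split()), skip_bigrams(reference.split())
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f
```

For "police killed the gunman" vs. "police kill the gunman", each side has 6 skip-bigrams, 3 of them shared ((police, the), (police, gunman), (the, gunman)), so recall = precision = F = 0.5.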

jksnw
- If my gold (reference) summary is human-written and has words which are synonyms of the ones in the system-generated summary, will ROUGE consider that? – Explorer Mar 26 '15 at 11:33
- No, ROUGE will not consider synonyms; it does not apply [lemmatisation](http://en.wikipedia.org/wiki/Lemmatisation). There is an option for [stemming](http://en.wikipedia.org/wiki/Stemming) though. – jksnw Mar 26 '15 at 11:46
- Is there anything parallel to ROUGE? If ROUGE does not apply lemmatisation, how is it that the community has widely accepted it? – Explorer Oct 01 '15 at 10:45