I've built a video captioning model.
It is a Seq2Seq model that takes video as input and outputs natural language.
I obtain really good training results but horrible test (inference) results:
Epoch 1 ; Batch loss: 5.181570 ; Batch accuracy: 60.28% ; Test accuracy: 00.89%
...
Epoch 128 ; Batch loss: 0.628466 ; Batch accuracy: 96.31% ; Test accuracy: 00.81%
Explanation
This accuracy is low because of my accuracy function: it compares the generated result with the caption word by word.
This computation suits training, thanks to the teacher forcing mechanism, but is not suited to inference.
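To make the problem concrete, here is a minimal sketch of the word-by-word comparison described above (the function name and token lists are illustrative, not my actual code):

```python
def word_accuracy(predicted, target):
    """Fraction of aligned positions where the predicted token equals the target token.

    This is the position-wise comparison that works under teacher forcing,
    where the decoder is fed the ground-truth prefix at every step.
    """
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)

pred = "a group of young men play a game of soccer <end>".split()
ref = "a football match is going on <end>".split()
# Only the first token ("a") lines up, so 1 of 7 compared positions match.
print(word_accuracy(pred, ref))
```

At inference there is no ground-truth prefix guiding the decoder, so a semantically correct caption that uses different words scores near zero under this metric.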
Example
True descriptions:
a football match is going on <end>
the football player are made a goal <end>
the crowd cheers as soccer players work hard to gain control of the ball <end>
Generated description:
a group of young men play a game of soccer <end>
My model correctly understands what's happening, but it doesn't express it exactly (word by word) like the expected descriptions...
For this specific example, the accuracy value is only 1/31.
How can I sensibly compute inference accuracy?
I thought about extracting the keywords of the sentences, then checking whether all keywords contained in the predicted sentence can be found somewhere in the captions.
But I would also have to check that the sentence is a correct English sentence...
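The keyword idea above could be sketched like this, assuming a simple stop-word filter as the keyword extractor (the stop-word list and function names here are illustrative placeholders, not an established method):

```python
# Illustrative stop-word list; a real one would be larger (e.g. from NLTK).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "as", "<end>"}

def extract_keywords(sentence):
    """Naive keyword extraction: keep tokens that are not stop words."""
    return {w for w in sentence.split() if w not in STOP_WORDS}

def keyword_recall(generated, references):
    """Fraction of the generated sentence's keywords found in any reference caption."""
    gen_kw = extract_keywords(generated)
    ref_kw = set().union(*(extract_keywords(r) for r in references))
    if not gen_kw:
        return 0.0
    return len(gen_kw & ref_kw) / len(gen_kw)

refs = [
    "a football match is going on <end>",
    "the football player are made a goal <end>",
    "the crowd cheers as soccer players work hard to gain control of the ball <end>",
]
gen = "a group of young men play a game of soccer <end>"
# Of the generated keywords, only "soccer" appears in a reference, so 1/6.
print(keyword_recall(gen, refs))
```

A set-overlap score like this rewards the right content words regardless of word order, but as noted it says nothing about whether the sentence is grammatical.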
Maybe you can think of an easier way to compute accuracy. Let me know!