
I've built a video captioning model.
It is a Seq2Seq model that takes video as input and outputs natural language.

I obtain really good training (batch) accuracy but horrible test/inference accuracy:

Epoch 1 ; Batch loss: 5.181570 ; Batch accuracy: 60.28% ; Test accuracy: 00.89%
...
Epoch 128 ; Batch loss: 0.628466 ; Batch accuracy: 96.31% ; Test accuracy: 00.81% 

Explanation

This accuracy is low because of my accuracy function: it compares the generated result with the reference caption word by word.

This computation is suitable for training, because of the teacher forcing mechanism, but not for inference.
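
For illustration, this is roughly the kind of comparison I mean (a simplified sketch, not my exact code):

# Simplified sketch of the word-by-word comparison described above
# (not the exact training code): tokens are compared position by position.
def word_by_word_accuracy(predicted_tokens, reference_tokens):
    matches = sum(p == r for p, r in zip(predicted_tokens, reference_tokens))
    return matches / max(len(predicted_tokens), len(reference_tokens))

predicted = "a group of young men play a game of soccer <end>".split()
reference = "a football match is going on <end>".split()
print(word_by_word_accuracy(predicted, reference))  # only the leading "a" matches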

Example

video1734

True descriptions:

  • a football match is going on <end>
  • the football player are made a goal <end>
  • the crowd cheers as soccer players work hard to gain control of the ball <end>

Generated description:

a group of young men play a game of soccer <end>

My model correctly understands what's happening, but it doesn't express it exactly (word by word) like the expected description...
For this specific example, the accuracy value will be only 1/31.

How can I sensibly compute inference accuracy?

I thought about extracting the keywords of the sentences, then checking whether the keywords contained in the predicted sentence can all be found somewhere in the captions.
But I would also have to check that the predicted sentence is a correct English sentence...
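
For instance, a rough sketch of this keyword idea (the stop-word list below is just an arbitrary placeholder):

# Rough sketch of the keyword-overlap idea; the stop-word list is a
# placeholder assumption, not a settled choice.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "as", "on", "<end>"}

def keywords(sentence):
    return {word for word in sentence.split() if word not in STOP_WORDS}

def keyword_recall(prediction, captions):
    predicted_keywords = keywords(prediction)
    caption_keywords = set().union(*(keywords(c) for c in captions))
    if not predicted_keywords:
        return 0.0
    return len(predicted_keywords & caption_keywords) / len(predicted_keywords)

captions = [
    "a football match is going on <end>",
    "the football player are made a goal <end>",
    "the crowd cheers as soccer players work hard to gain control of the ball <end>",
]
print(keyword_recall("a group of young men play a game of soccer <end>", captions))
# only "soccer" overlaps with the captions here, giving 1/6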

Maybe you can think of an easier way to compute accuracy. Tell me!

wakobu

1 Answer


Use the BLEU score, a.k.a. Bilingual Evaluation Understudy Score, to compare hypotheses and references.

import nltk.translate.bleu_score

def bleu_score(hypotheses, references):
    return nltk.translate.bleu_score.corpus_bleu(references, hypotheses)

Example:

# two references for one document
from nltk.translate.bleu_score import corpus_bleu
references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'test']]]
hypotheses = [['this', 'is', 'a', 'test']]
score = corpus_bleu(references, hypotheses)
print(score)

Output:

1.0

Other methods are:

  1. METEOR

  2. ROUGE_L

  3. CIDEr

Follow: https://github.com/arjun-kava/Video2Description/blob/VideoCaption/cocoeval.py
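
If you don't want to pull in the COCO evaluation code, ROUGE-L can also be sketched directly from its definition (an F-measure over the longest common subsequence). This is only an illustrative implementation of that definition, not the official one:

# Illustrative ROUGE-L sketch based on the longest common subsequence (LCS);
# not the official implementation.
def lcs_length(a, b):
    # dynamic-programming table for the LCS length of token lists a and b
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, token_a in enumerate(a, 1):
        for j, token_b in enumerate(b, 1):
            if token_a == token_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(hypothesis, reference, beta=1.2):
    lcs = lcs_length(hypothesis, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hypothesis)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

print(rouge_l('a group of young men play a game of soccer'.split(),
              'a football match is going on'.split()))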

Arjun Kava
  • Hmm, using the BLEU score, I obtain a score of 1.6034157163765524e-231 for the same example. I think BLEU is meant for sentences that have the same meaning but are written in a different way. Here, some captions describe different details of the videos. – wakobu Aug 26 '19 at 08:21
  • You can use the other methods listed in the updated answer. Generally, versions of BLEU are the standard for validating this kind of problem. – Arjun Kava Aug 26 '19 at 11:53
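
A note on the near-zero score mentioned in the comments: without smoothing, NLTK substitutes an extremely tiny constant whenever a higher-order n-gram has zero matches, which is why scores like 1e-231 appear on short captions. NLTK's SmoothingFunction avoids this; a possible usage sketch, reusing the example captions from the question:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[
    'a football match is going on <end>'.split(),
    'the football player are made a goal <end>'.split(),
    'the crowd cheers as soccer players work hard to gain control of the ball <end>'.split(),
]]
hypotheses = ['a group of young men play a game of soccer <end>'.split()]

# method1 smooths zero n-gram counts instead of letting a single
# zero collapse the whole geometric mean
print(corpus_bleu(references, hypotheses, smoothing_function=SmoothingFunction().method1))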