I've built a video captioning model.
It is a Seq2Seq model that takes video as input and outputs natural language.
I obtain really good training results but horrible test (inference) results:
Epoch 1 ; Batch loss: 5.181570 ; Batch accuracy: 60.28% ; Test accuracy: 00.89%
...
Epoch 128 ; Batch loss: 0.628466 ; Batch accuracy: 96.31% ; Test accuracy: 00.81%
Explanation
This accuracy is low because of my accuracy function: it compares the generated result with the caption word by word.
This computation suits training, thanks to the teacher forcing mechanism, but is not suited to inference.
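To make the problem concrete, here is a minimal sketch of the word-by-word comparison described above (the function name and token lists are illustrative, not my actual code):

```python
def word_accuracy(predicted, target):
    """Fraction of aligned positions where the predicted token equals the target token.

    This is the position-wise comparison that works under teacher forcing,
    where the decoder is fed the ground-truth prefix at every step.
    """
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)

pred = "a group of young men play a game of soccer <end>".split()
ref = "a football match is going on <end>".split()
# Only the first token ("a") lines up, so 1 of 7 compared positions match.
print(word_accuracy(pred, ref))
```

At inference there is no ground-truth prefix guiding the decoder, so a semantically correct caption that uses different words scores near zero under this metric.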
Example
True descriptions:
a football match is going on <end>
the football player are made a goal <end>
the crowd cheers as soccer players work hard to gain control of the ball <end>
Generated description:
a group of young men play a game of soccer <end>
My model correctly understands what's happening, but it doesn't express it exactly (word by word) like the expected descriptions...
For this specific example, the accuracy value is only 1/31.
How can I sensibly compute inference accuracy?
I thought about extracting the keywords of the sentences, then checking whether all keywords contained in the predicted sentence can be found somewhere in the captions.
But I would also have to check that the sentence is a correct English sentence...
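The keyword idea above could be sketched like this, assuming a simple stop-word filter as the keyword extractor (the stop-word list and function names here are illustrative placeholders, not an established method):

```python
# Illustrative stop-word list; a real one would be larger (e.g. from NLTK).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "as", "<end>"}

def extract_keywords(sentence):
    """Naive keyword extraction: keep tokens that are not stop words."""
    return {w for w in sentence.split() if w not in STOP_WORDS}

def keyword_recall(generated, references):
    """Fraction of the generated sentence's keywords found in any reference caption."""
    gen_kw = extract_keywords(generated)
    ref_kw = set().union(*(extract_keywords(r) for r in references))
    if not gen_kw:
        return 0.0
    return len(gen_kw & ref_kw) / len(gen_kw)

refs = [
    "a football match is going on <end>",
    "the football player are made a goal <end>",
    "the crowd cheers as soccer players work hard to gain control of the ball <end>",
]
gen = "a group of young men play a game of soccer <end>"
# Of the generated keywords, only "soccer" appears in a reference, so 1/6.
print(keyword_recall(gen, refs))
```

A set-overlap score like this rewards the right content words regardless of word order, but as noted it says nothing about whether the sentence is grammatical.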
Maybe you can think of an easier way to compute accuracy. Let me know!