
I am debugging a sequence-to-sequence model and purposely tried to perfectly overfit a small dataset of ~200 samples (sentence pairs with lengths between 5 and 50). I am using negative log-likelihood loss in PyTorch. I get a very low loss (~1e-5), but the accuracy on the same dataset is only 33%.

I also trained the model on just 3 samples and obtained 100% accuracy, yet the loss during training was still nonzero (again in the region of ~1e-5). I was under the impression that negative log-likelihood only produces a loss if there is a mismatch between the predicted and target labels?

Is a bug in my code likely?

1 Answer


There is no bug in your code.
The way things usually work in deep nets is that the network predicts logits (i.e., unnormalized log-likelihoods). These logits are then transformed into probabilities using softmax (or a sigmoid function). Cross-entropy is finally evaluated based on the predicted probabilities.
The advantage of this approach is that it is numerically stable and easy to train with. On the other hand, because of the softmax you can never have "perfect" 0/1 probabilities for your predictions: even when your network has perfect accuracy, it will never assign probability 1 to the correct prediction, only something "close to one". As a result, the loss will always be positive (albeit small).
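You can verify this with a small numerical sketch (pure Python here, but the same holds for PyTorch's `NLLLoss`/`CrossEntropyLoss`): even a very confident logit vector for the correct class yields a small but strictly positive negative log-likelihood.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A very confident (but finite) logit for the correct class, index 0.
logits = [10.0, 0.0, 0.0]
probs = softmax(logits)

# Negative log-likelihood of the correct class.
nll = -math.log(probs[0])

print(probs[0])  # close to 1, but strictly less than 1
print(nll)       # small but strictly positive (~1e-4 here)
```

The prediction is correct (class 0 wins by a wide margin), yet the loss never reaches exactly zero; pushing it lower would require the logit gap to grow without bound.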

Shai
  • Thank you, I understand. What about the low loss/low accuracy situation on the training run with 200 samples? Isn't that unusual? – headache666 Jul 14 '20 at 08:14
  • @headache666 how many labels do you have in the 200-dataset? how are the labels distributed? – Shai Jul 14 '20 at 09:18
  • I am trying to parse natural language utterances into meaning representations much like lambda calculus. The input vocabulary has ~130 words and the output vocabulary has ~70 tokens. I give more detail on my problem here: https://datascience.stackexchange.com/questions/77689/trying-to-reproduce-paper-results-in-neural-semantic-parsing I am using this dataset: https://github.com/jkkummerfeld/text2sql-data/blob/master/data/non-sql-data/geography-logic.txt I guess parentheses and identifiers are overrepresented in the target dataset. – headache666 Jul 14 '20 at 09:45
  • Also the same target meaning representations are paired with multiple input sentences in the dataset. – headache666 Jul 14 '20 at 10:03