
I have been testing textsum with both the binary toy data and the Gigaword data: I trained a model on each and then ran decoding. The beam search decoder gives me all '[UNK]' results with both sets of data and models. I was using the default parameter settings.

I first changed the data interface in data.py and batch_reader.py to read and parse the article and abstract fields from the Gigaword dataset. I trained a model for over 90K mini-batches on roughly 1.7 million documents. When I then decoded a separate test set, it returned all '[UNK]' results (see the "decoder result from model trained with gigaword" link).
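An alternative to modifying the readers is to convert the Gigaword article/abstract pairs into the binary format the stock data.py already consumes: an 8-byte length prefix followed by a serialized tf.Example carrying 'article' and 'abstract' features, the same layout textsum's data_convert_example.py produces. A minimal sketch; write_binary and the pairs iterable are hypothetical names, and the format details should be checked against data_convert_example.py:

```python
# Sketch: serialize (article, abstract) pairs into the binary record format
# that textsum's stock data.py reads. Assumes the 8-byte length header plus
# serialized tf.Example layout used by data_convert_example.py.
import struct
from tensorflow.core.example import example_pb2

def write_binary(pairs, out_path):
    """pairs: iterable of (article, abstract) strings, with <s>/</s> markers."""
    with open(out_path, 'wb') as writer:
        for article, abstract in pairs:
            tf_example = example_pb2.Example()
            tf_example.features.feature['article'].bytes_list.value.extend(
                [article.encode('utf-8')])
            tf_example.features.feature['abstract'].bytes_list.value.extend(
                [abstract.encode('utf-8')])
            example_str = tf_example.SerializeToString()
            # 8-byte length header, then the serialized proto.
            writer.write(struct.pack('q', len(example_str)))
            writer.write(example_str)
```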

Then I used the binary data that comes with the textsum code to train a small model for fewer than 1K mini-batches and tested on that same binary data. It gives all '[UNK]' results in the decoding file except a few 'for' and '.' tokens (see the "decoder result from model trained with binary data" link). I also checked the training loss in TensorBoard, and it shows that training converged.

In training and testing, I didn't change any of the default settings. Has anyone tried the same thing and run into the same issue?

Rui
  • I am seeing the same thing. I unfortunately do not have access to the Gigaword dataset, but after training on the toy dataset and then running decode, I get results similar to your "decoder result from model trained with binary data" link. I was going to post here asking if there is something I am missing and saw that you had already posted. Any luck figuring this out? – xtr33me Sep 25 '16 at 22:25
  • Yes, the '[UNK]' issue is caused by the vocab. I regenerated the vocab on the Gigaword data and trained on it. Now it produces some meaningful results. – Rui Oct 07 '16 at 18:42
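Regenerating the vocab, as in the fix above, comes down to counting token frequencies over the training articles and abstracts and writing one "word count" line per token. A minimal sketch, assuming whitespace tokenization; the special-token names (<s>, </s>, <UNK>, <PAD>) follow the released textsum data.py, and build_vocab/token_streams are hypothetical names:

```python
# Sketch: rebuild a textsum-style vocab file from the training corpus.
# Each output line is "word count"; special tokens are written first so
# data.py's vocab checks pass.
import collections

def build_vocab(token_streams, out_path, max_words=200000):
    counter = collections.Counter()
    for tokens in token_streams:     # e.g. article.split() and abstract.split()
        counter.update(tokens)
    with open(out_path, 'w') as f:
        for special in ('<s>', '</s>', '<UNK>', '<PAD>'):
            f.write('%s 1\n' % special)   # ensure special tokens are present
            counter.pop(special, None)    # avoid writing them twice
        for word, count in counter.most_common(max_words):
            f.write('%s %d\n' % (word, count))
```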

1 Answer


I think I found why this happens, at least with the given toy data set. In my case, I trained and tested with the same toy set provided (the data and vocab files). The reason I'm getting [UNK]s in the decoder result is that the vocab file doesn't contain the words that appear in the summaries of the toy data set. Because of that, the decoder can't find the words to decode with and falls back to [UNK] in the final result.
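A quick way to verify this diagnosis is to measure how many of the reference-summary tokens the vocab file actually covers. A minimal sketch, assuming whitespace tokenization and the "word count" vocab layout that textsum's data.Vocab reads; vocab_coverage and summaries are hypothetical names:

```python
# Sketch: fraction of summary tokens present in the vocab file.
# A value near 0.0 would explain an all-[UNK] decode.
def vocab_coverage(vocab_path, summaries):
    with open(vocab_path) as f:
        vocab = {line.split()[0] for line in f if line.strip()}
    total = known = 0
    for summary in summaries:
        for token in summary.split():
            total += 1
            known += token in vocab
    return known / float(total)
```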

TUMU. S