
It is clear how to train an encoder-decoder model for translation: each source sequence has its corresponding target sequence (the translation). But in the case of text summarization, the abstract is much shorter than its article. According to Urvashi Khandelwal's Neural Text Summarization, each source sentence has its abstract (shorter or longer). But I hardly believe any such dataset exists where each sentence has its corresponding abstract. So, if I am right, what are the possible ways to train such a model? Otherwise, are there any free datasets for text summarization?
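
To make my understanding concrete, here is a minimal sketch of why I think the training interface is the same for both tasks: translation and summarization both reduce to (source, target) text pairs, and only how the pairs are obtained differs. All strings below are made up for illustration.

    # Summarization trains like translation: both are (source, target) pairs.
    translation_pairs = [
        ("das haus ist klein .", "the house is small ."),
    ]

    summarization_pairs = [
        # whole article as the source, short abstract/headline as the target
        ("the stock market fell sharply today after the central bank ...",
         "stocks drop on rate fears"),
    ]

    def to_training_examples(pairs, tokenize=str.split):
        """Convert (source, target) strings into token sequences for an
        encoder-decoder model; str.split is a stand-in tokenizer."""
        return [(tokenize(src), tokenize(tgt)) for src, tgt in pairs]

    print(to_training_examples(translation_pairs))
    print(to_training_examples(summarization_pairs))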

ichernob
  • Did you read the paper that you linked? They mention the ACL anthology dataset in there. – Aaron Apr 18 '17 at 22:10
  • @Aaron, of course I read it. As I understand, it contains papers with their abstracts. Am I right? – ichernob Apr 19 '17 at 07:33
  • Yes. I think they use just the title of the paper and the abstract in their experiments. People do other tricks to get data like using a short news article and the headline as the summary. – Aaron Apr 19 '17 at 16:07
  • @Aaron, so it is all about tricks? – ichernob Apr 19 '17 at 19:15

2 Answers


You're right that there are very few large datasets created specifically for training text summarization models. People tend to take existing data and find ways to turn it into a summarization problem. You can read other text summarization papers to see what they do.
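
For instance, here is a minimal sketch of the headline trick mentioned in the comments: treat each news headline as the "summary" of its article. The file name and JSON field names are made-up assumptions for illustration, not a real dataset.

    import json

    def build_summarization_pairs(path):
        """Build (article, summary) pairs from a generic news dump.
        The JSONL layout and the "body"/"headline" field names are
        assumptions, not a real dataset schema."""
        pairs = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                article, headline = record["body"], record["headline"]
                # Filter degenerate pairs; the thresholds are arbitrary.
                if len(article.split()) > 50 and len(headline.split()) >= 4:
                    pairs.append((article, headline))
        return pairs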

Aaron

Research tends to use datasets such as CNN/Daily Mail (news articles paired with their bullet-point highlights) and Gigaword (a first sentence paired with its headline).
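
For instance, assuming the Hugging Face `datasets` library (a newer tool than the approaches in this answer), CNN/Daily Mail can be pulled down in a few lines; the "article"/"highlights" field names come from that dataset's schema:

    from datasets import load_dataset

    # "cnn_dailymail" config "3.0.0": (article, highlights) pairs
    train = load_dataset("cnn_dailymail", "3.0.0", split="train")
    example = train[0]
    print(example["article"][:300])   # source document
    print(example["highlights"])      # reference summary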

If you want to know more about how to use these models effectively, this blog series goes into detail on how to train a text summarization model using the newest approaches. It also collects multiple implementations found online and reimplements them in Google Colab, so whatever the power of your computer, you can always try out these datasets for free.

amr zaki