I'm a newbie in NLP, and I was wondering whether it is a good idea to summarize a document that has already been classified into a topic (through methods such as LDA) by combining the word embeddings retrieved from Word2Vec with the topic-word distribution that has already been generated, to come up with a sentence-scoring algorithm. Does this sound like a reasonable approach for creating a summary of a document?
I would like to suggest this post to you.
Instead of using the Skip-Thought encoder in Step 4, you could use a pre-trained Word2Vec model from Google or Facebook (check the FastText documentation to see how to load the latter or to choose another language).
In general, you will have the following steps:
- Text cleaning (remove numbers, but keep punctuation).
- Language detection (to identify and remove stopwords, and to pick the appropriate Word2Vec model).
- Sentence tokenization (after which you can remove punctuation).
- Token encoding (with the chosen Word2Vec model).
- Clustering the resulting sentence embeddings with k-means (you must specify the number of clusters; it will equal the number of sentences in the final summary).
- Obtaining the summary (each summary sentence is the sentence closest to the center of one cluster; see the original post for more details and code samples).
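The steps above can be sketched roughly as follows. Note that this is a minimal sketch under some assumptions: the `embed()` function here is a hypothetical stand-in that maps each token to a deterministic pseudo-random vector — in practice you would replace it with a lookup into a real pre-trained Word2Vec/FastText model — and tokenization is simplified to whitespace splitting.

```python
# Minimal sketch of clustering-based extractive summarization.
import zlib
import numpy as np
from sklearn.cluster import KMeans

def embed(token, dim=50):
    # Hypothetical stand-in for a real Word2Vec lookup (model[token]):
    # a deterministic pseudo-random vector seeded by the token's CRC32.
    rng = np.random.default_rng(zlib.crc32(token.encode("utf-8")))
    return rng.standard_normal(dim)

def sentence_vector(sentence, dim=50):
    # Average the word vectors of the sentence's tokens.
    tokens = sentence.lower().split()
    return np.mean([embed(t, dim) for t in tokens], axis=0)

def summarize(sentences, n_sentences=2):
    # Embed every sentence, then cluster the embeddings with k-means;
    # the number of clusters equals the desired summary length.
    X = np.stack([sentence_vector(s) for s in sentences])
    km = KMeans(n_clusters=n_sentences, n_init=10, random_state=0).fit(X)
    picked = []
    for c in range(n_sentences):
        idx = np.where(km.labels_ == c)[0]
        # Pick the sentence whose embedding is closest to the cluster center.
        dists = np.linalg.norm(X[idx] - km.cluster_centers_[c], axis=1)
        picked.append(idx[np.argmin(dists)])
    # Return the chosen sentences in their original document order.
    return [sentences[i] for i in sorted(picked)]
```

With real pre-trained embeddings, semantically similar sentences land in the same cluster, so the centroid-nearest sentence acts as a representative of that cluster.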
I hope this helps. Good luck! :)

Tangui

O. Kaminska