I'm working on a keyword extraction task in which I'd like to extract phrases instead of single words. To chunk each sentence into meaningful parts, I first run part-of-speech tagging and then, based on a linguistic rule, extract only the noun phrases. Each noun phrase is a candidate keyword. However, since I only need to extract 'k' keywords for each document, I need a good way to rank the extracted noun phrases. A simple way is to calculate the TF-IDF score of each term within a noun phrase and then score the noun phrase as the product of its constituent terms' TF-IDF scores. Does anyone have a better approach, or any thoughts on this simple, naive solution?
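For reference, the naive scoring described above could be sketched roughly as follows. This is only an illustration under stated assumptions: documents and phrases are already tokenized into lists of lowercase strings, the IDF is a smoothed variant chosen here for convenience, and the helper names `tfidf_scores` and `rank_phrases` are hypothetical.

```python
import math
from collections import Counter

def tfidf_scores(doc_tokens, corpus):
    """Return {term: tf-idf} for one document against a small corpus.

    Assumption: corpus is a list of token lists that includes the document.
    """
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    scores = {}
    for term, count in tf.items():
        # Document frequency: number of documents containing the term.
        df = sum(1 for d in corpus if term in d)
        # Smoothed IDF (an assumption of this sketch), always positive.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = (count / len(doc_tokens)) * idf
    return scores

def rank_phrases(phrases, term_scores, k):
    """Score each phrase as the product of its terms' tf-idf; return the top k."""
    scored = []
    for phrase in phrases:
        # Terms missing from term_scores contribute 0 and zero out the phrase.
        score = math.prod(term_scores.get(t, 0.0) for t in phrase)
        scored.append((" ".join(phrase), score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

corpus = [
    ["keyword", "extraction", "with", "noun", "phrases"],
    ["noun", "phrases", "are", "ranked"],
    ["tf", "idf", "ranking"],
]
scores = tfidf_scores(corpus[0], corpus)
top = rank_phrases([["keyword", "extraction"], ["noun", "phrases"]], scores, k=1)
```

With term frequency normalized by document length, per-term scores are usually below 1, so a product tends to shrink as phrases get longer. That is worth keeping in mind when comparing phrases of different lengths.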
This is a totally valid approach. Once you've done this, look at what your approach missed, and see if there's a way to tweak the system to produce better results. Do this until you run out of time and/or money. – Dan Oct 16 '15 at 01:02
1 Answer
You can use a sentence splitter, e.g. the one in OpenNLP, instead of extracting phrases based on noun identification, since the accuracy of that approach can be low in practice (a phrase can contain multiple nouns, and the hardcoded linguistic rule you employ may not be robust, i.e. may not work for all possible cases). Extracting phrases with a statistical model, as in OpenNLP, could be better because it comes with a confidence score.
In any case, once you have extracted the phrases, you can extract keywords by applying the typical NLP pipeline and then rank the keywords using tf-idf.
I wouldn't recommend multiplying the tf-idf scores within a phrase, because the result would not be meaningful. But that may depend on your application. What goal are you ranking the phrases for?
Do you need a score similar to tf-idf, but at the sentence level? If you want to assign a score to the entire phrase, work with both the tf-idf vector of its terms and the confidence of the sentence extraction.
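One possible way to combine the two signals is sketched below. The combination rule itself (mean tf-idf of the terms, scaled by the extraction confidence) is an assumption made for illustration, not something prescribed by OpenNLP or by tf-idf, and `phrase_score` is a hypothetical helper name.

```python
def phrase_score(term_weights, confidence):
    """Average tf-idf of a phrase's terms, scaled by extraction confidence.

    term_weights: list of tf-idf values, one per term in the phrase.
    confidence:   extraction confidence, assumed to lie in [0, 1].
    """
    if not term_weights:
        return 0.0
    mean_tfidf = sum(term_weights) / len(term_weights)
    return mean_tfidf * confidence

# Example with made-up weights and confidence values.
example = phrase_score([0.2, 0.4], confidence=0.5)
```

Using the mean rather than the product keeps the score on a comparable scale for phrases of different lengths; other aggregations (max, sum) may suit other applications better.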
Or, if you want a similarity measure between phrases, you could keep the tf-idf vector of each sentence and apply cosine similarity or another similarity technique.
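As a minimal sketch of the cosine option, assuming each sentence's tf-idf vector is stored as a sparse `{term: weight}` dictionary (a representation chosen here for illustration):

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for empty or all-zero vectors
    return dot / (norm_a * norm_b)

# Made-up tf-idf weights for two phrases sharing one term.
a = {"keyword": 0.3, "extraction": 0.5}
b = {"keyword": 0.3, "ranking": 0.4}
similarity = cosine_similarity(a, b)
```

Because tf-idf weights are non-negative, the result falls between 0 (no shared terms) and 1 (same direction).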
