Preserving only domain-specific keywords?

Question

I am trying to determine the most popular keywords for a certain class of documents in my collection. Assuming that the domain is "computer science" (which, of course, includes networking, computer architecture, etc.), what is the best way to preserve these domain-specific keywords from text? I tried using WordNet, but I am not quite sure how best to use it to extract this information.

Is there any well-known list of words that I can use as a whitelist, considering that I am not aware of all the domain-specific keywords beforehand? Or are there any good NLP/machine learning techniques to identify domain-specific keywords?

score 6 · Accepted Answer · answered Nov 02 '11 at 22:00

You need a large training set of documents. A small subset of this collection (but still a large set of documents) should represent the given domain. Using nltk, calculate word statistics, taking morphology into account, and filter out stopwords. A good statistic is TF*IDF, which is roughly the number of occurrences of a word in the domain subset divided by the number of documents in the whole collection that contain the word. The keywords are the words with the highest TF*IDF.
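A minimal sketch of that ranking, under some assumptions: it uses only the standard library instead of nltk, skips morphology (no stemming or lemmatization), uses a tiny illustrative stopword list, and applies the common log-scaled IDF rather than the plain ratio described above. The names `tokenize` and `domain_keywords` are made up for this example, and the domain documents are assumed to be part of the whole collection.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for", "on", "was", "at"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def domain_keywords(domain_docs, all_docs, k=5):
    """Rank words by (count in the domain subset) * log(N / document frequency).

    Assumes domain_docs is a subset of all_docs, so every domain word
    has a document frequency of at least 1.
    """
    # Term frequency over the domain subset only.
    tf = Counter(w for doc in domain_docs for w in tokenize(doc))
    # Document frequency over the whole collection (each doc counts once per word).
    df = Counter(w for doc in all_docs for w in set(tokenize(doc)))
    n = len(all_docs)
    scores = {w: count * math.log(n / max(df[w], 1)) for w, count in tf.items()}
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

Calling `domain_keywords(domain_docs, all_docs, k=20)` returns the 20 highest-scoring words. Words that are frequent in the domain subset but appear in few documents overall rise to the top.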

Fred Foo · Answer 2 · 2011-11-05T11:21:57.397

I've used parsimonious language models (LMs, 1, 3) with some success on similar tasks; these separate document-specific terms from general corpus terms. These are known to be stronger than tf-idf statistics, but require setting a parameter when fitting them.

You can find my Python implementation here; to use it, concatenate all your documents for each theme into a single document, then build a ParsimoniousLM from the various themes and fetch the .top(K) terms per document.
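For readers without access to the linked code, here is a rough, self-contained sketch of how a parsimonious LM can be fitted with EM. It is not the implementation referred to above, and the function name `parsimonious_lm`, its arguments, and the default values are assumptions made for illustration. The parameter `lam` plays the role of the parameter that has to be set when fitting: it is the weight of the document model against the corpus model. The corpus tokens are assumed to include the document's own tokens.

```python
from collections import Counter

def parsimonious_lm(doc_tokens, corpus_tokens, lam=0.1, iterations=50):
    """Estimate P(t|D) by EM, shifting mass away from terms the corpus explains well.

    Assumes corpus_tokens includes doc_tokens, so every document term
    has a non-zero corpus probability.
    """
    tf = Counter(doc_tokens)
    corpus_tf = Counter(corpus_tokens)
    corpus_total = sum(corpus_tf.values())
    p_corpus = {t: corpus_tf[t] / corpus_total for t in tf}

    # Start from the maximum-likelihood document model.
    doc_total = sum(tf.values())
    p_doc = {t: c / doc_total for t, c in tf.items()}

    for _ in range(iterations):
        # E-step: expected count of each term attributed to the document model.
        expected = {
            t: tf[t] * lam * p_doc[t] / ((1 - lam) * p_corpus[t] + lam * p_doc[t])
            for t in tf
        }
        # M-step: renormalize the expected counts into a distribution.
        total = sum(expected.values())
        p_doc = {t: e / total for t, e in expected.items()}
    return p_doc

def top_terms(model, k):
    """Return the k terms with the highest probability under the fitted model."""
    return sorted(model, key=lambda t: (-model[t], t))[:k]
```

Following the recipe above, `doc_tokens` would be the concatenated tokens of one theme and `corpus_tokens` the tokens of all themes together; `top_terms(model, K)` then lists the K most theme-specific terms.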
