I wanted some ideas about building a tool that can scan text sentences (written in English) and build a keyword ranking based on the most frequent words or phrases within the texts.

This would be very similar to Twitter trends, wherein Twitter detects and reports the top 10 words within the tweets.

I have identified the high-level steps in the algorithm as follows:

  1. Scan the text and remove all the common, frequent words (such as "the", "is", "are", "what", "at", etc.).
  2. Add the remaining words to a hashmap. If the word is already in the map, increment its count.
  3. To get the top 10 words, iterate through the hashmap and find the top 10 counts.
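A minimal sketch of these steps in Python (the stop-word set here is a tiny hand-rolled assumption, just for illustration):

    from collections import Counter
    import re

    # Step 1: a tiny illustrative stop-word list; a real one would be much larger.
    STOP_WORDS = {"the", "is", "are", "what", "at", "a", "an", "and", "of", "to"}

    def top_keywords(text, n=10):
        # Tokenize, lowercase, and drop the common words (step 1).
        words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
        # Count occurrences in a hashmap (step 2) and take the top n counts (step 3).
        return Counter(words).most_common(n)

    print(top_keywords("This honey is very good. Honey is good for you."))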

Steps 2 and 3 are straightforward, but for step 1 I do not know how to detect the important words within a text and separate them from the common words (prepositions, conjunctions, etc.).

Also, if I want to track phrases, what could be the approach? For example, if I have a text saying "This honey is very good", I might want to track "honey" and "good", but I may also want to track the phrases "very good" or "honey is very good".

Any suggestions would be greatly appreciated.

Thanks in advance

Khairul

3 Answers


For detecting phrases, I suggest using a chunker. You can use one provided by an NLP toolkit such as OpenNLP or Stanford CoreNLP.

NOTE

  • "honey is very good" is not a phrase; it is a clause. "very good" is a phrase.
  • In information retrieval systems, those common words are called stop words.
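As a rough illustration, here is a lightweight chunking sketch using NLTK's RegexpParser (the grammar is a simplistic assumption; OpenNLP and CoreNLP provide full statistical chunkers):

    import nltk  # assumes the punkt and POS-tagger models have been downloaded

    sentence = "This honey is very good"
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # Toy grammar: an adjective phrase is zero or more adverbs followed by an adjective.
    chunker = nltk.RegexpParser("ADJP: {<RB>*<JJ>}")
    tree = chunker.parse(tagged)

    # Print the chunked phrases, e.g. "very good".
    for subtree in tree.subtrees(filter=lambda t: t.label() == "ADJP"):
        print(" ".join(word for word, tag in subtree.leaves()))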

Actually, your step 1 would be quite similar to step 3, in the sense that you may want to build a reference database of the most common words in the English language in the first place. Such a list is easily available on the internet (Wikipedia even has an article referencing the 100 most common words in the English language). You can store those words in a hashmap and, while scanning your text, simply ignore the common tokens.

If you don't trust Wikipedia and the existing lists of common words, you can build your own database. For that purpose, just scan thousands of tweets (the more the better) and build your own frequency chart.
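A sketch of that idea, assuming tweets is simply a list of strings you have collected:

    from collections import Counter
    import re

    def build_stop_words(tweets, top_n=100):
        # Count every token across the corpus; the most frequent tokens
        # become the "common words" to ignore later.
        counts = Counter(w for t in tweets for w in re.findall(r"[a-z']+", t.lower()))
        return {word for word, _ in counts.most_common(top_n)}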

You're facing an n-gram-like problem.

Do not reinvent the wheel. What you are trying to do has been done thousands of times; just use existing libraries or pieces of code (check the External Links section of the n-gram Wikipedia page).
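To show the idea, here is a hand-rolled n-gram counter (libraries such as NLTK ship an ngrams helper, so in practice you would use that instead):

    from collections import Counter

    def ngrams(tokens, n):
        # Slide a window of size n over the token list.
        return zip(*(tokens[i:] for i in range(n)))

    tokens = "this honey is very good".split()
    counts = Counter()
    for n in (1, 2, 3):  # unigrams, bigrams, trigrams
        counts.update(" ".join(g) for g in ngrams(tokens, n))
    print(counts.most_common(5))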


Check out the NLTK library. It has code that covers steps 1, 2, and 3:

  1. Removing common words can be done using its stop-word corpus (a stemmer can additionally normalize word forms).

  2. and 3. Getting the most common words can be done with FreqDist.
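Roughly like this (assumes the stopwords corpus and tokenizer models have been downloaded with nltk.download):

    import nltk
    from nltk.corpus import stopwords
    from nltk import FreqDist

    text = "This honey is very good. Honey is good for you."
    stop = set(stopwords.words("english"))
    # Keep alphabetic tokens that are not stop words, then rank by frequency.
    words = [w for w in nltk.word_tokenize(text.lower()) if w.isalpha() and w not in stop]
    print(FreqDist(words).most_common(10))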

You can also use tools from Stanford NLP for tracking phrases in your text.
