
Is there any open-source software toolkit that compares lexical-level similarities among words and groups similar words together? For example, "Blue jean", "Blue jeans", and "blue jea" (misspelled) should be grouped together. I don't need to look for semantic similarity here.

asked by walkman (edited by Has QUIT--Anony-Mousse)

2 Answers


Try the Natural Language Toolkit (NLTK): http://nltk.org/

Here's a rather abstract treatment of the Brown clustering algorithm: http://www.cs.columbia.edu/~cs4705/lectures/brown.pdf

The standard similarity metric between words is the Levenshtein distance (the linked article describes the Damerau–Levenshtein variant, which also counts transpositions): http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
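
As a quick sketch of how that plays out on the phrases from the question (assuming NLTK is installed; the word list is just an illustration, not something from the answer above):

    # Sketch: pairwise Levenshtein (edit) distances with NLTK.
    from nltk.metrics.distance import edit_distance

    words = ["blue jeans", "blue jean", "blue jea", "blue lucent"]

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            print(a, "<->", b, "=", edit_distance(a, b))

    # "blue jeans" <-> "blue jean" and "blue jean" <-> "blue jea" come out at
    # distance 1, while "blue lucent" is several edits away from all of them.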

– john mangual
  • Thanks for answering, and sorry for not being very clear. My question is less about how to calculate the distance between two words and more about how to choose a cut-off value. Say I have a list of phrases: Blue jeans, Blue jean, Blue Lucent. How do I decide that "blue jeans" and "blue jean" should be grouped together, while "blue" gets grouped with "lucent" rather than with "blue jeans"? (Apologies for a poor example….) – walkman Apr 01 '13 at 13:07
  • @walkman You can use standard clustering algorithms like k-means to do so, using the Levenshtein distance (a.k.a. edit distance) as John suggested. – Memming Apr 01 '13 at 14:56
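
Following up on that comment: k-means needs points in a vector space, so one way to work straight from a precomputed edit-distance matrix (a sketch under my own assumptions, not something either answer spells out) is agglomerative hierarchical clustering, where the cut-off the comment asks about becomes the distance threshold used to cut the dendrogram. The word list and the threshold t=2 are hand-picked for illustration:

    # Sketch: group phrases by hierarchical clustering over pairwise edit distances.
    import numpy as np
    from nltk.metrics.distance import edit_distance
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    words = ["blue jeans", "blue jean", "blue jea", "blue lucent"]

    # Symmetric matrix of pairwise Levenshtein distances.
    n = len(words)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = edit_distance(words[i], words[j])

    # Average-linkage clustering on the condensed form of the distance matrix.
    Z = linkage(squareform(dist), method="average")

    # Cut the dendrogram: phrases merged below the threshold share a label.
    labels = fcluster(Z, t=2, criterion="distance")
    for label, word in zip(labels, words):
        print(label, word)

With this threshold the three "blue jean(s)" variants end up in one group and "blue lucent" in another; raising or lowering t is exactly the cut-off decision from the comment thread.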

I believe you are more interested in stemming than in actual clustering (e.g. using the Levenshtein distance): an unsupervised textual-similarity measure is far too likely to produce false positives.

From a lexical similarity point of view,

blue jean
blue dean

are also just one character apart. Yet "blue dean" is a rather unlikely typo for "blue jean".

You really want to use something supervised, such as a Porter stemmer, to do the matching.
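
As a minimal sketch of that approach (assuming NLTK, as in the other answer): stemming each token with the Porter stemmer maps "Blue jeans" and "Blue jean" onto the same key, while it does not merge an unlikely variant like "blue dean" — though it also won't catch a real typo such as "blue jea".

    # Sketch: group phrases by their stemmed form instead of raw edit distance.
    from collections import defaultdict
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def key(phrase):
        # Lower-case and stem every token, e.g. "Blue jeans" -> "blue jean".
        return " ".join(stemmer.stem(tok) for tok in phrase.lower().split())

    groups = defaultdict(list)
    for phrase in ["Blue jeans", "Blue jean", "blue dean", "Blue Lucent"]:
        groups[key(phrase)].append(phrase)

    print(dict(groups))
    # e.g. {'blue jean': ['Blue jeans', 'Blue jean'],
    #       'blue dean': ['blue dean'], 'blue lucent': ['Blue Lucent']}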

– Has QUIT--Anony-Mousse