
Is there any open-source software toolkit that compares lexical-level similarities among words and groups similar words together? For example, "Blue jean", "Blue jeans", and "blue jea" (misspelled) should be grouped together. I don't need to look for semantic similarity here.

asked by walkman (edited by Has QUIT--Anony-Mousse)

2 Answers


Try the Natural Language Toolkit (NLTK): http://nltk.org/

Here's a rather abstract treatment of the Brown clustering algorithm: http://www.cs.columbia.edu/~cs4705/lectures/brown.pdf

The standard similarity metric between words is the Levenshtein distance (the linked article describes the Damerau–Levenshtein variant, which also counts transpositions): http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
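
As a quick sketch of how that plays out on the phrases from the question (assuming NLTK is installed; the word list is just an illustration, not something from the answer above):

    # Sketch: pairwise Levenshtein (edit) distances with NLTK.
    from nltk.metrics.distance import edit_distance

    words = ["blue jeans", "blue jean", "blue jea", "blue lucent"]

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            print(a, "<->", b, "=", edit_distance(a, b))

    # "blue jeans" <-> "blue jean" and "blue jean" <-> "blue jea" come out at
    # distance 1, while "blue lucent" is several edits away from all of them.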

– john mangual
  • Thanks for answering, and sorry for not being very clear. My question is less about how to calculate the distance between two words and more about how to choose a cut-off value. Say I have a list of phrases: Blue jeans, Blue jean, Blue Lucent. How do I decide that "blue jeans" and "blue jean" should be grouped together, while "blue" gets grouped with "lucent" rather than with "blue jeans"? (Apologies for a poor example….) – walkman Apr 01 '13 at 13:07
  • @walkman You can use standard clustering algorithms like k-means to do so, using the Levenshtein distance (a.k.a. edit distance) as John suggested. – Memming Apr 01 '13 at 14:56
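
Following up on that comment: k-means needs points in a vector space, so one way to work straight from a precomputed edit-distance matrix (a sketch under my own assumptions, not something either answer spells out) is agglomerative hierarchical clustering, where the cut-off the comment asks about becomes the distance threshold used to cut the dendrogram. The word list and the threshold t=2 are hand-picked for illustration:

    # Sketch: group phrases by hierarchical clustering over pairwise edit distances.
    import numpy as np
    from nltk.metrics.distance import edit_distance
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    words = ["blue jeans", "blue jean", "blue jea", "blue lucent"]

    # Symmetric matrix of pairwise Levenshtein distances.
    n = len(words)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = edit_distance(words[i], words[j])

    # Average-linkage clustering on the condensed form of the distance matrix.
    Z = linkage(squareform(dist), method="average")

    # Cut the dendrogram: phrases merged below the threshold share a label.
    labels = fcluster(Z, t=2, criterion="distance")
    for label, word in zip(labels, words):
        print(label, word)

With this threshold the three "blue jean(s)" variants end up in one group and "blue lucent" in another; raising or lowering t is exactly the cut-off decision from the comment thread.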

I believe you are more interested in stemming than in actual clustering (e.g. using the Levenshtein distance): an unsupervised textual-similarity measure is far too likely to produce false positives.

From a lexical similarity point of view,

blue jean
blue dean

are also just one character apart. Yet "blue dean" is a rather unlikely typo for "blue jean".

You really want to use something supervised, such as a Porter stemmer, to do the matching.
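
As a minimal sketch of that approach (assuming NLTK, as in the other answer): stemming each token with the Porter stemmer maps "Blue jeans" and "Blue jean" onto the same key, while it does not merge an unlikely variant like "blue dean" — though it also won't catch a real typo such as "blue jea".

    # Sketch: group phrases by their stemmed form instead of raw edit distance.
    from collections import defaultdict
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def key(phrase):
        # Lower-case and stem every token, e.g. "Blue jeans" -> "blue jean".
        return " ".join(stemmer.stem(tok) for tok in phrase.lower().split())

    groups = defaultdict(list)
    for phrase in ["Blue jeans", "Blue jean", "blue dean", "Blue Lucent"]:
        groups[key(phrase)].append(phrase)

    print(dict(groups))
    # e.g. {'blue jean': ['Blue jeans', 'Blue jean'],
    #       'blue dean': ['blue dean'], 'blue lucent': ['Blue Lucent']}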

– Has QUIT--Anony-Mousse