Clustering words by using numpy and nltk or CLUTO in Python programming

Question

I am trying to cluster some words.
Part of my data is shown below (it's just an example).

         cat  dog  horse  ostrich
cat      8    2.3  3.4    4.7
dog      7    8    3      2.4
horse    3.4  2.5  8      1.5
ostrich  3.4  3.2  4.4    8

A bigger number means a higher similarity between the two words. Based on data in this format, I want to make clusters (for example, (cat, dog), (horse), (ostrich): 3 clusters in total).

At first, I tried to use CLUTO to make some clusters and a (very beautiful) graph like the one below. [image]

But I couldn't. I have already read the manuals, but they are not easy to understand. So I tried some of the clustering algorithms in nltk, such as k-means, but I don't know how to create a graph like the one above. (I also have to make the clusters based on the input data.)

I don't really understand what you want. Are you asking for someone to give you a tutorial? I recommend you try again with the docs, and come back with some real, specific questions. — Raydel Miranda, Dec 26 '13 at 13:51

score 1 · Answer 1 · answered Dec 26 '13 at 14:17

The image you present is of a hierarchical clustering (a dendrogram). Unlike "typical" cluster analysis, it shows not one way of clustering the data but a nested sequence of clusterings, one for each possible number of clusters. You get one "cluster set" by drawing an arbitrary horizontal line across the hierarchy image: each branch that the line intersects is one cluster.

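The K-means algorithm, OTOH, requires you to provide the number of clusters up front, so you can't generate a hierarchy from it. NLTK's clustering module does include an agglomerative clusterer (nltk.cluster.GAAClusterer), but it expects feature vectors rather than a ready-made similarity matrix like yours.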
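
To make the point about K-means concrete, here is a small NumPy sketch of the plain Lloyd iteration. It is not NLTK's implementation. The `kmeans` helper is hypothetical, it assumes that each row of the similarity matrix can be used as a feature vector for its word, and it simply takes the first `k` rows as starting centroids.

```python
import numpy as np

# Assumption: each row of the question's matrix is used as a feature vector.
rows = np.array([
    [8.0, 2.3, 3.4, 4.7],  # cat
    [7.0, 8.0, 3.0, 2.4],  # dog
    [3.4, 2.5, 8.0, 1.5],  # horse
    [3.4, 3.2, 4.4, 8.0],  # ostrich
])


def kmeans(points, k, n_iter=100):
    """Hypothetical helper: Lloyd's algorithm, first k rows as initial centroids."""
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if it has none).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels


# k is a required input, and each call returns one flat labelling, not a tree.
for k in (2, 3):
    print(k, kmeans(rows, k).tolist())
```

Each value of `k` is a separate run, and the result also depends on the starting centroids, so the labellings for different `k` need not nest inside one another the way the levels of a hierarchy do.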

You should probably familiarize yourself with the basic clustering concepts before deciding what output you want.

