Clustering with Weka

Question

I have saved the results of a Google query (title and description), 100 results in total. The file has this format:

Title                Description
Spain - Wikipedia    Spain is a democracy organised in the form of a parliamentary government under a constitutional monarchy. It is a developed country with the world's fourteenth

You get the idea. I load this CSV file into Weka successfully, apply the NominalToString filter first (because the attributes load as nominal), and then apply StringToWordVector with the following options:

IDFTransform - True
TFTransform - True
normalize - True
outputWordCounts - True
tokenizer - AlphabeticTokenizer
wordsToKeep - 100
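
To make clear what these settings do to each word count, here is a minimal standard-library Python sketch, not Weka code. It assumes the formulas given in Weka's option descriptions: TFTransform replaces a count f with log(1 + f), and IDFTransform multiplies by log(number of documents / number of documents containing the word). The function names `tokenize` and `tfidf_vectors` are made up for illustration, lowercasing is an assumption, and document-length normalization and the wordsToKeep cut-off are left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Alphabetic tokens only; lowercasing is an assumption of this sketch.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return (vocabulary, one weight vector per document)."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    vocab = sorted(doc_freq)
    vectors = []
    for c in counts:
        vec = []
        for word in vocab:
            tf = math.log(1 + c[word])                # TFTransform
            idf = math.log(n_docs / doc_freq[word])   # IDFTransform
            vec.append(tf * idf)
        vectors.append(vec)
    return vocab, vectors


# Hypothetical snippets standing in for three search results.
docs = [
    "spain tourism beach",
    "spain football league",
    "spain economy growth",
]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    weights = {w: round(x, 3) for w, x in zip(vocab, vec) if x > 0}
    print(doc, "->", weights)
```

In this toy input the word "spain" appears in every document, so its IDF factor is log(3/3) = 0 and it drops out of every vector. With real search results for a single query, the query terms behave the same way, and short snippets leave only a few non-zero weights per document.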

More or less. I then get a list of words; sometimes I use the NGramTokenizer to get tokens of at least 3 words.

After that I go to the Cluster tab and choose K-means. This doesn't work very well, as it puts 90% of the instances in one cluster. Or maybe that is right...

What happens when I choose "Use training set" here, since I don't have any other data yet? Which option should I use? I want to form clusters that correspond to categories (Tourism, Sports, Economy, ...). Can Weka do that, like Carrot2 does? Or at least form clusters?
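
As far as I understand it, "Use training set" means the clusterer is built on the loaded data and then the same instances are assigned to the resulting clusters, with the share of instances per cluster reported. The following standard-library Python sketch shows that idea with a bare-bones k-means. It is not Weka's implementation: the names `kmeans` and `cluster_report` are hypothetical, and the centroids are seeded with the first k points to keep the run deterministic.

```python
from collections import Counter


def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    # Index of the nearest centroid for every point.
    return [
        min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=20):
    # Deterministic seeding (an assumption of this sketch): first k points.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # "Use training set": label the same points the model was built on.
    return centroids, assign(points, centroids)


def cluster_report(labels):
    # Percentage of instances per cluster.
    sizes = Counter(labels)
    return {c: 100 * n / len(labels) for c, n in sorted(sizes.items())}


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, labels = kmeans(points, k=2)
print(cluster_report(labels))
```

A report of this kind only says how the instances are distributed over the clusters; it does not say whether the clusters correspond to categories.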

Thanks.

Yeah, I did that. It's a little better, but about 84% are still in the same cluster. Maybe my sample is too small? That's assuming what I'm doing is correct... — EricJ, Jul 03 '15 at 12:12
What have you set for `number of clusters`? Increase that to 10 and check — Ramanan, Jul 03 '15 at 12:18
Clustered Instances: 0: 1 (1%), 1: 1 (1%), 2: 89 (91%), ... This is with 10 clusters and 1000 WordsToKeep. — EricJ, Jul 03 '15 at 12:40
Did you use the stemmer and stopwordsHandler options in StringToWordVector? — lanenok, Jul 06 '15 at 20:44
Yeah, the result is the same. Maybe I should use other data. — EricJ, Jul 07 '15 at 07:45
