I have saved the first 100 results of a Google query (title and description) as a CSV file. It has this format:
Title Description
Spain - Wikipedia Spain is a democracy organised in the form of a parliamentary government under a constitutional monarchy. It is a developed country with the world's fourteenth
You get the idea. I can load this CSV file into Weka successfully. I first apply the NominalToString filter (because the attributes load as Nominal), and then apply StringToWordVector with the following options:
IDFTransform - True
TFTransform - True
normalize - True
outputWordCounts - True
tokenizer - AlphabeticTokenizer
wordsToKeep - 100
More or less. I then get a list of words as attributes. Sometimes I use the NGramTokenizer instead, to get tokens of at least 3 words.
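To make the weighting concrete, here is a minimal Python sketch of what I understand these options to compute. It is not Weka code. As far as I can tell from the filter's documentation, TFTransform maps a word count f to log(1 + f) and IDFTransform multiplies by log(N / df), where N is the number of documents and df is the number of documents containing the word. The function names, the lowercasing, and the way the top words are picked are my own simplifications, and document-length normalization is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Alphabetic tokens only: runs of letters. Lowercasing is an assumption
    # made here for simplicity, not necessarily what Weka does by default.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs, words_to_keep=100):
    """Return (vocab, vectors) with log(1 + f) * log(N / df) weights."""
    counts = [Counter(tokenize(d)) for d in docs]

    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    # Keep the overall most frequent words (a simplification of wordsToKeep).
    total = Counter()
    for c in counts:
        total.update(c)
    vocab = sorted(w for w, _ in total.most_common(words_to_keep))

    n = len(docs)
    vectors = [
        [math.log(1 + c[w]) * math.log(n / df[w]) for w in vocab]
        for c in counts
    ]
    return vocab, vectors


if __name__ == "__main__":
    docs = ["Spain Spain tourism", "Spain football"]
    vocab, vectors = tfidf_vectors(docs)
    for word, weight in zip(vocab, vectors[0]):
        print(word, round(weight, 3))
```

One thing this shows: a word that occurs in every document (here "spain") gets weight 0 from the IDF term, so with results from a single query the query terms themselves contribute nothing to the vectors.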
After that I go to the Cluster tab and choose k-means (SimpleKMeans). This doesn't work very well, as it puts 90% of the instances in one cluster. Or maybe that is correct...
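To show what I mean by "90% in one cluster", this is roughly the check I am doing, written as a small Python sketch. It is a plain textbook k-means, not Weka's SimpleKMeans. The initialization (first k points as centroids), the fixed iteration count, and the helper names are assumptions made for illustration only.

```python
from collections import Counter


def squared_distance(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Return a cluster index for each point (plain Lloyd iterations)."""
    # Deterministic initialization: the first k points become centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Move each centroid to the mean of its members (skip empty clusters).
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def cluster_shares(assignment):
    # Fraction of all points that ended up in each cluster.
    sizes = Counter(assignment)
    return {c: n / len(assignment) for c, n in sorted(sizes.items())}


if __name__ == "__main__":
    points = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0]]
    print(cluster_shares(kmeans(points, k=2)))
```

With my word vectors, the equivalent of `cluster_shares` reports about 0.9 for a single cluster.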
What happens when I choose "Use training set" here, since I don't have any labelled or test data yet? Which option should I use? I want to form clusters that correspond to categories (Tourism, Sports, Economy, ...). Can Weka do that the way Carrot2 does? Or can it at least form meaningful clusters?
Thanks.