I'm trying to perform fuzzy clustering using WEKA (both the GUI and Java code). My data set has two fields, id and string. I would like to cluster on the strings and get, for each string, an array with the probabilities of belonging to each cluster.
I wrote code that defines a filter with all the required properties (the same ones I use in the WEKA GUI), clusters the data using EM, and then prints the distributions using clusterer.distributionForInstance(filteredData...). It runs and prints output, but for each entry it assigns 1 to one of the clusters and 0 to all the others. Could you tell me what the problem might be?
I am attaching a snippet of my code for reference:
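import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.stopwords.Rainbow;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// the statements below run inside main(String[] args) throws Exception
// load the training and test sets
Instances train = DataSource.read(args[0]);
Instances test = DataSource.read(args[1]);
// configure the filter; all options are set before calling setInputFormat
StringToWordVector filter = new StringToWordVector();
filter.setTFTransform(true);
filter.setLowerCaseTokens(true);
filter.setAttributeIndices("last");
filter.setStopwordsHandler(new Rainbow());
filter.setInputFormat(train);
// the dictionary is built from the training data and reused for the test data
Instances filteredData = Filter.useFilter(train, filter);
Instances testFilteredData = Filter.useFilter(test, filter);
// build clusterer
EM clusterer = new EM();
clusterer.buildClusterer(filteredData);
// print each test instance followed by its cluster membership distribution
for (int i = 0; i < testFilteredData.numInstances(); i++) {
    double[] dist = clusterer.distributionForInstance(testFilteredData.instance(i));
    System.out.print(testFilteredData.instance(i));
    System.out.print(Utils.arrayToString(dist));
    System.out.println();
}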
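To make "1 for one cluster and 0 for the others" measurable, the dist array from the loop above could be passed to a small diagnostic like the one below. This is only a sketch: MembershipCheck, entropy and isHard are hypothetical names, not part of WEKA, and the tolerance eps is an arbitrary choice.

```java
final class MembershipCheck {
    /** Shannon entropy in nats; 0.0 means all mass sits on a single cluster. */
    static double entropy(double[] dist) {
        double h = 0.0;
        for (double p : dist) {
            if (p > 0.0) {
                h -= p * Math.log(p); // zero entries contribute nothing
            }
        }
        return h;
    }

    /** True when the largest membership is within eps of 1. */
    static boolean isHard(double[] dist, double eps) {
        double max = 0.0;
        for (double p : dist) {
            max = Math.max(max, p);
        }
        return max >= 1.0 - eps;
    }

    public static void main(String[] args) {
        double[] hard = {1.0, 0.0, 0.0}; // the kind of output described in the question
        double[] soft = {0.6, 0.3, 0.1}; // the kind of output that was expected
        System.out.println(entropy(hard) + " " + isHard(hard, 1e-9));
        System.out.println(entropy(soft) + " " + isHard(soft, 1e-9));
    }
}
```

Counting how many test instances come back as hard would show whether every distribution is degenerate or only most of them.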
The procedure works perfectly well when no TF-IDF transform is applied. It also works perfectly if clustering is replaced with classification (after adding the class attribute, of course).
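For context, here is a minimal sketch, independent of WEKA, of how mixture-model memberships are commonly obtained: per-cluster log scores (log prior plus log density) are normalized into probabilities. When the gaps between the scores are large, which can happen when many word-vector attributes each add to the log density, the normalized values collapse to 1 and 0 in double precision. Whether this is what happens inside EM for this data is an assumption, and PosteriorSketch, normalize and the numbers used are hypothetical.

```java
import java.util.Arrays;

final class PosteriorSketch {
    /** Turns per-cluster log scores into probabilities (log-sum-exp normalization). */
    static double[] normalize(double[] logScore) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logScore) {
            max = Math.max(max, v);
        }
        double[] p = new double[logScore.length];
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            p[i] = Math.exp(logScore[i] - max); // subtracting max avoids underflow of every term
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    public static void main(String[] args) {
        // small gaps between clusters: memberships stay soft
        System.out.println(Arrays.toString(normalize(new double[] {-10.0, -10.5, -11.0})));
        // large gaps between clusters: memberships collapse to 1 and 0
        System.out.println(Arrays.toString(normalize(new double[] {-1000.0, -1800.0, -2500.0})));
    }
}
```

If this effect is involved, the 0/1 output would reflect very confident assignments in a high-dimensional space rather than a failure of distributionForInstance itself.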