
We are required to build a fuzzy system in MATLAB on the Qualitative_Bankruptcy Data Set, and we were advised to implement a fuzzy clustering method on it.

The dataset has 250 instances and 7 attributes (6 independent attributes plus 1 class attribute). Each independent attribute has 3 possible values: Positive (P), Average (A), and Negative (N). Please refer to the dataset for details.

From our understanding, clustering is about grouping instances that exhibit similar properties by calculating the distances between their parameters, so the data could be grouped along those lines. [The image originally attached here showed only dummy data, not relevant to this project.]

The question is: how is it possible to implement a cluster analysis on a dataset like this?

P,P,A,A,A,P,NB
N,N,A,A,A,N,NB
A,A,A,A,A,A,NB
P,P,P,P,P,P,NB
N,N,N,A,N,A,B
N,N,N,P,N,N,B
N,N,N,N,N,P,B
N,N,N,N,N,A,B
Rex Low

2 Answers


Well, let's start by reading your data:

clear();
clc();
close all;

% Build import options for the raw comma-separated file.
opts = detectImportOptions('Qualitative_Bankruptcy.data.txt');
opts.DataLine = 1;                 % data starts at line 1 (property renamed to 'DataLines' in newer releases)
opts.MissingRule = 'omitrow';      % drop rows with missing values on import
opts.VariableNamesLine = 0;        % the file has no header line
opts.VariableNames = {'IR' 'MR' 'FF' 'CR' 'CO' 'OR' 'Class'};
opts.VariableTypes = repmat({'categorical'},1,7);
opts = setvaropts(opts,'Categories',{'P' 'A' 'N'});      % default categories for every variable
opts = setvaropts(opts,'Class','Categories',{'B' 'NB'}); % the class variable has its own categories

data = readtable('Qualitative_Bankruptcy.data.txt',opts);
data = rmmissing(data);            % remove any remaining rows with missing values
data_len = height(data);

Now, since the kmeans function (see the MATLAB documentation) accepts only numeric values, we need to convert the table of categorical values into a matrix:

x = double(table2array(data));
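To see what this conversion produces, here is a small illustrative sketch, independent of the actual data file: double applied to a categorical array returns the category indices, so with the category order {'P' 'A' 'N'} defined above, P maps to 1, A to 2, and N to 3.

% Categorical values become their category indices, following the
% order given in 'Categories' ({'P' 'A' 'N'} -> 1, 2, 3).
c = categorical({'P' 'A' 'N' 'A'},{'P' 'A' 'N'});
double(c)    % -> [1 2 3 2]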

And finally, we apply the function (number_of_clusters must be set beforehand):

[idx,c] = kmeans(x,number_of_clusters);

Now comes the problem. The kmeans function supports a wide variety of distance measures together with a wide variety of options, and you have to experiment with those parameters in order to obtain the clustering that best approximates your available output.

Since k-means partitions your data into exactly k clusters, your output must define more than 3 clusters: 46 + 71 + 61 = 178, and since your data contains 250 observations, the remaining 72 are assigned to one or more clusters that are unknown to me (and maybe to you too).

If you want to replicate that output, or to find the clustering that best approximates it, you have to find (if one exists) an algorithm that minimizes the error... or alternatively you can try to brute-force it, for example:

% ...

x = double(table2array(data));

% Target cluster sizes to reproduce.
cl1_targ = 46;
cl2_targ = 71;
cl3_targ = 61;

% Distance measures supported by kmeans.
dist = {'sqeuclidean' 'cityblock' 'cosine' 'correlation'};

res = cell(16,3);      % 4 distances x 4 cluster counts = 16 runs
res_off = 1;

for i = 1:numel(dist)
    dist_curr = dist{i};

    for j = 3:6
        idx = kmeans(x,j,'Distance',dist_curr); % 'Start' parameter needed for reproducibility

        cl1 = sum(idx == 1);
        cl2 = sum(idx == 2);
        cl3 = sum(idx == 3);

        % Total absolute deviation from the target sizes.
        err = abs(cl1 - cl1_targ) + abs(cl2 - cl2_targ) + abs(cl3 - cl3_targ);

        res(res_off,:) = {dist_curr j err};
        res_off = res_off + 1;
    end
end

% Pick the distance/cluster-count pair with the smallest error.
[min_val,min_idx] = min([res{:,3}]);
best = res(min_idx,1:2);

Also keep in mind that the kmeans function uses a randomly chosen starting configuration, so it can deliver a different solution for each run. Define fixed starting points (means) using the Start parameter, otherwise a different result may be produced every time you run kmeans.
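As a minimal sketch of what pinning down the starting configuration could look like (the seed value, cluster count, and choice of seed rows below are arbitrary assumptions, not part of the original answer):

% Option 1: fix the global random seed so the default random
% initialization becomes reproducible across runs.
rng(1);
idx = kmeans(x,3,'Distance','sqeuclidean','Replicates',10);

% Option 2: pass explicit starting centroids via the 'Start'
% parameter (here: three arbitrarily chosen rows of x as seeds).
start_means = x([1 50 100],:);
idx2 = kmeans(x,3,'Start',start_means);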

Tommaso Belluzzo

By asking about fuzzy clustering while expecting crisp group sizes, you are contradicting yourself.

In fuzzy clustering, every object belongs to every cluster, just to a varying degree (the cluster assignment is "fuzzy").
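For reference, MATLAB's Fuzzy Logic Toolbox provides an fcm function that returns exactly such membership degrees; a minimal sketch, assuming the numerically encoded matrix x from the other answer and an arbitrary choice of 3 clusters:

% Fuzzy c-means on the numerically encoded data (requires the
% Fuzzy Logic Toolbox). Each column of U holds the membership
% degrees of one observation across all clusters, summing to 1.
[centers,U] = fcm(x,3);       % 3 clusters, default options
[~,hard_idx] = max(U);        % crisp labels, if you need them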

It's mostly used with numerical data, where you can assume the measurements are not precise either, but come with a fuzzy error too. So I don't think it makes as much sense on categorical data.

Now categorical data tends to cluster really badly beyond counting duplicates. It just has too coarse a resolution. People do all kinds of crazy hacks like running k-means on dummy variables, and never seem to question what they actually compute/optimize by doing this. Nor test their result...
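For completeness, the dummy-variable hack mentioned above would look roughly like this in MATLAB (shown only to illustrate the practice being criticized, not as a recommendation; the variable names and the cluster count are assumptions):

% One-hot encode each categorical predictor with dummyvar, then run
% k-means on the resulting 0/1 matrix. As noted above, it is unclear
% what this actually optimizes for categorical data.
cats = table2array(data(:,1:6));            % the 6 categorical predictors
dummies = [];
for k = 1:size(cats,2)
    dummies = [dummies dummyvar(cats(:,k))]; %#ok<AGROW>
end
idx = kmeans(dummies,3);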

Has QUIT--Anony-Mousse
  • Thanks for the input. I was asking this because my professor looked at my data and suggested we use FCM, but I am not really sure how to implement FCM on categorical data. – Rex Low Dec 29 '17 at 05:03
  • Since you have a labeled *classification* data set, I would not use any clustering at all. – Has QUIT--Anony-Mousse Dec 29 '17 at 08:37
  • Thanks for your opinions. I will now close this thread and accept your answer, as it is clear our direction for solving the problem was wrong. Will have to discuss further with my supervisor. Thanks! – Rex Low Dec 29 '17 at 14:46