Well, let's start from reading your data:
clear();
clc();
close all;
opts = detectImportOptions('Qualitative_Bankruptcy.data.txt');
opts.DataLine = 1;
opts.MissingRule = 'omitrow';
opts.VariableNamesLine = 0;
opts.VariableNames = {'IR' 'MR' 'FF' 'CR' 'CO' 'OR' 'Class'};
opts.VariableTypes = repmat({'categorical'},1,7);
opts = setvaropts(opts,'Categories',{'P' 'A' 'N'});
opts = setvaropts(opts,'Class','Categories',{'B' 'NB'});
data = readtable('Qualitative_Bankruptcy.data.txt',opts);
data = rmmissing(data);
data_len = height(data);
Now, since the kmeans
function (reference here) accepts only numeric values, we need to convert a table of categorical
values into a matrix:
x = double(table2array(data));
And finally, we apply the function:
[idx,c] = kmeans(x,number_of_clusters);
Now comes the problem. The k-means clustering
can be performed using a wide variety of distance measures together with a wide variety of options. You have to play with those parameters in order to obtain the clustering that better approximates your available output.
Since k-means clustering
organizes your data into n
clusters, this means that your output defines more than 3
clusters because 46 + 71 + 61 = 178
... and since your data contains 250
observations, 72
of them are assigned to one or more clusters that are unknown to me (and maybe to you too).
If you want to replicate that output, or to find the clustering that better approximate your output... you have to find, if available, an algorithm that minimize the error... or alternatively you can try to brute-force it, for example:
% ...
x = double(table2array(data));
cl1_targ = 46;
cl2_targ = 71;
cl3_targ = 61;
dist = {'sqeuclidean' 'cityblock' 'cosine' 'correlation'};
res = cell(16,3);
res_off = 1;
for i = 1:numel(dist)
dist_curr = dist{i};
for j = 3:6
idx = kmeans(x,j,'Distance',dist_curr); % start parameter needed
cl1 = sum(idx == 1);
cl2 = sum(idx == 2);
cl3 = sum(idx == 3);
err = abs(cl1 - cl1_targ) + abs(cl2 - cl2_targ) + abs(cl3 - cl3_targ);
res(res_off,:) = {dist_curr j err};
res_off = res_off + 1;
end
end
[min_val,min_idx] = min([res{:,3}]);
best = res(min_idx,1:2);
Don't forget to remember that the kmeans
function uses a randomly-chosen starting configuration... so it will end up delivering different solutions for different starting points. Define fixed starting points (means) using the Start
parameter, otherwise a different result will be produced every time your run the kmeans
function.