How to check the accuracy of k-means clustering in python? How to know what the predicted variables represent in k-means algorithm?

Question

The above dataframe represents the attributes used to determine whether a person has cancer or not. The class column represents whether the person has cancer: class 2 means the person does not have cancer, and 4 means the person has cancer. When I ran K-means on the dataframe after removing class and id, I got predictions of 0 or 1 for all the rows. But now I am confused about whether 0 or 1 is equivalent to 2. How do I figure this out, and how do I check the accuracy of my model?

I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in the `machine-learning` [tag info](https://stackoverflow.com/tags/machine-learning/info). — desertnaut, Mar 26 '21 at 08:07

Sparkofska · Accepted Answer · 2021-03-26T08:23:37.903

The K-Means algorithm is not a classifier but a clustering algorithm, which means it does not give you a mapping from the features to the cancer class. It only finds clusters (subsets of related datapoints) in the feature space.

Hence the outputs 0/1 are the cluster memberships of each datapoint, not cancer classes.

If you want to check whether the clusters correlate to the cancer classes, do an analysis:

How many datapoints in cluster 0 are actually cancer class 2?
How many datapoints in cluster 1 are actually cancer class 4?
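
Both counts can be read off a cross-tabulation of cluster label against true class. Here is a minimal sketch in plain Python, assuming the k-means output and the class column are available as two equal-length lists. The names `cluster_labels`, `true_classes`, and `contingency`, as well as the toy values, are made up for illustration:

```python
from collections import Counter

# Hypothetical toy data: cluster ids from k-means and the true class column.
cluster_labels = [0, 0, 1, 1, 1, 0, 1, 1]
true_classes = [4, 4, 2, 2, 2, 2, 4, 2]


def contingency(clusters, classes):
    """Count how often each (cluster, class) pair occurs."""
    return Counter(zip(clusters, classes))


counts = contingency(cluster_labels, true_classes)

# Print one line per (cluster, class) combination.
for (cluster, cls), n in sorted(counts.items()):
    print(f"cluster {cluster}, class {cls}: {n}")
```

If the data lives in a pandas DataFrame, `pd.crosstab(df["cluster"], df["class"])` should give the same table in one call (the column names here are hypothetical).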

Also take a look at the confusion matrix for information on how to evaluate this kind of problem.

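Assuming cluster 0 turns out to be the one that corresponds to cancer (k-means numbers its clusters arbitrarily), your confusion matrix should look like this: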

+-----------------+-----------------------+-----------------------+
|                 | actual cancer class 4 | actual cancer class 2 |
+-----------------+-----------------------+-----------------------+
| k-Means class 0 | true positive         | false positive        |
| k-Means class 1 | false negative        | true negative         |
+-----------------+-----------------------+-----------------------+

true positive: algorithm predicted cancer and person actually has cancer
false positive: algorithm predicted cancer but person does not have cancer
false negative: algorithm predicted no cancer but person actually has cancer
true negative: algorithm predicted no cancer and person does not have cancer

Take only the datapoints that are in cluster 0 and count how many of them have cancer class 4: these are your true positives.
Now take the same datapoints in cluster 0 and count how many of them have cancer class 2: these are your false positives.
Repeat for the negatives.
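
The counting steps above could be written roughly as follows. This is only a sketch: it assumes cluster 0 is the one treated as "cancer" (as in the table), and the function name, its parameters, and the toy data are my own inventions, not part of any library.

```python
def confusion_counts(clusters, classes, positive_cluster=0, positive_class=4):
    """Return (TP, FP, FN, TN), treating `positive_cluster` as 'predicted cancer'
    and `positive_class` as 'actually has cancer'."""
    tp = fp = fn = tn = 0
    for cluster, cls in zip(clusters, classes):
        predicted_positive = cluster == positive_cluster
        actually_positive = cls == positive_class
        if predicted_positive and actually_positive:
            tp += 1
        elif predicted_positive and not actually_positive:
            fp += 1
        elif not predicted_positive and actually_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical toy data: cluster ids from k-means and the true class column.
cluster_labels = [0, 0, 1, 1, 1, 0, 1, 1]
true_classes = [4, 4, 2, 2, 2, 2, 4, 2]

tp, fp, fn, tn = confusion_counts(cluster_labels, true_classes)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```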

Accuracy can be calculated using this formula: acc = (TP+TN) / (TP+FP+FN+TN)
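
Since k-means numbers its clusters arbitrarily, one option is to compute this accuracy under both possible cluster-to-class assignments and keep the better one. A sketch under that assumption, with hypothetical names and toy data:

```python
from itertools import permutations


def accuracy(clusters, classes, cluster_to_class):
    """Fraction of datapoints whose cluster maps to their true class,
    which equals (TP + TN) / (TP + FP + FN + TN)."""
    hits = sum(cluster_to_class[c] == y for c, y in zip(clusters, classes))
    return hits / len(classes)


# Hypothetical toy data: cluster ids from k-means and the true class column.
cluster_labels = [0, 0, 1, 1, 1, 0, 1, 1]
true_classes = [4, 4, 2, 2, 2, 2, 4, 2]

# Try both ways of assigning cluster ids {0, 1} to classes {2, 4}.
results = {}
for assigned in permutations([2, 4]):
    mapping = dict(zip([0, 1], assigned))
    results[assigned] = accuracy(cluster_labels, true_classes, mapping)

best = max(results, key=results.get)
print(f"best mapping: 0 -> {best[0]}, 1 -> {best[1]}, accuracy = {results[best]}")
```

Note that picking the better mapping uses the true classes, so the number tells you how well the clusters line up with the classes rather than how well a trained classifier would predict them.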

I have checked the counts. Predicted: 1 -> 451, 0 -> 232. Original: 2 -> 444, 4 -> 239. So doesn't this mean that a person having the disease is denoted by 0 and a person not having the disease is denoted by 1? — Antony Joy, Mar 26 '21 at 07:57
And how are you suggesting to use the confusion matrix for checking accuracy? — Antony Joy, Mar 26 '21 at 07:58
To obtain the correlation, you have to count the cancer class 4 datapoints within k-means cluster 1, instead of comparing the total values. See my edit for details. — Sparkofska, Mar 26 '21 at 08:20
Take some time and try to deeply understand how confusion matrix works. If you get the theory you will be able to apply it to your specific problem. — Sparkofska, Mar 26 '21 at 08:22
