
The dataframe I am trying to cluster

The dataframe above holds the attributes used to determine whether a person has cancer. The class column indicates whether the person has cancer: class 2 means the person does not have cancer, and class 4 means the person has cancer. When I ran K-means on the dataframe after removing the class and id columns, I got a prediction of 0 or 1 for every row. Now I am confused about whether 0 or 1 corresponds to class 2. How can I figure this out, and how can I check the accuracy of my model?

Antony Joy
  • 2
    I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in the `machine-learning` [tag info](https://stackoverflow.com/tags/machine-learning/info). – desertnaut Mar 26 '21 at 08:07

1 Answer


K-means is a clustering algorithm, not a classifier. That means it does not give you a mapping from the features to the cancer class; it only finds clusters (subsets of related datapoints) in the feature space.

Hence the outputs 0 and 1 are the cluster memberships of each datapoint. The numbering is arbitrary: nothing ties cluster 0 or cluster 1 to class 2 or class 4 in advance.

If you want to check whether the clusters correlate with the cancer classes, do an analysis:

  • How many datapoints in cluster 0 are actually cancer class 2?
  • How many datapoints in cluster 1 are actually cancer class 4?
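
The two counts above can be read off a cross-tabulation of cluster label against class. The following is a minimal sketch using only the Python standard library; the lists `clusters` and `classes` are made-up toy data, not the asker's dataframe. With pandas, `pd.crosstab` on the two columns should give the same table.

```python
from collections import Counter

# Hypothetical toy data (not from the question): k-means cluster labels and
# the true classes (2 = no cancer, 4 = cancer), aligned row by row.
clusters = [0, 0, 1, 1, 1, 0, 1, 1]
classes = [4, 4, 2, 2, 2, 4, 4, 2]

# Count how often each (cluster, class) pair occurs.
pair_counts = Counter(zip(clusters, classes))

# Print one line per cell of the cluster-vs-class table.
for cluster in sorted(set(clusters)):
    for cls in sorted(set(classes)):
        print(f"cluster {cluster}, class {cls}: {pair_counts[(cluster, cls)]}")
```

If most of a cluster's rows fall into one class, that cluster can reasonably be read as standing for that class.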

Also take a look at the confusion matrix, which is the standard way to evaluate this kind of problem.

Assuming cluster 0 turns out to correspond to cancer class 4, your confusion matrix should look like this:

+-----------------+-----------------------+-----------------------+
|                 | actual cancer class 4 | actual cancer class 2 |
+-----------------+-----------------------+-----------------------+
| k-Means class 0 | true positive         | false positive        |
| k-Means class 1 | false negative        | true negative         |
+-----------------+-----------------------+-----------------------+
  • true positive: algorithm predicted cancer and person actually has cancer
  • false positive: algorithm predicted cancer but person does not have cancer
  • false negative: algorithm predicted no cancer but person actually has cancer
  • true negative: algorithm predicted no cancer and person does not have cancer
  1. Take only the datapoints that are in cluster 0 and count how many of them have cancer class 4 -> these are your true positives.
  2. Again take only the datapoints that are in cluster 0 and count how many of them have cancer class 2 -> these are your false positives.
  3. Repeat with cluster 1 for the negatives: class 4 gives the false negatives, class 2 the true negatives.

Accuracy can be calculated using this formula: acc = (TP+TN) / (TP+FP+FN+TN)
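
The counting steps and the accuracy formula can be combined into one small routine. This is only a sketch under stated assumptions: the function name `cluster_accuracy`, the majority-vote rule for deciding which cluster stands for which class, and the toy data are all illustrative and not part of the original answer.

```python
from collections import Counter

def cluster_accuracy(clusters, classes, positive_class=4):
    """Map each cluster to its majority class, then count TP/FP/FN/TN."""
    pair_counts = Counter(zip(clusters, classes))

    # Assumption: a cluster "means" whichever class is most common inside it.
    mapping = {}
    for cluster in set(clusters):
        per_class = {cls: n for (c, cls), n in pair_counts.items() if c == cluster}
        mapping[cluster] = max(per_class, key=per_class.get)

    # Translate cluster labels into predicted classes.
    predicted = [mapping[c] for c in clusters]
    pairs = list(zip(predicted, classes))

    tp = sum(p == positive_class and a == positive_class for p, a in pairs)
    fp = sum(p == positive_class and a != positive_class for p, a in pairs)
    fn = sum(p != positive_class and a == positive_class for p, a in pairs)
    tn = sum(p != positive_class and a != positive_class for p, a in pairs)

    acc = (tp + tn) / (tp + fp + fn + tn)
    return mapping, (tp, fp, fn, tn), acc

# Hypothetical toy data: 2 = no cancer, 4 = cancer.
clusters = [0, 0, 1, 1, 1, 0, 1, 1]
classes = [4, 4, 2, 2, 2, 4, 4, 2]

mapping, counts, acc = cluster_accuracy(clusters, classes)
print(mapping)  # {0: 4, 1: 2}
print(counts)   # (3, 0, 1, 4)
print(acc)      # 0.875
```

Note that majority voting can assign the same class to both clusters when the clustering does not separate the classes; in that case the accuracy figure says little about the clusters themselves.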

Sparkofska
  • I have checked the counts. Predicted: 1 -> 451, 0 -> 232. Original: 2 -> 444, 4 -> 239. So doesn't this mean that a person having the disease is denoted by 0 and a person not having the disease is denoted by 1? – Antony Joy Mar 26 '21 at 07:57
  • And how are you suggesting to use the confusion matrix for checking accuracy? – Antony Joy Mar 26 '21 at 07:58
  • 1
    To obtain the correlation, you have to count the cancer class 4 datapoints within k-means cluster 1, instead of comparing the total values. See my edit for details – Sparkofska Mar 26 '21 at 08:20
  • 1
    Take some time and try to deeply understand how a confusion matrix works. If you get the theory, you will be able to apply it to your specific problem. – Sparkofska Mar 26 '21 at 08:22