I have 4000 (continuous) predictor variables in a set of 150 patients. First, variables that are associated with survival should be identified. I therefore use the multiple testing procedure function MTP (http://svitsrv25.epfl.ch/R-doc/library/multtest/html/MTP.html) with the t-statistic for tests of regression coefficients in Cox proportional hazards survival models to identify significant predictors. This analysis identifies 60 parameters that are significantly associated with survival. I then perform unsupervised k-means clustering with the ConsensusClusterPlus package (https://www.bioconductor.org/packages/release/bioc/html/ConsensusClusterPlus.html), which identifies 3 clusters as the optimal solution based on the CDF curve and progression graph. If I then perform a Kaplan-Meier survival analysis, I see that each of the three clusters is associated with a distinct survival pattern (short / intermediate / long survival).
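For readers unfamiliar with the clustering step: the idea can be illustrated language-neutrally. The sketch below is a minimal Lloyd's-algorithm k-means in plain Python — the toy 2-D data, the choice of `k`, and the implementation are my own assumptions for illustration, not the actual ConsensusClusterPlus pipeline used above.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's k-means: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return centroids, labels

# Toy 2-D data standing in for the 150 patients x 60 predictors.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
centroids, labels = kmeans(points, k=3)
print(labels)
```

In the real analysis the consensus-clustering machinery adds resampling on top of this basic procedure, but the fitted object still boils down to centroids and per-patient cluster labels.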

The question that I now have is the following: let's assume that I have another set of 50 patients for which I'd like to predict to which of the three clusters each patient most likely belongs. How can I achieve this? Do I need to train a classifier (e.g. with the caret package (topepo.github.io/caret/bytag.html), where the 150 patients with the 60 significant parameters are in the training set and the algorithm knows which patient was allocated to which of the three clusters) and validate the classifier on the 50 new patients? And then perform a Kaplan-Meier survival analysis to see whether the predicted clusters in the validation set (n=50) are again associated with a distinct survival pattern?

Thanks for your help.

user86533

2 Answers

The answer is much simpler. You already have your k-means model with 3 clusters. Each cluster is identified by its centroid (a point in your 60-dimensional space). In order to "classify" a new point, you just measure the Euclidean distance to each of these three centroids and select the cluster whose centroid is closest. That's all. It follows directly from the fact that k-means gives you a partitioning of the whole space, not just of your training set.
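A minimal sketch of this nearest-centroid rule (in Python rather than R; the toy centroids and patient vector are invented for illustration — in R, the fitted centroids of a base `kmeans` object are available as its `$centers` component):

```python
import math

def assign_cluster(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean)."""
    return min(range(len(centroids)),
               key=lambda j: math.dist(point, centroids[j]))

# Hypothetical 2-D centroids; in the real analysis these would be the
# three 60-dimensional centroids from the fitted k-means model.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
new_patient = (4.2, 4.8)
print(assign_cluster(new_patient, centroids))  # -> 1 (closest to (5, 5))
```

Each of the 50 new patients, expressed in the same 60 selected predictors (with the same scaling as the training data), can be assigned this way.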

lejlot
  • OK, thank you for your feedback. I've just discovered another posting where a similar question was asked: http://stackoverflow.com/questions/22300830/can-k-means-clustering-do-classification - your solution most likely corresponds to option #2 in the other posting. However, option #3 (what I mentioned in my posting) is, as far as I understand, also a viable solution? – user86533 Nov 09 '15 at 23:04
  • In short, you can do anything; however, building a classifier in order to mimic clustering is pointless, as the clustering is itself an optimal classifier under this criterion. – lejlot Nov 09 '15 at 23:35

My advice is to create a predictive model, such as a random forest, using the cluster number as the outcome. It will lead to better results than predicting using the distances to the cluster centroids.

The reasons are several, but consider that a predictive model is specialized for such a task; for example, it will keep and weight the reliable variables (whereas in the clustering every variable counts the same).
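To sketch this workflow — cluster labels from the training set used as the outcome of a supervised model — here is a minimal Python example, with a trivial 1-nearest-neighbour classifier standing in for the random forest; the toy feature vectors and labels are invented for illustration:

```python
import math

def fit_1nn(X_train, y_train):
    """Return a predict function: label of the nearest training point."""
    def predict(x):
        i = min(range(len(X_train)),
                key=lambda j: math.dist(x, X_train[j]))
        return y_train[i]
    return predict

# Training set: patient feature vectors with their k-means cluster labels.
X_train = [(0.1, 0.2), (0.0, 0.1), (5.1, 4.9), (5.0, 5.2), (9.8, 0.1)]
y_train = [0, 0, 1, 1, 2]

clf = fit_1nn(X_train, y_train)
print(clf((4.8, 5.1)))  # -> 1
```

In R, the same idea would be a `caret::train()` call with `method = "rf"` on the 150 labelled patients, followed by `predict()` on the 50 new ones, after which the Kaplan-Meier analysis on the predicted clusters can proceed as the question describes.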

Pablo Casas