I have 4000 (continuous) predictor variables in a set of 150 patients. First, the variables which are associated with survival should be identified. I therefore use the multiple testing procedures function (http://svitsrv25.epfl.ch/R-doc/library/multtest/html/MTP.html) with the t-statistic for tests of regression coefficients in Cox proportional hazards survival models to identify significant predictors. This analysis identifies 60 variables which are significantly associated with survival. I then perform unsupervised k-means clustering on these 60 variables with the ConsensusClusterPlus package (https://www.bioconductor.org/packages/release/bioc/html/ConsensusClusterPlus.html), which identifies 3 clusters as the optimal solution based on the CDF curve and the progression graph. If I then perform a Kaplan-Meier survival analysis, I see that each of the three clusters is associated with a distinct survival pattern (short / intermediate / long survival).
My question is the following: Let's assume that I have another set of 50 patients, and I'd like to predict to which of the three clusters each patient most likely belongs. How can I achieve this? Do I need to train a classifier (e.g. with the caret package (topepo.github.io/caret/bytag.html), where the 150 patients with the 60 significant variables form the training set and the algorithm knows which patient was allocated to which of the three clusters) and then validate the classifier in the 50 new patients? And should I then perform a Kaplan-Meier survival analysis to see whether the predicted clusters in the validation set (n=50) are again associated with distinct survival patterns?
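To make the idea concrete, here is a minimal sketch of what I have in mind. It is written in Python purely for illustration (my actual analysis is in R), it uses synthetic data rather than my patients, and it substitutes a simple nearest-centroid rule for whatever classifier caret would actually train. All names (`centroids`, `predict_cluster`, `simulate`) are hypothetical.

```python
import math
import random


def centroids(X, labels):
    """Return the mean feature vector of each cluster, keyed by cluster label."""
    sums, counts = {}, {}
    for row, lab in zip(X, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(row)
            counts[lab] = 0
        sums[lab] = [s + v for s, v in zip(sums[lab], row)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def predict_cluster(row, cents):
    """Assign one patient to the cluster whose centroid is nearest (Euclidean)."""
    return min(cents, key=lambda lab: math.dist(row, cents[lab]))


# Synthetic stand-in data (an assumption, not my real measurements):
# 60 selected variables, three well-separated groups.
random.seed(0)
N_FEATURES = 60


def simulate(n, shift):
    """Draw n hypothetical patients whose variables are centred at `shift`."""
    return [[random.gauss(shift, 1.0) for _ in range(N_FEATURES)] for _ in range(n)]


# Training set: 150 patients with the cluster labels from consensus clustering.
train_X = simulate(50, -2.0) + simulate(50, 0.0) + simulate(50, 2.0)
train_labels = [1] * 50 + [2] * 50 + [3] * 50

# New set: 50 patients measured on the same 60 variables, labels unknown.
new_X = simulate(17, -2.0) + simulate(17, 0.0) + simulate(16, 2.0)

cents = centroids(train_X, train_labels)
predicted = [predict_cluster(row, cents) for row in new_X]
print(predicted)
```

The predicted labels for the 50 new patients would then be used as the grouping variable in the Kaplan-Meier analysis of that validation set.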
Thanks for your help.