I'm trying to reduce the input data size by first performing K-means clustering in R and then sampling 50-100 samples per cluster for downstream classification and feature selection.
The original dataset was split 80/20, and the 80% portion went into K-means training. The input data has 2 label columns and 110 columns of numeric variables. From the label column, I know there are 7 different drug treatments. In parallel, I used the elbow method to find the optimal K for the number of clusters, and it is around 8. I picked 10 so that I have more clusters to sample from downstream.
Now that I have finished running model <- Kmeans(), the output list has me a little confused about what to do next. Since I had to scale only the numeric variables before passing them to the kmeans function, the output cluster membership doesn't carry the treatment labels anymore. This I can overcome by appending the cluster membership to the original training data table.
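To make the appending step concrete, here is a minimal sketch. It assumes base R's kmeans() (whose result has $cluster and $centers), which may not match the function I actually used, and it uses the built-in iris data as a stand-in for my training table, with Species playing the role of the treatment label and 3 clusters instead of 10.

```r
# Stand-in data: iris replaces the real training set (assumption)
set.seed(42)
training_set <- iris

# Scale only the numeric variables, leaving label columns out
numeric_cols <- sapply(training_set, is.numeric)
scaled <- scale(training_set[, numeric_cols])

# Base R k-means; 3 centres here purely for illustration
model <- kmeans(scaled, centers = 3, nstart = 25)

# kmeans() keeps the input row order, so the membership vector
# lines up one-to-one with the rows of the original table
training_set$cluster <- model$cluster

# Cross-tabulate labels against clusters to see how they relate
table(training_set$Species, training_set$cluster)
```

If the clustering function returns membership under a different name, only the `model$cluster` line should need to change.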
Then, for the 10 centroids, how do I find out what their labels are? I can't just do
training_set$centroids <- model$centroids
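One possible workaround, sketched under assumptions: a centroid is a point in the scaled feature space rather than a row of the data, so it has no label of its own, but each cluster could be given the most common label among its members. The sketch again assumes base R's kmeans() and uses iris as stand-in data; `centroid_labels`, `majority_label`, and `purity` are names made up for illustration.

```r
# Stand-in data and model (assumptions: iris, base R kmeans, 3 clusters)
set.seed(42)
training_set <- iris
scaled <- scale(training_set[, 1:4])
model <- kmeans(scaled, centers = 3, nstart = 25)
training_set$cluster <- model$cluster

# Rows are clusters, columns are labels
tab <- table(training_set$cluster, training_set$Species)

# For each cluster, take the most frequent label and the share of
# cluster members that carry it
centroid_labels <- data.frame(
  cluster        = as.integer(rownames(tab)),
  majority_label = colnames(tab)[apply(tab, 1, which.max)],
  purity         = apply(tab, 1, max) / rowSums(tab)
)
centroid_labels
```

A low purity value would suggest that the cluster mixes several treatments, in which case a single label per centroid may not be meaningful.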
And the most important question: how do I find the 100 samples per cluster that are closest to their respective centroid? I have seen one post here for Python, "Output 50 samples closest to each cluster center using scikit-learn.k-means library", but no R resources yet. Any pointers?
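For reference, this is roughly the behaviour I am after, as a sketch under assumptions rather than a tested solution for my data: it assumes base R's kmeans() (centres in $centers), Euclidean distance in the same scaled space used for clustering, iris as stand-in data, and a hypothetical `n_per_cluster` of 10 instead of 50-100.

```r
# Stand-in data and model (assumptions: iris, base R kmeans, 3 clusters)
set.seed(42)
training_set <- iris
scaled <- scale(training_set[, 1:4])
model <- kmeans(scaled, centers = 3, nstart = 25)
training_set$cluster <- model$cluster

# Look up, for every row, the centre of the cluster it was assigned to
own_centre <- model$centers[model$cluster, , drop = FALSE]

# Euclidean distance from each row to its own cluster centre
training_set$dist_to_centre <- sqrt(rowSums((scaled - own_centre)^2))

n_per_cluster <- 10  # hypothetical; 50-100 on the real data

# Within each cluster, sort by distance and keep the nearest rows;
# head() simply returns fewer rows if a cluster is smaller than n
closest <- do.call(rbind, lapply(
  split(training_set, training_set$cluster),
  function(d) head(d[order(d$dist_to_centre), ], n_per_cluster)
))

table(closest$cluster)
```

Because the rows of `closest` come from the original table, the treatment labels stay attached to the selected samples.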