In an earlier post I used the kmeans_k option of the pheatmap package iteratively to reduce my rows (genes) from 90 to a stricter subset. I did this because the optimal number of k-means clusters suggested for the rows by the factoextra, cluster and NbClust packages was quite low. So I ran kmeans_k repeatedly on my data (90 rows, 15 columns), with row and column clustering switched on, using correlation distance for the columns and the default for the rows.
This made me think that the clusters are already ranked. Is it true that pheatmap ranks the clusters, so that cluster 1 is the top cluster? I was selecting the top clusters from the output, and since my data contained both up- and down-regulated genes, I treated the cluster with the highest SD as the top-ranked one. Was that correct?
Now I have separated my list into up- and down-regulated genes and computed the optimal number of k-means clusters for each, and I get better clusters. Since I am now plotting two separate heatmaps, one per direction, how do I select the top cluster in each? Should I compute the SD for each cluster?
Previous post link
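Code for separating the genes by direction:
# `90.df` needs backticks because the name starts with a digit
o.90.df <- `90.df`[order(`90.df`$logFC), ]
# up-regulated genes
ind <- which(o.90.df$logFC > 1)
up.o.90.df <- o.90.df[ind, ]
# down-regulated genes
ind <- which(o.90.df$logFC < 1)
down.o.90.df <- o.90.df[ind, ]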
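To make the split concrete, here is a minimal sketch on made-up data. The table `toy.df` and its values are hypothetical, and the sketch assumes that "down" means logFC below -1. The down-regulated cut-off of 1 used above only gives the same result if the list has already been filtered to an absolute logFC above 1.

```r
# Toy differential-expression table (hypothetical values, not my real data)
toy.df <- data.frame(
  logFC = c(2.3, -1.8, 1.2, -2.5, 3.1, -1.1),
  row.names = paste0("gene", 1:6)
)

# Order by fold change, then split by the direction of the change
o.toy.df    <- toy.df[order(toy.df$logFC), , drop = FALSE]
up.toy.df   <- o.toy.df[o.toy.df$logFC >  1, , drop = FALSE]
down.toy.df <- o.toy.df[o.toy.df$logFC < -1, , drop = FALSE]

# Gene names per direction, usable for subsetting an expression matrix
genes.up.names   <- rownames(up.toy.df)
genes.down.names <- rownames(down.toy.df)
```

The `drop = FALSE` keeps the result a data frame even when only one column is present, so the row names survive the subsetting.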
Next I build the data frames on which the optimal number of clusters is computed. The values are taken from this source data frame:
tpm #source dataframe
tpm.up.o.90.df<-tpm[(rownames(tpm) %in% genes.up.o.90.names),]
tpm.down.o.90.df<-tpm[(rownames(tpm) %in% genes.down.o.90.names),]
my_data1<-scale(tpm.up.o.90.df)
my_data2<-scale(tpm.down.o.90.df)
library(factoextra)  # provides fviz_nbclust
fviz_nbclust(my_data1, kmeans, method = "gap_stat")  ## 3 clusters optimal
fviz_nbclust(my_data2, kmeans, method = "gap_stat")  ## 5 clusters optimal
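One detail I am not sure about: scale() standardizes each column (sample), whereas scale = "row" in pheatmap standardizes each row (gene). So the matrix passed to fviz_nbclust may not be the one pheatmap ends up clustering. A minimal sketch of the difference, using a made-up matrix `toy.tpm` (hypothetical, not my data):

```r
# Toy expression matrix: 4 genes (rows) x 3 samples (columns), made-up values
set.seed(1)
toy.tpm <- matrix(rexp(12, rate = 0.1), nrow = 4,
                  dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# scale() centres and scales each COLUMN (sample)
col.scaled <- scale(toy.tpm)

# Transposing before and after gives per-ROW (gene) z-scores instead
row.scaled <- t(scale(t(toy.tpm)))
```

If per-gene z-scores are what should be clustered, `row.scaled` would be the matrix to use when estimating the number of clusters.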
Based on the number of clusters I get, I plot with pheatmap:
library(pheatmap)
pheatmap(tpm.up.o.90.df, scale = "row", kmeans_k = 3, cluster_cols = TRUE, clustering_distance_cols = "correlation",
         clustering_method = "ward.D2", show_rownames = TRUE, show_colnames = TRUE, color = col, annotation = annote,
         fontsize_row = 6, fontsize_col = 7, border_color = NA, cellwidth = NA, cellheight = NA)
pheatmap(tpm.down.o.90.df, scale = "row", kmeans_k = 5, cluster_cols = TRUE, clustering_distance_cols = "correlation",
         clustering_method = "ward.D2", show_rownames = TRUE, show_colnames = TRUE, color = col, annotation = annote,
         fontsize_row = 6, fontsize_col = 7, border_color = NA, cellwidth = NA, cellheight = NA)
How should I select the top cluster from these heatmaps, given that there are two separate ones? Is it correct to cluster rows and columns here using kmeans_k and draw the heatmap with pheatmap? If so, how should I identify the best cluster: by calculating the SD of each cluster and selecting the one with the highest SD? If data and figures are needed, I can share them via a Dropbox link, at least the data used for the pheatmap plots. I am conceptually stuck on separating the genes by direction and then running k-means, so I would appreciate some expert suggestions.
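To show what I mean by "SD of a cluster", here is a minimal sketch in base R on made-up data. The matrix `toy`, the planted group of genes and k = 3 are all hypothetical, and ranking by the SD of the cluster centre is only my assumption of a criterion, not something I know to be the right one. Note that kmeans() starts from random centres, so the cluster numbers themselves carry no ranking.

```r
# Toy matrix: 30 genes x 6 samples, made-up values (not my real data)
set.seed(42)
toy <- matrix(rnorm(30 * 6), nrow = 30,
              dimnames = list(paste0("gene", 1:30), paste0("s", 1:6)))
toy[1:10, 4:6] <- toy[1:10, 4:6] + 4   # plant one strongly changing group

# Row z-scores, then k-means on the genes (k = 3 is an arbitrary choice here)
z  <- t(scale(t(toy)))
km <- kmeans(z, centers = 3, nstart = 25, iter.max = 100)

# SD of each cluster centre across samples: how strongly the cluster's
# average profile varies between samples
centre.sd <- apply(km$centers, 1, sd)

# Rank clusters by that SD (largest first) and list the genes in the top one
ranking     <- names(sort(centre.sd, decreasing = TRUE))
top.cluster <- as.integer(ranking[1])
top.genes   <- names(km$cluster)[km$cluster == top.cluster]
```

The same calculation could be run separately on the up and down matrices, which would give one ranking per heatmap rather than a single ranking across both.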