In an earlier post I used the `kmeans_k` option of the pheatmap package iteratively to reduce my rows (genes) from 90 to a stricter subset. I did this because, when I tested for the optimal number of clusters on the rows with the factoextra, cluster and NbClust packages, the optimal number of k-means clusters came out pretty low. So I ran `kmeans_k` iteratively on my data (90 rows, 15 columns), keeping both row and column clustering switched on, with correlation distance for the columns and the default for the rows.

This made me think that the clusters are already ranked. Is it true that the clusters are ranked in pheatmap, i.e. should the one labelled cluster 1 be the top cluster? I was selecting the top cluster based on the output, and since my data contain both up- and down-regulated genes, I treated the cluster with the highest SD as the top-ranked one. Was that correct?

Now I am splitting my list into up and down genes and computing the optimal number of k-means clusters for each, and I get better clusters. If I plot them with pheatmap, I have two separate heatmaps (one per direction), each with its optimal number of clusters. How do I select the top cluster from these two heatmaps? Shall I compute the SD for each cluster? Previous post link

Code for separating based on direction

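# `90.df` is not a syntactic R name, so it has to be quoted with backticks
o.90.df <- `90.df`[order(`90.df`$logFC), ]
ind <- which(o.90.df$logFC > 1)   # up-regulated genes
up.o.90.df <- o.90.df[ind, ]
ind <- which(o.90.df$logFC < 1)   # everything below 1; `< -1` would mirror the up cutoff
down.o.90.df <- o.90.df[ind, ]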

Next I build the data frames on which the optimal number of clusters is computed. The expression values are taken from the source data frame `tpm`:

tpm  # source data frame
tpm.up.o.90.df <- tpm[rownames(tpm) %in% genes.up.o.90.names, ]
tpm.down.o.90.df <- tpm[rownames(tpm) %in% genes.down.o.90.names, ]

# scale() standardises each column (sample)
my_data1 <- scale(tpm.up.o.90.df)
my_data2 <- scale(tpm.down.o.90.df)

fviz_nbclust(my_data1, kmeans, method = "gap_stat") ## 3 clusters optimal
fviz_nbclust(my_data2, kmeans, method = "gap_stat") ## 5 clusters optimal

Based on the number of clusters I get, I plot the heatmaps with pheatmap:

pheatmap(tpm.up.o.90.df,
         scale = "row",
         clustering_distance_cols = "correlation",
         clustering_method = "ward.D2",
         cluster_cols = TRUE,
         show_rownames = TRUE, show_colnames = TRUE,
         color = col, annotation = annote,
         fontsize_row = 6, fontsize_col = 7,
         border_color = NA, cellwidth = NA, cellheight = NA,
         kmeans_k = 3)

pheatmap(tpm.down.o.90.df,
         scale = "row",
         clustering_distance_cols = "correlation",
         clustering_method = "ward.D2",
         cluster_cols = TRUE,
         show_rownames = TRUE, show_colnames = TRUE,
         color = col, annotation = annote,
         fontsize_row = 6, fontsize_col = 7,
         border_color = NA, cellwidth = NA, cellheight = NA,
         kmeans_k = 5)

How should I select the top cluster from these heatmaps, given that there are two separate ones? Is it correct to cluster rows and columns this way, using `kmeans_k` and drawing the heatmap with pheatmap? If so, how should I detect the best cluster? By calculating the SD of each cluster and selecting the one with the highest SD? If data or figures are needed, I can upload them via a Dropbox link, at least the data I am passing to pheatmap. I am conceptually stuck at the moment on separating the genes by direction and then running k-means, and would appreciate some expert suggestions.
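To make the "SD per cluster" idea concrete, here is a minimal sketch on simulated data. It assumes plain `kmeans()` from base R on row-scaled values as a stand-in for what `kmeans_k` does; the matrix `toy`, the planted gene group and the `mean_sd` criterion are all illustrative assumptions, not my real data or an established ranking rule. As far as I can tell from the pheatmap documentation, the same fields (`cluster`, `size`, `withinss`) are available in the `kmeans` component of the object that `pheatmap()` returns when `kmeans_k` is set. Note that k-means cluster labels are arbitrary, so cluster 1 is not a rank.

```r
# Toy stand-in for a genes x samples TPM matrix (simulated, not real data)
set.seed(1)
toy <- matrix(rnorm(30 * 15), nrow = 30,
              dimnames = list(paste0("gene", 1:30), paste0("sample", 1:15)))
toy[1:10, 1:7] <- toy[1:10, 1:7] + 3   # plant one group of co-varying genes

# Row-wise z-scores, comparable to scale = "row" in pheatmap
z <- t(scale(t(toy)))

# Plain k-means on the rows; labels 1..k are arbitrary, not a ranking
km <- kmeans(z, centers = 3, nstart = 25)

# Per-cluster summary: size, within-cluster sum of squares, mean per-gene SD
gene_sd <- apply(toy, 1, sd)
cl <- factor(km$cluster, levels = seq_along(km$size))
summary_df <- data.frame(
  cluster  = seq_along(km$size),
  n_genes  = km$size,
  withinss = km$withinss,
  mean_sd  = as.numeric(tapply(gene_sd, cl, mean))
)
print(summary_df[order(-summary_df$mean_sd), ])

# Genes in the cluster with the highest mean SD (one possible criterion)
top <- summary_df$cluster[which.max(summary_df$mean_sd)]
top_genes <- names(km$cluster)[km$cluster == top]
```

A table like `summary_df` could be built separately for the up and the down heatmap, so that the clusters are compared on an explicit number rather than on the order in which they are drawn.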

ivivek_ngs
  • I am digging into the k-means object returned by pheatmap and found `withinss`, which should reflect how similar the members of each cluster are. Shouldn't that be the measure used to identify the top cluster? – ivivek_ngs May 09 '17 at 14:07
  • Can you explain what you mean by "best cluster"? To identify the best number of clusters to use, you can look at the SS measure and try to minimise it (check here https://www.r-bloggers.com/finding-optimal-number-of-clusters/). I'm not sure if this is what you mean? – JP1 May 09 '17 at 14:35
  • I understand how to obtain the optimal number of clusters from data using k-means and the gap statistic. When I plot those clusters in a heatmap, how do I select the top cluster? "Top" here refers to something that lets me narrow down my subset of rows. My motivation is to reduce my rows to a smaller subset based on `kmeans_k` in pheatmap and select the top cluster. The rows are genes that I want to use for validation, so the genes in the selected `kmeans_k` cluster should be able to define the classification of my samples. – ivivek_ngs May 09 '17 at 14:45
  • My approach was to obtain the optimal number of clusters from my scaled data, plot that number of clusters on a heatmap, and select the best cluster from it. Now I am at a loss as to what the measure of the best cluster should be. Let's say the optimal number of clusters is 5. When I plot the rows with `pheatmap` and `kmeans_k=5`, which of the 5 clusters should I select, and what criterion would be the most unbiased? I hope that is clearer now. – ivivek_ngs May 09 '17 at 14:57
