
I want a bar plot of how often each string occurs in a particular column of a dataset in R.

At the same time, I want to run a t-test and mark the significant p-values with stars above the bars. Nonsignificant results can be labeled ns.

My attempt has been:

barplot(prop.table(table(ttcluster_dataset$Phenotype)),col=clustercolor,border="black",xlab="Phenotypes",ylab="Percentage of Samples expressed",main="Sample wise Phenotype distribution",cex.names = 0.8)

The dataset column is:

ttcluster_dataset$Phenotype<- 
structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L), .Label = c("Proneural (Cluster 1)", "Proneural (Cluster 2)", "Neural (Cluster 1)", "Neural (Cluster 2)", 
"Classical (Cluster 1)", "Classical (Cluster 2)", "Mesenchymal (Cluster 1)", 
"Mesenchymal (Cluster 2)"), class = "factor")

All suggestions are appreciated.

driver
    A t-test compares the means of two vectors. Your barplot has 8 bars. What do you want to compare? It might make sense to split Phenotype into two variables, Phenotype and Cluster. – dcarlson Jun 23 '22 at 03:27
  • @dcarlson I want a t-test between the same phenotype of the two different clusters. If you look closely there is **neural** coming twice once with **(cluster 1)** and secondly with **(cluster 2)**. Similarly for **proneural**, **classical** and **mesenchymal**. I hope it is now more clear. – driver Jun 23 '22 at 09:40

1 Answer


A t-test is probably not what you want, since you are comparing counts and proportions between the two clusters rather than means. Your data is not set up for either kind of test, so first we need to split Phenotype and Cluster into separate variables:

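# Split each label, e.g. "Proneural (Cluster 1)", at the spaces
Pheno.splt <- strsplit(as.character(ttcluster_dataset$Phenotype), " ")
# Bind the pieces into a matrix and keep parts 1 (phenotype) and 3 (cluster number)
Pheno.mat <- do.call(rbind, Pheno.splt)[, c(1, 3)]
# Strip the trailing ")" from the cluster number (fixed = TRUE avoids regex parsing)
ttclust <- data.frame(Phenotype=Pheno.mat[, 1], Cluster=gsub(")", "", Pheno.mat[, 2], fixed=TRUE))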
str(ttclust)
# 'data.frame': 171 obs. of  2 variables:
#  $ Phenotype: chr  "Proneural" "Proneural" "Proneural" "Proneural" ...
#  $ Cluster  : chr  "1" "1" "1" "1" ...
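
As one alternative to splitting on spaces, the two parts can be captured with a regular expression. This is only a minimal sketch: `lv`, `pheno`, `pattern`, and `ttclust2` are illustrative names that do not appear in the question, and `pheno` is rebuilt from the per-level counts in the posted data so the block runs on its own.

```r
# Stand-in for ttcluster_dataset$Phenotype, rebuilt from the posted counts
lv <- c("Proneural (Cluster 1)", "Proneural (Cluster 2)",
        "Neural (Cluster 1)", "Neural (Cluster 2)",
        "Classical (Cluster 1)", "Classical (Cluster 2)",
        "Mesenchymal (Cluster 1)", "Mesenchymal (Cluster 2)")
pheno <- factor(rep(lv, times = c(45, 8, 26, 0, 32, 6, 44, 10)), levels = lv)

# Group 1 captures the phenotype name, group 2 the cluster number
pattern <- "^([A-Za-z]+) \\(Cluster ([0-9]+)\\)$"
ttclust2 <- data.frame(
  Phenotype = sub(pattern, "\\1", as.character(pheno)),
  Cluster   = sub(pattern, "\\2", as.character(pheno))
)
str(ttclust2)
```

The regular expression assumes every label has exactly the form "Name (Cluster N)"; labels that do not match would be passed through unchanged by `sub()`.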

There are multiple ways to do this; here we simply split each Phenotype string into three parts at the spaces, keep the first and third, and strip the closing parenthesis. ttclust is now a data frame with Phenotype and Cluster as separate columns. Next, a summary table and bar plot:

tbl <- xtabs(~Phenotype+Cluster, ttclust)
tbl
#              Cluster
# Phenotype      1  2
#   Classical   32  6
#   Mesenchymal 44 10
#   Neural      26  0
#   Proneural   45  8
tbl.row <- prop.table(tbl, 1)
barplot(t(tbl.row), beside=TRUE)

[Figure: Barplot]

At this point, a simple proportions test gives no evidence that the percentage of samples in Cluster 1 differs across the four Phenotypes (p = 0.15):

prop.test(tbl)

4-sample test for equality of proportions without continuity correction

data:  tbl
X-squared = 5.2908, df = 3, p-value = 0.1517
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4 
0.8421053 0.8148148 1.0000000 0.8490566 
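
For readers wondering what the 4-sample test computes: for a table with two columns, it is the Pearson chi-squared test of independence, so `chisq.test` should give the same statistic. A minimal sketch, assuming the counts shown in the table above; `pt` and `ct` are illustrative names, and the table is rebuilt by hand so the block runs on its own.

```r
# Counts copied from the summary table (rows: Phenotype, columns: Cluster)
tbl <- as.table(matrix(
  c(32, 44, 26, 45,   # Cluster 1
     6, 10,  0,  8),  # Cluster 2
  ncol = 2,
  dimnames = list(Phenotype = c("Classical", "Mesenchymal", "Neural", "Proneural"),
                  Cluster = c("1", "2"))
))

# Both calls warn that the chi-squared approximation may be inaccurate,
# because some expected counts are below 5; suppressed here for readability
pt <- suppressWarnings(prop.test(tbl))
ct <- suppressWarnings(chisq.test(tbl))

print(c(prop.test = unname(pt$statistic), chisq.test = unname(ct$statistic)))
print(all.equal(unname(pt$statistic), unname(ct$statistic)))
```

Given the small expected counts, and the zero count for Neural in Cluster 2, the p-value is best read as approximate.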

Using `prop.test` on each Phenotype separately tests whether its Cluster 1 proportion equals 0.5, and indicates that Cluster 1 is significantly different from Cluster 2 in every case:

for(i in 1:4) print(prop.test(t(tbl[i, ])))

# First test
# 
#   1-sample proportions test with continuity correction
# 
# data:  t(tbl[i, ]), null probability 0.5
# X-squared = 16.447, df = 1, p-value = 5.002e-05
# alternative hypothesis: true p is not equal to 0.5
# 95 percent confidence interval:
#  0.6807208 0.9341311
# sample estimates:
#         p 
# 0.8421053 
    . . . .
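
The question also asked for stars above the bars. The following is a minimal sketch rather than a definitive recipe: it uses the per-phenotype `prop.test` p-values (not a t-test), and the cutoffs 0.001, 0.01, and 0.05 are a common convention assumed here, not something taken from the question. The names `pvals`, `stars`, and `mids` are illustrative, and the table is rebuilt from the counts above so the block runs on its own.

```r
# Counts copied from the summary table (rows: Phenotype, columns: Cluster)
tbl <- as.table(matrix(
  c(32, 44, 26, 45,   # Cluster 1
     6, 10,  0,  8),  # Cluster 2
  ncol = 2,
  dimnames = list(Phenotype = c("Classical", "Mesenchymal", "Neural", "Proneural"),
                  Cluster = c("1", "2"))
))
tbl.row <- prop.table(tbl, 1)

# One-sample proportion test per phenotype (H0: share in Cluster 1 is 0.5)
pvals <- apply(tbl, 1, function(r) prop.test(r[1], sum(r))$p.value)

# Translate p-values into labels (assumed conventional cutoffs)
stars <- as.character(cut(pvals,
                          breaks = c(0, 0.001, 0.01, 0.05, 1),
                          labels = c("***", "**", "*", "ns"),
                          include.lowest = TRUE))

# barplot() returns the bar midpoints: with beside = TRUE this is a matrix
# with one column per phenotype, so colMeans() gives each group's centre
mids <- barplot(t(tbl.row), beside = TRUE, ylim = c(0, 1.2),
                ylab = "Proportion of samples")

# Place each label slightly above the taller bar of its group
text(x = colMeans(mids), y = apply(tbl.row, 1, max) + 0.05, labels = stars)
```

With these counts every phenotype falls below the 0.001 cutoff, so all four groups are labeled `***`; a phenotype with p above 0.05 would be labeled `ns`.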
dcarlson