R: How to aggregate data into percentages without missing data for stacked-bar plot in ggplot2?

Question

I would like to summarize my "karyotype" molecular data by location and substrate (see sample data below) as percentages in order to create a stack-bar plot in ggplot2.

I have figured out how to use 'dcast' to get a total for each karyotype, but cannot figure out how to get a percent for each of the three karyotypes (i.e. 'BB', 'BD', 'DD').

The data should be in a format to make a stacked bar plot in 'ggplot2'.

Sample Data:

library(reshape2)
Karotype.Data <- structure(list(Location = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L), .Label = c("Kampinge", "Kaseberga", "Molle", "Steninge"
), class = "factor"), Substrate = structure(c(1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 
2L, 2L, 2L, 2L, 2L), .Label = c("Kampinge", "Kaseberga", "Molle", 
"Steninge"), class = "factor"), Karyotype = structure(c(1L, 3L, 
4L, 4L, 3L, 3L, 4L, 4L, 4L, 3L, 1L, 4L, 3L, 4L, 4L, 3L, 1L, 4L, 
3L, 3L, 4L, 3L, 4L, 3L, 3L), .Label = c("", "BB", "BD", "DD"), class = "factor")), .Names = c("Location", 
"Substrate", "Karyotype"), row.names = c(135L, 136L, 137L, 138L, 
139L, 165L, 166L, 167L, 168L, 169L, 236L, 237L, 238L, 239L, 240L, 
326L, 327L, 328L, 329L, 330L, 426L, 427L, 428L, 429L, 430L), class = "data.frame")

## Summary count for each karoytype ##
Karyotype.Summary <- dcast(Karotype.Data , Location + Substrate ~ Karyotype, value.var="Karyotype", length)

Perhaps you need to do `Karyotype.Summary[,3:5] <- Karyotype.Summary[,3:5]/rowSums(Karyotype.Summary[,3:5])*100` — Marat Talipov, Mar 16 '15 at 15:37

score 1 · Answer 1 · answered Mar 16 '15 at 16:30

1

You can use the dplyr package:

library(dplyr)
z.counts <- Karotype.Data %>% 
  group_by(Location,Substrate,Karyotype) %>% 
  summarize(freq=n()) 

z.freq <- z.counts %>% 
  group_by(Location,Substrate) %>% 
  mutate(freq=freq/sum(freq)*100)

Here, the data remain in the long format, so it is straightforward to build the barplot with ggplot:

library(ggplot2)
ggplot(z.freq) + 
  aes(x=Karyotype,y=freq) + 
  facet_grid(Location~Substrate) + 
  geom_bar(stat='identity')

enter image description here

answered Mar 16 '15 at 16:30

Marat Talipov

13,064
5
34
53

I really appreciate your help, but two issues. I do not want to include the data with missing karyotype data and second, I would like a stacked-bar plot. That said, if I can now figure out how to remove the records with the missing karyotype data, I think I can make the correct plots. Thanks! – Keith W. Larson Mar 16 '15 at 20:18

Keith W. Larson · Accepted Answer · 2015-03-16T20:53:41.147

With some help from 'Marat Talipov' and many other answers to questions on Stackoverflow I found out that it is important to load 'plyr' before 'dplyr' and to use 'summarise' rather than 'summarize'. Then removing the missing data was the last step using 'filter'.

library(dplyr)
z.counts <- Karotype.Data %>% 
  group_by(Location,Substrate,Karyotype) %>% 
  summarise(freq=n()) 

z.freq <- z.counts %>% filter(Karyotype != '') %>% 
  group_by(Location,Substrate) %>% 
  mutate(freq=freq/sum(freq))
z.freq

library (ggplot2)
ggplot(z.freq, aes(x=Substrate, y=freq, fill=Karyotype)) +
  geom_bar(stat="identity") +
  facet_wrap(~ Location)

Now I have created the plot I was looking for:

enter image description here

R: How to aggregate data into percentages without missing data for stacked-bar plot in ggplot2?

2 Answers2