
I have a dataframe in R like this:

dat = data.frame(Sample = c(1,1,2,2,3), Start = c(100,300,150,200,160), Stop = c(180,320,190,220,170))

I would like to plot it so that the x-axis is the position and the y-axis is the number of samples covering that position, with each sample in a different colour. In the example above, some positions would have height 1, some height 2, and one region height 3. The aim is to find regions covered by a large number of samples and to see which samples are in each region.

i.e. something like:

      &
     ---
********-  --       **

where * = Sample 1, - = Sample 2 and & = Sample 3

yoda230

2 Answers


My first try:

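library(ggplot2)
dat$Sample = factor(dat$Sample)
ggplot(aes(x = Start, y = Sample, xend = Stop, yend = Sample, color = Sample), data = dat) + 
  # One coloured segment per row, at the height of its sample (linewidth needs ggplot2 >= 3.4.0)
  geom_segment(linewidth = 2) + 
  # All segments again at y = 0, semi-transparent so overlaps appear darker
  geom_segment(aes(x = Start, y = 0, xend = Stop, yend = 0), linewidth = 2, alpha = 0.2, color = "black")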


I combine two segment geometries here. The first draws the coloured horizontal bars, which show where each sample has been measured. The second draws all segments again as a semi-transparent black bar at y = 0, which appears darker where more samples overlap and so gives a rough picture of sample density. Any comments to improve on this quick hack?

Paul Hiemstra
  • That might work, thanks, but I would really prefer the y-axis to be the total number of samples in that particular region, i.e. on the 1 line you could have more than one type of sample. That way you could easily visualise a cutoff of, say, at least 3 samples for common regions. – yoda230 Dec 09 '11 at 14:26

This hack may be what you're looking for. Note, however, that I've greatly increased the size of the data frame (one row per position in each segment) in order to take advantage of the stacking done by geom_histogram.

library(ggplot2)
dat = data.frame(Sample = c(1,1,2,2,3), 
                 Start = c(100,300,150,200,160), 
                 Stop = c(180,320,190,220,170))

# Reformat the data for plotting with geom_histogram.
dat2 = matrix(ncol=2, nrow=0, dimnames=list(NULL, c("Sample", "Position")))

for (i in seq(nrow(dat))) {
    Position = seq(dat[i, "Start"], dat[i, "Stop"])
    Sample = rep(dat[i, "Sample"], length(Position))
    dat2 = rbind(dat2, cbind(Sample, Position))
}

dat2 = as.data.frame(dat2)
dat2$Sample = factor(dat2$Sample)

plot_1 = ggplot(dat2, aes(x=Position, fill=Sample)) +
         theme_bw() +
         # Hide the default grid; horizontal guide lines are drawn manually below.
         theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank()) +
         # linewidth needs ggplot2 >= 3.4.0; use size on older versions.
         geom_hline(yintercept=seq(0, 20), colour="grey80", linewidth=0.15) +
         geom_hline(yintercept=3, linetype=2) +
         geom_histogram(binwidth=1) +
         ylim(c(0, 20)) +
         ylab("Count") +
         theme(axis.title.x=element_text(size=11, vjust=0.5)) +
         theme(axis.title.y=element_text(size=11, angle=90)) +
         ggtitle("Segment Plot")

png("plot_1.png", height=200, width=650)
print(plot_1)
dev.off()

Note that the way I've reformatted the data frame is a bit ugly: it creates one row per covered position, so it will not scale well (e.g. if you have millions of segments and/or large start and stop positions).
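If the per-position expansion becomes too large, one possible alternative is to compute the depth only at the points where it changes. The following is a minimal base R sketch, not part of the approach above. It assumes Stop is inclusive (as in the seq(Start, Stop) expansion), and the names events, steps, runs and common are made up for illustration. It counts overlapping segments, so a sample with two overlapping segments of its own would be counted twice.

```r
dat <- data.frame(Sample = c(1, 1, 2, 2, 3),
                  Start  = c(100, 300, 150, 200, 160),
                  Stop   = c(180, 320, 190, 220, 170))

# Each segment adds +1 at Start and -1 just past Stop (Stop is inclusive).
events <- data.frame(Position = c(dat$Start, dat$Stop + 1),
                     Change   = c(rep(1, nrow(dat)), rep(-1, nrow(dat))))

# Sum the changes at each distinct position, then accumulate them in order.
steps <- aggregate(Change ~ Position, data = events, FUN = sum)
steps <- steps[order(steps$Position), ]
steps$Depth <- cumsum(steps$Change)

# Each run lasts from one breakpoint up to the position before the next one.
runs <- data.frame(Start = head(steps$Position, -1),
                   Stop  = tail(steps$Position, -1) - 1,
                   Depth = head(steps$Depth, -1))

# Regions covered by at least 3 segments (the cutoff asked about in the comments).
common <- runs[runs$Depth >= 3, ]
print(common)
```

The size of runs grows with the number of segments rather than with their length, and the result could be drawn as rectangles (for example with geom_rect), although the per-sample colouring is lost in this form.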


bdemarest
  • Thanks. That looks really good. It's genomic data, so there are large start and stop positions, but maybe I can rescale it to just take chunks at intervals of 100 or something like that. – yoda230 Dec 12 '11 at 10:10
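The rescaling idea in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: bin_width, binned and counts are hypothetical names, bins are aligned to multiples of bin_width, and a sample is counted once for every bin that any of its segments touches.

```r
dat <- data.frame(Sample = c(1, 1, 2, 2, 3),
                  Start  = c(100, 300, 150, 200, 160),
                  Stop   = c(180, 320, 190, 220, 170))

bin_width <- 100  # hypothetical chunk size

# Index of the first and last bin touched by each segment.
first_bin <- dat$Start %/% bin_width
last_bin  <- dat$Stop  %/% bin_width

# One row per (segment, bin) pair instead of one row per position.
n_bins <- last_bin - first_bin + 1
binned <- data.frame(Sample = rep(dat$Sample, times = n_bins),
                     Bin    = rep(first_bin, times = n_bins) + sequence(n_bins) - 1)
binned$BinStart <- binned$Bin * bin_width

# Count each sample at most once per bin.
binned <- unique(binned)
counts <- table(binned$BinStart)
print(counts)
```

Note the trade-off: a count of 3 in a bin only means that three samples touch that bin, not that all three overlap at a single position, so the bin width limits the resolution of any cutoff.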