I currently generate the following plot using ggplot in R:

The data is stored in a single data frame with three columns: the PDF (y-axis in the plot above), the bin midpoints (x-axis) and the dataset name. It is built from histograms.
What I want to do is plot a color-coded vertical line for each dataset at its 95th percentile (the 0.95 quantile), as in the example I drew by hand below:

I tried to use + geom_line(stat="vline", xintercept="mean") but of course I'm looking for the quantiles, not for the mean, and AFAIK ggplot does not allow that. Colors are fine.
I also tried + stat_quantile(quantiles = 0.95) but I'm not sure what it does exactly. Documentation is very scarce. Colors, again, are fine.

Please note that density values are very low, down to 1e-8. I don't know if the quantile() function likes that.

I understand that calculating the quantile of a histogram is not quite the same as calculating that of a list of numbers. I don't know how it would help, but the HistogramTools package contains an ApproxQuantile() function for histogram quantiles.

Minimum working example is included below. As you can see I obtain a data frame from each histogram, then bind the dataframes together and plot that.

library(ggplot2)

# Sample data, binned into 100 unit-width bins
v <- c(1:30, 2:50, 1:20, 1:5, 1:100, 1, 2, 1, 1:5, 0, 0, 0, 5, 1, 3, 7, 24, 77)
h <- hist(v, breaks = 0:100)

# One data frame per histogram: bin midpoints, densities, dataset label
df1 <- data.frame(Bin = h$mids, Pdf = h$density, Dataset = "dataset1")
df2 <- data.frame(Bin = h$mids * 2, Pdf = h$density * 2, Dataset = "dataset2")
df_tot <- rbind(df1, df2)

# Plot only the non-empty bins, colored by dataset
ggplot(data = df_tot[which(df_tot$Pdf > 0), ], aes(x = Bin, y = Pdf, group = Dataset, colour = Dataset)) +
  geom_point(aes(color = Dataset), alpha = 0.7, size = 1.5)
Glorfindel
AF7

1 Answer

Precomputing these values and plotting them separately seems like the simplest option. Doing so with dplyr requires minimal effort:

library(dplyr)
q.95 <- df_tot %>%
  group_by(Dataset) %>%
  summarise(Bin_q.95 = quantile(Bin, 0.95))

ggplot(data=df_tot[which(df_tot$Pdf>0),], 
       aes(x=Bin, y=Pdf, group=Dataset, colour=Dataset)) +
  geom_point(aes(color=Dataset), alpha = 0.7, size=1.5) + 
  geom_vline(data = q.95, aes(xintercept = Bin_q.95, colour = Dataset))


tonytonov
  • This plots the 95th quantile of Bin, though. For example, in the above plot the red dataset goes from 0 to 100, so the 95th quantile is simply 95, no matter what the densities are. That is, `quantile(c(0:100), 0.95)`. Same for the blue one. Unfortunately, I don't have access to the whole array of data before histogramming, because it is too big to fit in memory. This is why I need to use histograms. For each layer of the file, I create a histogram. I then merge them into a single histogram with `HistogramTools::AddHistograms`. – AF7 Nov 24 '14 at 12:55
  • This is just a demo. The idea behind it is that you'll have to compute the quantiles ahead of time and plot them from a separate data frame. I do not know how to compute these correctly; it seems that your data is rather complicated. If your question is essentially about how to compute quantiles for binned data (not about how to use `geom_vline`), let me know and I'll delete the answer. – tonytonov Nov 24 '14 at 13:18
  • No need to delete. I already know how to compute quantiles of a binned dataset: I can use ApproxQuantile(). In fact, I'll mark your answer as accepted, as it nudged me in the right direction, which is to save the quantiles before creating the data frames, then create a data frame with them and plot it with geom_vline. I had something along these lines in mind but was not quite able to bring it into focus. – AF7 Nov 24 '14 at 13:44
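
The approach settled on in the comments (compute each dataset's quantile from its histogram first, collect the results in a small data frame, then draw them with geom_vline) could look roughly like the sketch below. It is only an illustration under stated assumptions: `binned_quantile` is a hypothetical hand-rolled helper, not a function from HistogramTools or any other package, and it assumes the mass inside each bin is spread uniformly. It is used here only so the example depends on nothing beyond ggplot2; `ApproxQuantile()` from HistogramTools, mentioned in the question, would take its place in practice.

```r
library(ggplot2)

# Hypothetical helper (not from any package): approximate quantile of binned
# data. Builds the empirical CDF at the bin breaks and interpolates linearly
# inside the bin where the CDF first reaches p (uniform mass per bin assumed).
binned_quantile <- function(breaks, counts, p) {
  stopifnot(length(breaks) == length(counts) + 1, p > 0, p < 1)
  cdf <- c(0, cumsum(counts) / sum(counts))  # CDF evaluated at each break
  j <- which(cdf >= p)[1]                    # first break where the CDF reaches p
  i <- j - 1                                 # left edge of the bin containing p
  frac <- (p - cdf[i]) / (cdf[j] - cdf[i])   # position of p inside that bin
  breaks[i] + frac * (breaks[j] - breaks[i])
}

# Demo data from the question, histogrammed without drawing the base plot
v <- c(1:30, 2:50, 1:20, 1:5, 1:100, 1, 2, 1, 1:5, 0, 0, 0, 5, 1, 3, 7, 24, 77)
h <- hist(v, breaks = 0:100, plot = FALSE)

# Quantiles are computed from the histograms, before any plotting data frame
# exists. dataset2 mimics the question's second dataset (bins stretched by 2).
q95 <- data.frame(
  Dataset = c("dataset1", "dataset2"),
  Q95 = c(binned_quantile(h$breaks, h$counts, 0.95),
          binned_quantile(h$breaks * 2, h$counts, 0.95))
)

# Plotting data, built as in the question
df_tot <- rbind(
  data.frame(Bin = h$mids, Pdf = h$density, Dataset = "dataset1"),
  data.frame(Bin = h$mids * 2, Pdf = h$density * 2, Dataset = "dataset2")
)

# Points from the binned data, one vertical line per dataset from q95
p <- ggplot(df_tot[df_tot$Pdf > 0, ], aes(x = Bin, y = Pdf, colour = Dataset)) +
  geom_point(alpha = 0.7, size = 1.5) +
  geom_vline(data = q95, aes(xintercept = Q95, colour = Dataset))
print(p)
```

Because the helper normalizes the counts to a CDF, the absolute scale of the values does not matter, so very small densities should not be a problem in themselves. If only densities are available, the per-bin mass can be recovered as density times bin width and passed in place of `counts`.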