Conditional stat_summary for ggplot in R

Question

I'd like to write some conditional stats in my graph if the data is bigger than a certain value.

With the kind help of Jack Ryan (Cut data and access groups to draw percentile lines), I could create the following script that groups data into hours and plots the result:

# Read example data
A <- read.csv(url('http://people.ee.ethz.ch/~hoferr/download/data-20130812.csv'))

# Libraries
library(doBy)
library(ggplot2)
library(plyr)
library(reshape2)
library(MASS)
library(scales)

# Sample size function
give.n <- function(x){
       return(c(y = min(x) - 0.2, label = length(x)))
}

# Calculate gaps
gaps <- rep(NA, length(A$Timestamp))
times <- A$Timestamp
loss <- A$pingLoss
gap.start <- 1
gap.end <- 1
for(i in 2:length(A$Timestamp))
{ #For all rows
    if(is.na(A$pingRTT.ms.[i]))
    { #Currently no connection
        if(!is.na(A$pingRTT.ms.[i-1]))
        { #Connection lost now
            gap.start <- i
        }
        if(!is.na(A$pingRTT.ms.[i+1]))
        { # Connection restores next time
            gap.end <- i+1
            gaps[gap.start] <- as.numeric(A$Timestamp[gap.end]-A$Timestamp[gap.start], units="secs")
            loss[gap.start] <- gap.end - gap.start
        } 
    }       
}              
H <- data.frame(times, gaps, loss)
H <- H[complete.cases(H),]
C <- H      
C$dates <- strptime(C$times, "%Y-%m-%d %H:%M:%S")
C$h1 <- C$dates$hour

# Calculate percentiles
cuts <- c(1, .75, .5, .25, 0)
c <- ddply(C, .(h1), function (x) { summarise(x, y = quantile(x$gaps, cuts)) } )
c$cuts <- cuts
c <- dcast(c, h1 ~ cuts, value.var = "y")
c.melt <- melt(c, id.vars = "h1")

p <- ggplot(c.h1.melt, aes(x = h1, y = value, color = variable)) +
geom_point(size = 4) +
stat_summary(fun.data = max.n, geom = "text", fun.y = max, colour = "red", angle = 90, size=4) +
scale_colour_brewer(palette="RdYlBu", name="Percentile", guide = guide_legend(reverse=TRUE)) +
scale_x_continuous(breaks=0:23, limits = c(0,23)) +
annotation_logticks(sides = "lr") +
theme_bw() +
scale_y_log10(breaks=c(1e0,1e1,1e2,1e3,1e4), labels = trans_format("log10", math_format(10^.x)), limits=c(1e0,1e4)) +
xlab("Hour of day") + ylab("Ping gaps [s]")
p

p <- ggplot(c.m1.melt, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 1) +
stat_summary(fun.data = give.n, geom = "text", fun.y = median, angle = 90, size=4) +
stat_summary(fun.data = max.n, geom = "text", fun.y = max, colour = "red", angle = 90, size=4) +
scale_colour_brewer(palette="RdYlBu", name="Percentile", guide = guide_legend(reverse=TRUE)) +
scale_x_continuous(breaks=0:23, limits = c(0,24)) +
annotation_logticks(sides = "lr") +
theme_bw() +
scale_y_log10(breaks=c(1e0,1e1,1e2,1e3,1e4), labels = trans_format("log10", math_format(10^.x)), limits=c(1e0,1e4)) +
xlab("Time of day") + ylab("Ping gaps [s]")
p

This creates an hourly grouped plot of gaps with the length of the longest gaps written right next to the data points:

Hourly grouped plot without sample number per group

Below is the minutely grouped plot. The number are unreadable why I'd like to add conditional stats if the gap is longer than 5 minutes or only for the ten longest gaps or something like this.

Minutely grouped plot with unreadable stats

I tried to just change the stat function to

max.n.filt <- function(x){
    filter = 300
    if ( x > filter ) {
      return(c(y = max(x) + 0.4, label = round(max(10^x),2)))
    } else {
        return(c(y=x, label = ""))
    }
}

and use this for the minutely grouped plot. But I got this error:

Error in list_to_dataframe(res, attr(.data, "split_labels")) : 
  Results do not have equal lengths
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Error in if (nrow(layer_data) == 0) return() : argument is of length zero
Calls: print ... print.ggplot -> ggplot_gtable -> Map -> mapply -> <Anonymous>
In addition: Warning message:
Removed 6 rows containing missing values (geom_point).

In addition, in the hourly plot, I'd like to write the number of samples per hour right next to the length of the gaps. I think I can add a new column to the c data frame, but unfortunately I can't find a way to do this.

Any help is very much appreciated.

score 2 · Accepted Answer · answered Aug 23 '13 at 23:35

See ?stat_summary.

fun.data : Complete summary function. Should take data frame as input and return data frame as output

Your function max.n.filt uses an if() statement that tries to evaluate the condition x > filter. But when length(x) > 1, the if() statement only evaluates the condition for the first value of x. When used on a data frame, this will return a list cobbled together from the original input x and whatever label the if() statement returns.

> max.n.filt(data.frame(x=c(10,15,400)))
$y.x
[1]  10  15 400

$label
[1] ""

Try a function that uses ifelse() instead:

max.n.filt2 <- function(x){
    filter = 300                  # whatever threshold
    y = ifelse( x > filter, max(x) + 1, x[,1] )
    label = ifelse( x > filter, round(max(x),2), NA )
    return(data.frame(y=y[,1], label=label[,1]))
}

> max.n.filt2(data.frame(x=c(10,15,400)))
    y label
1  10    NA
2  15    NA
3 401   400

Alternatively, you might just find it easier to use geom_text(). I can't reproduce your example, but here's a simulated dataset:

set.seed(101)
sim_data <- expand.grid(m1=1:1440, variable=factor(c(0,0.25,0.5,0.75,1)))
sim_data$sample_size <- sapply(1:1440, function(.) sample(1:25, 1, replace=T))
sim_data$value = t(sapply(1:1440, function(.) quantile(rgamma(sim_data$sample_size, 0.9, 0.5),c(0,0.25,0.5,0.75,1))))[1:(1440*5)]

Just use the subset argument in geom_text() to select those points you wish to label:

ggplot(sim_data, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 4) + geom_text(aes(label=round(value)), subset = .(variable == 1 & value > 25), angle = 90, size = 4, colour = "red", hjust = -0.5)

If you have a column of sample sizes, those can be incorporated into label with paste():

ggplot(sim_data, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 4) + geom_text(aes(label=paste(round(value),", N=",sample_size)), subset = .(variable == 1 & value > 25), angle = 90, size = 4, colour = "red", hjust = -0.25)

(or create a separate column in your data with whatever labels you want.) If you're asking about how to retrieve the sample sizes, you could modify your call to ddply() like this:

...
c2 <- ddply(C, .(h1), function (x) { cbind(summarise(x, y = quantile(x$gaps, cuts)), n=nrow(x)) } )
c2$cuts <- cuts
c2 <- dcast(c2, h1 + n ~ cuts, value.var = "y")
c2.h1.melt <- melt(c2, id.vars = c("h1","n"))
...

Conditional stat_summary for ggplot in R

1 Answers1