Use ddply() to aggregate relative histogram counts

Question

Related to a previous question I asked (ggplot2 how to get 2 histograms with the y value = to count of one / sum of the count of both), I tried to write a function which would take a data.frame as input with the response times (RT) and accuracy (correct) of several participants in several conditions, and output a "summary" data.frame with the data aggregated like in an histogram. The specificity here is that I don't want to get the absolute number of responses in each bin, but the relative count.

What I call relative count is that for each bin of the histogram, the value correspond to:

relative_correct   = ncorrect / sum(ncorrect+nincorrect)
relative_incorrect = nincorrect / sum(ncorrect+nincorrect)

The result is actually close to a density plot, except that it's not the sum of each curve which is equal to 1 but the sum of the correct and incorrect curves.

Here is the code to create sample data:

# CREATE EXAMPLE DATA
subjectname <- factor(rep(c("obs1","obs2"),each=50))
Visibility  <- factor(rep(rep(c("cond1","cond2"),each=25),2)) 
RT          <- rnorm(100,300,50)
correct     <- sample(c(rep(0,25),rep(1,75)),100)
my.data <- data.frame(subjectname,Visibility,RT,correct)

First I need to define a function to be used later in a ddply

histRTcounts <- function(df) {out = hist(df$RT, breaks=seq(5, 800, by=10), plot=FALSE)
                          out = out$counts}

And then the main function (there is 2 small issues which prevent it to work as inside a function, see the lines with ?????, but outside of a function this code works).

relative_hist_count <- function(df, myfactors) {
  require(ggplot2)
  require(plyr)
  require(reshape2)

  # ddply it to get one column for each bin of the histogram
  myhistRTcounts <- ddply(df, c(myfactors,"correct"), histRTcounts)

  # transform it in long format
  myhistRTcounts.long = melt(myhistRTcounts, id.vars =c(myfactors,"correct"), variable.name="bin", value.name = 'mycount')

  # rename the bin names with the ms value they correspond to
  levels(myhistRTcounts.long$bin) <- seq(5, 800, by=10)[-1]-5

  # make them numeric and not a factor anymore
  myhistRTcounts.long$bin = as.numeric(levels(myhistRTcounts.long$bin))[myhistRTcounts.long$bin]

  # cast to have count_correct and count_incorrect as columns
  # ??????????????????????? problem when putting that into a function
  # Here I was not able to figure out how to combine myfactors to the other variables in the call
  myhistRTcount.short = dcast(myhistRTcounts.long, subjectname + Visibility + bin ~ correct)
  names(myhistRTcount.short)[4:5] <- c("countinc","countcor")

  # compute relative counts
  myhistRTcounts.rel <- ddply(myhistRTcount.short, myfactors, transform, 
                          incorrect = countinc / sum(countinc+countcor),
                          correct = countcor / sum(countinc+countcor)
  )
  myhistRTcounts.rel = subset(myhistRTcounts.rel,select=c(-countinc,-countcor))

  myhistRTcounts.rel.long = melt(myhistRTcounts.rel, id.vars = c(myfactors,"bin"), variable.name = 'correct', value.name = 'mycount')

  # ??????????????????????? idem here, problem when putting that into a function to call myfactors
  ggplot(data=myhistRTcounts.rel.long, aes(x=bin, y=mycount, color=factor(correct))) + geom_line() + facet_grid(Visibility ~ subjectname) + xlim(0, 600) + theme_bw()

  return(myhistRTcounts.rel.long)

The call to apply it to the data

new.df = relative_hist_count(my.data, myfactors = c("subjectname","Visibility"))

So first, I would need your help to be able to make that work as a function with the possibility to use the myfactors variable in dcast() and ggplot().

But more importantly, I'm almost sure this function could be written much more elegantly and in a most straightforward manner, with less steps.

Thank you in advance for your help!

Roland · Answer 1 · 2013-01-31T11:01:40.210

Maybe this helps with setting up the data?

countfun <- function(x,...) {
  res <- hist(x,plot=FALSE,...)
  data.frame(counts=res$counts,
             break1=res$breaks[-length(res$breaks)],
             break2=res$breaks[-1])
}

library(plyr)
plot.dat <- ddply(my.data,.(Visibility),function(df){
  res <- ddply(df,.(correct),function(df2) {countfun(df2$RT,breaks=seq(100, 600, by=10))})
  res$freq2 <- res$counts/nrow(df)
  res
})

You probably need the whole parse, eval, as.formula stuff to generalize to arbitrary factors. I don't have time for that right now.

However, if you plan to generalize it might be better to modify the hist function to accept a parameter to use as factor on the counts.

Thanks Roland for the answer. I wrote a new "hist" function which is doing the job I wanted. I'll post the solution when stackexchange will allow me (it seems that new users like me need to wait 8h before being able to post an answer). — shora, Jan 31 '13 at 14:39

score 0 · Accepted Answer · answered Feb 01 '13 at 08:25

Thanks Roland, I did not think about writing a homemade hist function. Please find it below:

RelativeHistRT <- function (df, breaks = seq(5,800,10)) 
{
  distrib.correct   = hist(df$RT[df$correct==1], breaks, right=FALSE, plot=FALSE)
  distrib.incorrect = hist(df$RT[df$correct==0], breaks, right=FALSE, plot=FALSE)

  n.total = sum(distrib.correct$counts) + sum(distrib.incorrect$counts)

  data.frame(bin_mids  = distrib.correct$mids,
         correct   = distrib.correct$counts / n.total,
         incorrect = distrib.incorrect$counts / n.total)
}

And to apply it to my original data.frame and get what I was looking for:

myhistRTcounts <- ddply(my.data, .(subjectname,Visibility), RelativeHistRT)

This is indeed much shorter and does exactly what I was looking for.

Why don't you mark your answer as "accepted" to clearly show that the issue has been resolved? — A5C1D2H2I1M1N2O1R2T1, Aug 02 '13 at 04:16

Use ddply() to aggregate relative histogram counts

2 Answers2