Related to a previous question I asked (ggplot2 how to get 2 histograms with the y value = to count of one / sum of the count of both), I tried to write a function that takes as input a data.frame with the response times (RT) and accuracy (correct) of several participants in several conditions, and outputs a "summary" data.frame with the data aggregated as in a histogram. The particularity here is that I don't want the absolute number of responses in each bin, but the relative count.
What I call the relative count is that, for each bin of the histogram, the value corresponds to:
relative_correct = ncorrect / sum(ncorrect+nincorrect)
relative_incorrect = nincorrect / sum(ncorrect+nincorrect)
The result is close to a density plot, except that it is not each curve that sums to 1, but the correct and incorrect curves taken together.
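To make the definition concrete, here is a small sketch with made-up counts for three bins (the numbers are hypothetical and only for illustration):

```r
# Hypothetical counts per bin (made-up numbers, not from my data)
ncorrect   <- c(2, 10, 6)
nincorrect <- c(1, 0, 1)

# The denominator is the total number of responses over all bins
total <- sum(ncorrect + nincorrect)

relative_correct   <- ncorrect / total
relative_incorrect <- nincorrect / total

# The two curves together sum to 1, not each curve separately
total_area <- sum(relative_correct) + sum(relative_incorrect)
```

Here neither `relative_correct` nor `relative_incorrect` sums to 1 on its own; only their combined sum does.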
Here is the code to create sample data:
# CREATE EXAMPLE DATA
subjectname <- factor(rep(c("obs1","obs2"),each=50))
Visibility <- factor(rep(rep(c("cond1","cond2"),each=25),2))
RT <- rnorm(100,300,50)
correct <- sample(c(rep(0,25),rep(1,75)),100)
my.data <- data.frame(subjectname,Visibility,RT,correct)
First I need to define a function to be used later in a ddply:
histRTcounts <- function(df) {
  out <- hist(df$RT, breaks = seq(5, 800, by = 10), plot = FALSE)
  out$counts
}
And then the main function (there are 2 small issues that prevent it from working inside a function, see the lines marked with ?????; outside of a function this code works).
relative_hist_count <- function(df, myfactors) {
require(ggplot2)
require(plyr)
require(reshape2)
# ddply it to get one column for each bin of the histogram
myhistRTcounts <- ddply(df, c(myfactors,"correct"), histRTcounts)
# transform it in long format
myhistRTcounts.long = melt(myhistRTcounts, id.vars =c(myfactors,"correct"), variable.name="bin", value.name = 'mycount')
# rename the bin names with the ms value they correspond to
levels(myhistRTcounts.long$bin) <- seq(5, 800, by=10)[-1]-5
# make them numeric and not a factor anymore
myhistRTcounts.long$bin = as.numeric(levels(myhistRTcounts.long$bin))[myhistRTcounts.long$bin]
# cast to have count_correct and count_incorrect as columns
# ??????????????????????? problem when putting that into a function
# Here I was not able to figure out how to combine myfactors with the other variables in the call
myhistRTcount.short = dcast(myhistRTcounts.long, subjectname + Visibility + bin ~ correct)
names(myhistRTcount.short)[4:5] <- c("countinc","countcor")
# compute relative counts
myhistRTcounts.rel <- ddply(myhistRTcount.short, myfactors, transform,
incorrect = countinc / sum(countinc+countcor),
correct = countcor / sum(countinc+countcor)
)
myhistRTcounts.rel = subset(myhistRTcounts.rel,select=c(-countinc,-countcor))
myhistRTcounts.rel.long = melt(myhistRTcounts.rel, id.vars = c(myfactors,"bin"), variable.name = 'correct', value.name = 'mycount')
# ??????????????????????? idem here, problem when putting that into a function to call myfactors
# print() is needed because a ggplot object is not drawn automatically inside a function
print(ggplot(data=myhistRTcounts.rel.long, aes(x=bin, y=mycount, color=factor(correct))) + geom_line() + facet_grid(Visibility ~ subjectname) + xlim(0, 600) + theme_bw())
return(myhistRTcounts.rel.long)
}
The call to apply it to the data:
new.df = relative_hist_count(my.data, myfactors = c("subjectname","Visibility"))
So first, I need help making this work as a function, so that the myfactors variable can be used in the calls to dcast() and ggplot().
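A minimal sketch of one possible direction, assuming the formulas can be built from strings and then passed to dcast() and facet_grid(). The names cast_formula and facet_formula are hypothetical, and the facet formula assumes exactly two factors:

```r
myfactors <- c("subjectname", "Visibility")

# Left-hand side: all grouping factors plus the bin; right-hand side: correct
lhs <- paste(c(myfactors, "bin"), collapse = " + ")
cast_formula <- as.formula(paste(lhs, "~ correct"))

# Facet formula of the form rows ~ columns (assumes exactly two factors)
facet_formula <- as.formula(paste(myfactors[2], "~", myfactors[1]))

# Intended use inside the function (not run here):
#   dcast(myhistRTcounts.long, cast_formula)
#   ... + facet_grid(facet_formula)
```

With such a formula, the positions used in names(...)[4:5] would also have to depend on length(myfactors) rather than being hard-coded.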
But more importantly, I'm almost sure this function could be written much more elegantly and in a more straightforward manner, with fewer steps.
Thank you in advance for your help!