I have a dataframe where each row is a unique user, and there is a column where each user's choice is represented along with columns for background variables. In the actual dataset, there are 800+ choices and background variables can have between 2 and 60+ levels.
What I am doing is finding total counts for each choice based on a variety of different grouping combinations. I have gotten that successfully, but I now want to name them in a meaningful way.
Source that helped for the first part of this question: How to use aggregate with a list of column names
# sample data
d <- data.frame(user_id = 1:12, choice = rep(c("apple", "banana", "banana"),4), cond1 = c("a1", "a1", "a2", "a1", "a2", "a1", "a2", "a2", "a3", "a3", "a3", "a3"), gender = c(rep(c("F", "M"), 6)))
# list of the different grouping combinations (choice will always be one of these terms)
c1 <- list(c("choice", "cond1"), c("choice", "cond1","gender"))
# variable to be aggregated over
v1 <- c("user_id")
varNames <- c("cond1Ct", "cond1_GenderCt") # for later use
counts <- mapply(FUN= function(z,y, ...) {aggregate(d[ , y], by=d[z], ...)},
c1, v1, MoreArgs=list(FUN=length) )
# assign each part of the output to its own dataframe for later use
cond1 <- counts[[1]]
cond1_Gender <- counts[[2]]
So this works so far.
As seen with Name columns within aggregate in R, it is possible to assign a column name to the variable you are aggregating over. I couldn't get this to work in the mapply function call. That is my primary question here --- how can I automate the naming of the tabulated variables for each grouping combination.
My brute force method of doing it:
# rename the column labeled 'x' in a meaningful way ... should be possible to do using varnames (defined above)
cond1_Gender$cond1_Gender_Ct <- cond1_Gender$x
cond1$cond1_Ct <- cond1$x
cond1_Gender$x <- NULL
cond1$x <- NULL
cond1_Gender
# bring back the now meaningfuly named columns into the original dataframe
dUse <- merge(merge(d, cond1), cond1_Gender)
# efficiency of this for large dataset and 8+ aggregated datasets???
dUse_Cond <- dUse[!duplicated(dUse[c("choice", "cond1")]),] # later: use item from list here!
# has extra columns that just get ignored ... but that could be potentially confusing for later users
# repeat for all other grouping variable combinations
# dUse_CondGender <- dUse[!duplicated(dUse[c1[[2]],])] # (DOES NOT WORK!)
How can I accomplish this naming in a more automated fashion so it can work across datafiles and multiple variables?
Bigger Picture: This is only part of a larger situation in which I'm wanting to graph the frequency of the top choices for each grouping combination. I have code that determines the most highly rated occupations for each grouping combination and plots those frequencies, for each of the levels in a grouping category.
ADDED 3-25-14: Clarification of the context in which this arises, although likely beyond the scope of this question.
Individuals have saved a choice (out of a long list). I want to know (graphically & numerically), which choices were the "most frequent" (and their corresponding freq) for different combinations of background grouping variables: i.e., most frequent choices for 5th grade males, or most freq choices for 10th graders in state X, or for all high school students in the treatment condition. I have code that takes me through each individual background grouping: tabulating (as worked on in this problem), finding the 'most frequent' choices, computing statistics on freq for that group, and graphing based on the variables involved in that group. Overall, I'm trying to make this more generic and vectorized so it works more easily with a subsequent user defining what combinations of background variables they want to look at. jlhoward's answer adds count columns for each of the desired combinations to the original dataframe.
Thanks! I hope this makes sense ... I've read a lot of questions and answers here, but haven't tried to post before.