
I have a dataframe where each row is a unique user, and there is a column where each user's choice is represented along with columns for background variables. In the actual dataset, there are 800+ choices and background variables can have between 2 and 60+ levels.

What I am doing is finding total counts for each choice based on a variety of different grouping combinations. I have gotten that successfully, but I now want to name them in a meaningful way.

Source that helped for the first part of this question: How to use aggregate with a list of column names

# sample data
d <- data.frame(user_id = 1:12,
                choice = rep(c("apple", "banana", "banana"), 4),
                cond1 = c("a1", "a1", "a2", "a1", "a2", "a1", "a2", "a2", "a3", "a3", "a3", "a3"),
                gender = rep(c("F", "M"), 6))

# list of the different grouping combinations (choice will always be one of these terms)
c1 <- list(c("choice", "cond1"), c("choice", "cond1","gender"))  

# variable to be aggregated over
v1 <- c("user_id")    

varNames <- c("cond1Ct", "cond1_GenderCt")         # for later use

counts <- mapply(FUN= function(z,y, ...) {aggregate(d[ , y], by=d[z], ...)},
       c1, v1, MoreArgs=list(FUN=length) )

# assign each part of the output to its own dataframe for later use
cond1 <- counts[[1]]
cond1_Gender <- counts[[2]]

So this works so far.

As seen with Name columns within aggregate in R, it is possible to assign a column name to the variable you are aggregating over. I couldn't get this to work inside the mapply call. That is my primary question here: how can I automate the naming of the tabulated variables for each grouping combination?
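The naming idea from that question, as I understand it, is to hand aggregate a named column, e.g. aggregate(list(myCount = d$user_id), ...). Roughly the kind of thing I was attempting, written here as a sketch with Map and setNames (countsNamed is just an illustrative name):

# sketch: setNames() gives the aggregated column its meaningful name before
# aggregate() sees it, so no renaming is needed afterwards
countsNamed <- Map(function(z, nm) aggregate(setNames(d[v1], nm), by = d[z], FUN = length),
                   c1, varNames)
countsNamed[[1]]   # the count column is called cond1Ct rather than x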

My brute force method of doing it:

# rename the column labeled 'x' in a meaningful way ... should be possible to do using varNames (defined above)

cond1_Gender$cond1_Gender_Ct <- cond1_Gender$x          
cond1$cond1_Ct <- cond1$x
cond1_Gender$x <- NULL
cond1$x <- NULL

cond1_Gender

# bring back the now meaningfully named columns into the original dataframe    
dUse <- merge(merge(d, cond1), cond1_Gender)  
     # efficiency of this for large dataset and 8+ aggregated datasets???
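One alternative I've been considering for the real case, where there are many count dataframes, is a Reduce-based merge over a list (a sketch only; countList is just an illustrative name holding the renamed count dataframes):

countList <- list(cond1, cond1_Gender)        # would hold all 8+ count dataframes
dUse <- Reduce(function(x, y) merge(x, y), countList, init = d)

This at least avoids hand-nesting the merge calls, though I don't know how well it scales.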



dUse_Cond <- dUse[!duplicated(dUse[c("choice", "cond1")]),]   # later: use item from list here!
# has extra columns that just get ignored ... but that could be potentially confusing for later users

# repeat for all other grouping variable combinations

# dUse_CondGender <- dUse[!duplicated(dUse[c1[[2]],])]   # (DOES NOT WORK!)
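Looking at that last line again, I think the problem is just that the row/column comma sits inside the inner single-bracket call; something like this sketch is what I was reaching for:

dUse_CondGender <- dUse[!duplicated(dUse[c1[[2]]]), ]   # comma after duplicated(), not inside it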

How can I accomplish this naming in a more automated fashion so it can work across datafiles and multiple variables?

Bigger picture: this is only part of a larger situation in which I want to graph the frequency of the top choices for each grouping combination. I have code that determines the most frequent occupations for each grouping combination and plots those frequencies for each of the levels in a grouping category.


ADDED 3-25-14: Clarification of the context in which this arises, although likely beyond the scope of this question.

Individuals have saved a choice (out of a long list). I want to know (graphically and numerically) which choices were the "most frequent" (and their corresponding frequencies) for different combinations of background grouping variables: e.g., the most frequent choices for 5th grade males, the most frequent choices for 10th graders in state X, or for all high school students in the treatment condition. I have code that takes me through each individual background grouping: tabulating (as worked on in this problem), finding the 'most frequent' choices, computing statistics on frequency for that group, and graphing based on the variables involved in that group. Overall, I'm trying to make this more generic and vectorized so it works more easily with a subsequent user defining which combinations of background variables they want to look at. jlhoward's answer adds count columns for each of the desired combinations to the original dataframe.
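To make the 'most frequent' step concrete, a rough sketch using the cond1 counts from above (the cond1_Ct name follows the brute-force renaming earlier; topByCond1 is just an illustrative name):

# for each level of cond1, keep the choice(s) tied for the highest count
topByCond1 <- do.call(rbind, lapply(split(cond1, cond1$cond1), function(g) {
  g[g$cond1_Ct == max(g$cond1_Ct), ]
}))
topByCond1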


Thanks! I hope this makes sense ... I've read a lot of questions and answers here, but haven't tried to post before.

jc7
  • Writing up this question also brings me back to a larger issue about how to structure the data in the first place - so I'm open to suggestions on that as well, as this is potentially a clumsy way of doing this. I've gone back and forth on (1) do I want all of these different groupings of aggregation brought back into my original dataframe OR (2) do I keep them in separate dataframes, potentially ending up with a list of dataframes. Either case involves the issue of assigning names (either to variables and dataframes or dataframes), ideally from a list. Option 1 is what I'm working on above. – jc7 Mar 22 '14 at 18:55

3 Answers


I think this can be simplified greatly using the table function (and class) with its as.data.frame method producing an object suitable for merging:

counts <- lapply(c1, function(cond) as.data.frame(table(d[cond])))
# That returns a list of two data frames, each with a 'Freq' count column
# (the name comes from the as.data.frame step).

d[order(d[2],d[3],d[4]), varNames] <- lapply(counts, function( cts) {
               merge(d[order(d[2],d[3],d[4]), ], cts )[['Freq']] })
# Could also have `cbind`-ed it; the `d[ , varNames] <-` assignment is what supplies the new column names.
# Could also have used `setNames` on the RHS.

#------------

> d
   user_id choice cond1 gender cond1Ct cond1_GenderCt
1        1  apple    a1      F       2              1
2        2 banana    a1      M       2              2
3        3 banana    a2      F       3              2
4        4  apple    a1      M       2              1
5        5 banana    a2      F       3              2
6        6 banana    a1      M       2              2
7        7  apple    a2      F       1              1
8        8 banana    a2      M       3              1
9        9 banana    a3      F       3              2
10      10  apple    a3      M       1              1
11      11 banana    a3      F       3              2
12      12 banana    a3      M       3              1

I will admit that I ran down a dead-end rabbit hole trying to get the ave function to deliver the count vectors, but it did not accept a list for its grouping arguments. I reviewed an earlier function I had developed and saw that table does accept a list. My second admission is that I did not realize that assignment to an ordered position would not reorder the original object:

> a <- 10:1
> a[order(a)][2] <-100
> a
 [1]  10   9   8   7   6   5   4   3 100   1  # surprised me anyway.
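For what it's worth, in this example the nested replacement ends up doing the same thing as indexing the position directly:

a <- 10:1
a[order(a)[2]] <- 100    # only the second-smallest element is replaced
a
#  [1]  10   9   8   7   6   5   4   3 100   1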

The as.data.frame method for table objects just creates a "long" dataframe from the table entries, with the Freq column holding the counts (shown here on the last three columns of the augmented d, purely to illustrate the reshaping):

 as.data.frame( table(d[-(1:3)]) )
#-----------------------
   gender cond1Ct cond1_GenderCt Freq
1       F       1              1    1
2       M       1              1    1
3       F       2              1    1
4       M       2              1    1
5       F       3              1    1
6       M       3              1    1
7       F       1              2    0
8       M       1              2    0
9       F       2              2    1
10      M       2              2    1
11      F       3              2    2
12      M       3              2    2
> table(d[-(1:3)]) 
, , cond1_GenderCt = 1

      cond1Ct
gender 1 2 3
     F 1 1 1
     M 1 1 1

, , cond1_GenderCt = 2

      cond1Ct
gender 1 2 3
     F 0 1 2
     M 0 1 2
IRTFM
  • Wow - amazingly simple and works beautifully! I'll have to play around with the as.data.frame(table(d[cond])) bit to understand how it works. I can understand dead-ends ... I have been down many, many dead-ends with this project but keep learning new ways of doing something. – jc7 Mar 22 '14 at 23:59
  • So as I tried to use this with real data, encountered a different problem, and returned to work with your code further, I found the counts aren't the same in the merged dataframe once I've applied the variable names. The counts themselves are correct. But then, when the labels are applied, as seen in your output, those numbers aren't maintained --- banana a2 on line 5 says that combination occurs 2 times; on line 8, it says that combination occurs 3 times. Last col also inconsistent with the actual data. My tries at separating naming from merging aren't working so far ... Any suggestions? – jc7 Mar 23 '14 at 01:12
  • I see I didn't correct an error in arguments to the `merge` call ... thought I had pasted in that correction. See if it works now. – IRTFM Mar 23 '14 at 03:38
  • Thanks for still helping with this. I see that you switched it to be recursive -- do you mind explaining why? `counts` is still correct. `d_Merged_Sort <- d[with(d, order(choice, cond1, gender)), ]` shows that things are still getting mixed up. – jc7 Mar 23 '14 at 15:06
  • I wouldn't have called what I did recursive. It was a serial merger of two different data.frames with the original `d` dataframe. Is there a particular category you think is incorrectly matched? – IRTFM Mar 23 '14 at 16:23
  • If I look at `counts[[1]]`, I see that apple in cond1 == a1 has a frequency of 2, which is correct. When I look at `d_Merged_Sort` (or just d), I see that apple in cond1 == a1 has first a frequency of 2 (when F) and a frequency of 1 (when M). Those should both be 2 (as seen in the counts dataframe). Similarly, take a look at banana, cond == a2. The counts vary both within cond1Ct and within a given gender for cond1_GenderCt, where it should be repetition of the value from the `counts`. Do you see that when you run it also? – jc7 Mar 23 '14 at 16:44
  • I do see that the two lines of apple/a1 and banana/a2 should be the same in cond1Ct and are not. Will try to understand how that is happening. I suspect that different ordering of the two merge operations is the problem. – IRTFM Mar 23 '14 at 17:02
  • Ordering both sides of the assignment fixes the problem... I think. – IRTFM Mar 23 '14 at 17:20
  • It seems like this may be getting more specific for this case, and thus hurting how generalizable it is; e.g., how would it handle different number of grouping variables? Mine is complaining about subscript out of bounds. It seems like perhaps a combination of your code and the code of @jlhoward will be the way to go. I've definitely learned some new things ....... The for loop used there re-names the Freq column and I will follow up on that comment with questions about how to merge it all together. `lapply(counts, function( cts) merge(d, cts))` merges within each list ... – jc7 Mar 23 '14 at 18:26

Your call to mapply(...) is more complex than it needs to be:

counts   <- mapply(FUN= function(z,y) aggregate(d[ , y], by=d[z], FUN=length),
                      c1, "user_id")

If all you want to do is automate renaming the column x in each dataframe to something else, this will work:

# rename all "x" columns
for (i in seq_along(counts))
  colnames(counts[[i]])[ncol(counts[[i]])] <- varNames[i]
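An equivalent sketch without the explicit loop, in case you prefer Map (same effect, just a style choice):

counts <- Map(function(cts, nm) { colnames(cts)[ncol(cts)] <- nm; cts },
              counts, varNames)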

To address your core issue, I'd need to see an example of what you mean by "graph the frequency of the top choices for each grouping combination."

EDIT (Response to OP's comment)

If your intermediate goal is to combine everything into a single data frame, then there is an easier way. Note that this leaves the aggregated columns named x until the end.

counts   <- mapply(FUN= function(z,y) aggregate(d[ , y], by=d[z], FUN=length),
                   c1, "user_id")
mrg <- lapply(counts,function(df)merge(d,df)[,c("user_id","x")])
mrg <- do.call(cbind,lapply(mrg,function(df)merge(d,df,by="user_id")$x))
colnames(mrg) <- varNames
result <- cbind(d,mrg)
result
#    user_id choice cond1 gender cond1Ct cond1_GenderCt
# 1        1  apple    a1      F       2              1
# 2        2 banana    a1      M       2              2
# 3        3 banana    a2      F       3              2
# 4        4  apple    a1      M       2              1
# 5        5 banana    a2      F       3              2
# 6        6 banana    a1      M       2              2
# 7        7  apple    a2      F       1              1
# 8        8 banana    a2      M       3              1
# 9        9 banana    a3      F       3              2
# 10      10  apple    a3      M       1              1
# 11      11 banana    a3      F       3              2
# 12      12 banana    a3      M       3              1

The first use of lapply(...)

mrg <- lapply(counts,function(df)merge(d,df)[,c("user_id","x")])

creates a list of data frames wherein each associates user_id with the count for the appropriate combination. Then,

mrg <- do.call(cbind,lapply(mrg,function(df)merge(d,df,by="user_id")$x))

combines the x column from each into a single matrix (one column per grouping combination) whose rows are ordered by user_id. Finally,

result <- cbind(d,mrg)

combines the columns with the original data frame d, which is already in user_id order.

Again, it would be much better to understand your ultimate goal, as there is almost certainly a way to achieve it without going through all this.

jlhoward
  • Couldn't the second step have been `names(counts) <- varNames`? – IRTFM Mar 22 '14 at 22:22
  • Thanks jlhoward ... this works as well. I marked @IShouldBuyABoat's as the answer because it seemed more straightforward for my current approach, but I like how you got the names onto the separate dataframes and will hang onto that code. – jc7 Mar 23 '14 at 00:00
  • @IShouldBuyABoat when I tried your second step, it renamed the names of the items of each list (not sure on terminology there) and not the vector labeled x within each dataframe in the list. – jc7 Mar 23 '14 at 00:04
  • I liked how the other response merged it all back together, but I got stuck trying to figure out some issues with re-ordering. Your for loop nicely names it in each dataframe (the direct question), and I'm now merging the lists that are in counts into a single dataframe. That's a separate issue and there are other stackoverflow posts on it, although I haven't been able to get them to work yet ... After the for loop, I'm using `d2 <- lapply(counts, function( cts) merge(d, cts))` and then `Reduce(function(x,y) merge(x,y,by="user_id"),d2)` and just need to get rid of the duplicated columns. – jc7 Mar 23 '14 at 18:45
  • That works! And it transferred to my actual data. A couple notes for others' future use of this code: (1) you have to be careful about where NAs are (and previously handle them) so that all columns are the same length in the 2nd change to mrg and (2) each row has to have a unique identifier. (These both came up when I applied it to my data.) Yay! Moving forwards to the next step as checking the results seems to give the expected behavior so far. ------ In terms of the bigger picture, I added a section at the end of my original post, but I think that likely goes beyond the scope of this ques. – jc7 Mar 25 '14 at 21:06

Using the package plyr seems to greatly simplify the code and handle both grouping variables that have missing values and instances where one id has multiple choices (both of these came up when I took this back to the larger dataset).

library (plyr)
d2 <- data.frame(user_id = 1:12, choice = rep(c("apple", "banana", "banana"),4), 
                 cond1 = c("a1", "a1", "a2", "a1", "a2", "a1", "a2", "a2", "a3", "a3", "a3", "a3"), 
                 gender = c(rep(c("F", "M"), 6)))

d2$user_id[7] <- 5         # modify the dataset some
d2$gender[10] <- NA

tmp1 <- ddply(d2, ~cond1 + gender + choice, summarize, cond1_GenderCt = length(choice))     
tmp2 <- ddply(d2, ~cond1 + choice, summarize, cond1Ct = length(choice))     
result2 <- merge (tmp2, merge(tmp1, d2))
result2

This creates one dataframe with named variables that has brought back in the frequencies of each choice within each set of grouping variables.
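A quick sanity check against the raw data (just a direct cross-tabulation of the grouping columns) is one way to confirm the counts line up:

with(d2, table(cond1, choice))   # compare with the cond1Ct values in result2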

EDIT: So I apparently forgot the main point of my own question! Handling different combinations of variables.

doddply <- function(df, x){
  ddply(df,x,summarize,nChoice = length(choice))
}

lapply(c1, function(x) doddply(d2, x))   # c1 = the list of grouping combinations from the question

It seems like a variant of the doddply function above, one that also takes the varNames list and the name to use in place of nChoice and is called via mapply, would help, but I couldn't get that to work.
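For anyone attempting the same thing, here is a sketch of roughly what I had in mind, written with Map rather than mapply (doddply2 and countList are just illustrative names, and I haven't taken this beyond the sample data). Even with the names handled this way, the merging step from @jlhoward's answer is still needed:

doddply2 <- function(df, groupVars, countName) {
  # one count column per grouping combination, named via setNames
  ddply(df, groupVars, function(g) setNames(data.frame(nrow(g)), countName))
}
countList <- Map(function(g, nm) doddply2(d2, g, nm), c1, varNames)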

So this ends up exactly the same as @jlhoward's solution ... the code there after the counts variable is still what is needed for naming and merging. (I'm leaving this here though as just another way to get to that point).

jc7