R apply mean for numerical columns and majority vote on categorical

Question

Assume following table

Name    Gender  Place Age V1
Tom     M       NY    24  A
Nadia   F       AT    22  A
Alex    M       DE    42  B
Jodie   F       OH    18  B
Tom     M       NY    28  B
Alex    F       ID    32  B
Nadia   F       AT    34  A
Tom     M       OH    18  A

I want to group the table by name and gender, replacing the place and V1 using a majority vote of the joined columns and the age by the numerical mean. The result should be:

Name    Gender  Place Age      V1
Tom     M       NY    23.3334  A
Nadia   F       AT    28       A
Alex    M       DE    42       B
Jodie   F       OH    18       B
Alex    F       ID    32       B

There are three entries for Tom (M) with the place being two times NY and once OH. Per majority vote NJ is more often and thus is chosen. Same for A in V1. The mean of the ages (24, 28 and 18) is 23.3334.

I got the numerical mean working using dplyr:

dt <- dt %>%
    group_by_(.dots=lapply(names(dt)[c(1, 2)], as.symbol)) %>%
    summarise_each(funs(mean))

And can do the majority vote on place and V1 seperate:

dt$place<- dt[, names(which.max(table(place))), by = paste(name, gender)]
dt$V1 <- dt[, names(which.max(table(V1))), by = paste(name, gender)]

My problem with this is performance. I have a very large dataset and these modifications in multiple steps take too long. It would be great to at least use some sort of apply function to do the majority vote in one step. The best would be to add the majority vote into the dplyr function.

I use the lappy part because I want to make a script of it and won't know the column names at runtime. Therefore, I get the symbols of the names. Might be a bit overcomplicated but works :) Any improvements are appreciated! Also, my actual dataset has by far more numeric columns which is why I am using summarise_each, but you are right. For this example summarise would be sufficient. Thank you! — Sebastian Hätälä, Aug 29 '15 at 08:16
@VeerendraGadekar I think you may need to split up the numeric, non-numeric columns, do `summarise_each` separately by the grouping columns, and then `bind_cols` or `join` would be sufficient. — akrun, Aug 29 '15 at 08:23

akrun · Accepted Answer · 2015-08-30T02:30:53.373

We create a vector of grouping column names ('grpCol'), use setdiff to get the rest of the column names ('nm1'). Loop (sapply) though the 'nm1' columns to check which of these columns are 'numeric' (is.numeric) to return a logical index ('indx').

grpCol <- c('Name', 'Gender')
nm1 <- setdiff(names(df1), grpCol)
indx <- sapply(df1[nm1], is.numeric)

We also create a Mode function to return the element with the maximum frequency.

Mode <- function(x) {
 ux <- unique(x)
 ux[which.max(tabulate(match(x, ux)))]
}

Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by the 'grpCol', we loop through a subset of Subset of Data.table (.SD) using the 'indx' to return mean for numeric columns and mode for non-numeric columns, concatenate (c) to get the expected output.

setDT(df1)[,c(lapply(.SD[, names(indx)[indx], with=FALSE], mean),
      lapply(.SD[, names(indx)[!indx], with=FALSE], Mode)) , 
               by = grpCol]
#   Name Gender      Age Place V1
#1:   Tom      M 23.33333    NY  A
#2: Nadia      F 28.00000    AT  A
#3:  Alex      M 42.00000    DE  B
#4: Jodie      F 18.00000    OH  B
#5:  Alex      F 32.00000    ID  B

Or as @Frank mentioned in the comments, we can do the if/else condition within the lapply instead of creating 'indx'.

setDT(df1)[, lapply(.SD, function(x) {if(is.numeric(x)) mean(x) 
                else Mode(x)} ),  by=.(Name,Gender)]
#    Name Gender Place      Age V1
#1:   Tom      M    NY 23.33333  A
#2: Nadia      F    AT 28.00000  A
#3:  Alex      M    DE 42.00000  B
#4: Jodie      F    OH 18.00000  B
#5:  Alex      F    ID 32.00000  B

data

df1 <- structure(list(Name = c("Tom", "Nadia", "Alex", "Jodie", "Tom", 
"Alex", "Nadia", "Tom"), Gender = c("M", "F", "M", "F", "M", 
"F", "F", "M"), Place = c("NY", "AT", "DE", "OH", "NY", "ID", 
"AT", "OH"), Age = c(24L, 22L, 42L, 18L, 28L, 32L, 34L, 18L), 
V1 = c("A", "A", "B", "B", "B", "B", "A", "A")), .Names = c("Name", 
"Gender", "Place", "Age", "V1"), class = "data.frame",
row.names = c(NA, -8L))

Instead of two `lapply`s and the external objects, you could do `df1[, lapply(.SD, function(x) (if (is.numeric(x)) mean else Mode)(x) ), by=.(Name,Gender)]` — Frank, Aug 29 '15 at 22:39

Veerendra Gadekar · Answer 2 · 2015-08-29T08:45:11.797

1

Here is the dplyr way

library(dplyr)

df1 %>% 
 group_by(Name, Gender) %>% 
 mutate(Age = mean(Age)) %>% 
 filter(Place == names(which.max(table(Place))) & 
           V1 == names(which.max(table(V1)))) %>% unique

#      Name Gender Place      Age V1
#1   Tom      M    NY 23.33333  A
#2 Nadia      F    AT 28.00000  A
#3  Alex      M    DE 42.00000  B
#4 Jodie      F    OH 18.00000  B
#5  Alex      F    ID 32.00000  B

edited Aug 29 '15 at 08:45

answered Aug 29 '15 at 08:15

Veerendra Gadekar

4,452
19
24

1

I think the OP want to do the `max(table` on two columns – akrun Aug 29 '15 at 08:16
1

@akrun have edited the answer, can you have a look at it – Veerendra Gadekar Aug 29 '15 at 08:52
1

Looks good, but if there are multiple columns, you may have to add the `summarise_each` approach I mentioned earlier. – akrun Aug 29 '15 at 10:13

R apply mean for numerical columns and majority vote on categorical

2 Answers2

data