Assume following table
Name Gender Place Age V1
Tom M NY 24 A
Nadia F AT 22 A
Alex M DE 42 B
Jodie F OH 18 B
Tom M NY 28 B
Alex F ID 32 B
Nadia F AT 34 A
Tom M OH 18 A
I want to group the table by name and gender, replacing the place and V1 using a majority vote of the joined columns and the age by the numerical mean. The result should be:
Name Gender Place Age V1
Tom M NY 23.3334 A
Nadia F AT 28 A
Alex M DE 42 B
Jodie F OH 18 B
Alex F ID 32 B
There are three entries for Tom (M) with the place being two times NY and once OH. Per majority vote NJ is more often and thus is chosen. Same for A in V1. The mean of the ages (24, 28 and 18) is 23.3334.
I got the numerical mean working using dplyr:
dt <- dt %>%
group_by_(.dots=lapply(names(dt)[c(1, 2)], as.symbol)) %>%
summarise_each(funs(mean))
And can do the majority vote on place and V1 seperate:
dt$place<- dt[, names(which.max(table(place))), by = paste(name, gender)]
dt$V1 <- dt[, names(which.max(table(V1))), by = paste(name, gender)]
My problem with this is performance. I have a very large dataset and these modifications in multiple steps take too long. It would be great to at least use some sort of apply function to do the majority vote in one step. The best would be to add the majority vote into the dplyr function.