1

Assume following table

Name    Gender  Place Age V1
Tom     M       NY    24  A
Nadia   F       AT    22  A
Alex    M       DE    42  B
Jodie   F       OH    18  B
Tom     M       NY    28  B
Alex    F       ID    32  B
Nadia   F       AT    34  A
Tom     M       OH    18  A

I want to group the table by name and gender, replacing the place and V1 using a majority vote of the joined columns and the age by the numerical mean. The result should be:

Name    Gender  Place Age      V1
Tom     M       NY    23.3334  A
Nadia   F       AT    28       A
Alex    M       DE    42       B
Jodie   F       OH    18       B
Alex    F       ID    32       B

There are three entries for Tom (M) with the place being two times NY and once OH. Per majority vote NJ is more often and thus is chosen. Same for A in V1. The mean of the ages (24, 28 and 18) is 23.3334.

I got the numerical mean working using dplyr:

dt <- dt %>%
    group_by_(.dots=lapply(names(dt)[c(1, 2)], as.symbol)) %>%
    summarise_each(funs(mean))

And can do the majority vote on place and V1 seperate:

dt$place<- dt[, names(which.max(table(place))), by = paste(name, gender)]
dt$V1 <- dt[, names(which.max(table(V1))), by = paste(name, gender)]

My problem with this is performance. I have a very large dataset and these modifications in multiple steps take too long. It would be great to at least use some sort of apply function to do the majority vote in one step. The best would be to add the majority vote into the dplyr function.

  • I use the lappy part because I want to make a script of it and won't know the column names at runtime. Therefore, I get the symbols of the names. Might be a bit overcomplicated but works :) Any improvements are appreciated! Also, my actual dataset has by far more numeric columns which is why I am using summarise_each, but you are right. For this example summarise would be sufficient. Thank you! – Sebastian Hätälä Aug 29 '15 at 08:16
  • @VeerendraGadekar I think you may need to split up the numeric, non-numeric columns, do `summarise_each` separately by the grouping columns, and then `bind_cols` or `join` would be sufficient. – akrun Aug 29 '15 at 08:23
  • 3
    I guess `Nadja` is a typo of `Nadia` – akrun Aug 29 '15 at 08:26

2 Answers2

5

We create a vector of grouping column names ('grpCol'), use setdiff to get the rest of the column names ('nm1'). Loop (sapply) though the 'nm1' columns to check which of these columns are 'numeric' (is.numeric) to return a logical index ('indx').

grpCol <- c('Name', 'Gender')
nm1 <- setdiff(names(df1), grpCol)
indx <- sapply(df1[nm1], is.numeric)

We also create a Mode function to return the element with the maximum frequency.

Mode <- function(x) {
 ux <- unique(x)
 ux[which.max(tabulate(match(x, ux)))]
}

Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by the 'grpCol', we loop through a subset of Subset of Data.table (.SD) using the 'indx' to return mean for numeric columns and mode for non-numeric columns, concatenate (c) to get the expected output.

setDT(df1)[,c(lapply(.SD[, names(indx)[indx], with=FALSE], mean),
      lapply(.SD[, names(indx)[!indx], with=FALSE], Mode)) , 
               by = grpCol]
#   Name Gender      Age Place V1
#1:   Tom      M 23.33333    NY  A
#2: Nadia      F 28.00000    AT  A
#3:  Alex      M 42.00000    DE  B
#4: Jodie      F 18.00000    OH  B
#5:  Alex      F 32.00000    ID  B

Or as @Frank mentioned in the comments, we can do the if/else condition within the lapply instead of creating 'indx'.

setDT(df1)[, lapply(.SD, function(x) {if(is.numeric(x)) mean(x) 
                else Mode(x)} ),  by=.(Name,Gender)]
#    Name Gender Place      Age V1
#1:   Tom      M    NY 23.33333  A
#2: Nadia      F    AT 28.00000  A
#3:  Alex      M    DE 42.00000  B
#4: Jodie      F    OH 18.00000  B
#5:  Alex      F    ID 32.00000  B

data

df1 <- structure(list(Name = c("Tom", "Nadia", "Alex", "Jodie", "Tom", 
"Alex", "Nadia", "Tom"), Gender = c("M", "F", "M", "F", "M", 
"F", "F", "M"), Place = c("NY", "AT", "DE", "OH", "NY", "ID", 
"AT", "OH"), Age = c(24L, 22L, 42L, 18L, 28L, 32L, 34L, 18L), 
V1 = c("A", "A", "B", "B", "B", "B", "A", "A")), .Names = c("Name", 
"Gender", "Place", "Age", "V1"), class = "data.frame",
row.names = c(NA, -8L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Instead of two `lapply`s and the external objects, you could do `df1[, lapply(.SD, function(x) (if (is.numeric(x)) mean else Mode)(x) ), by=.(Name,Gender)]` – Frank Aug 29 '15 at 22:39
1

Here is the dplyr way

library(dplyr)

df1 %>% 
 group_by(Name, Gender) %>% 
 mutate(Age = mean(Age)) %>% 
 filter(Place == names(which.max(table(Place))) & 
           V1 == names(which.max(table(V1)))) %>% unique

#      Name Gender Place      Age V1
#1   Tom      M    NY 23.33333  A
#2 Nadia      F    AT 28.00000  A
#3  Alex      M    DE 42.00000  B
#4 Jodie      F    OH 18.00000  B
#5  Alex      F    ID 32.00000  B
Veerendra Gadekar
  • 4,452
  • 19
  • 24