
Given a dataframe,

ID <- c("a","b","b","c","c","c","d","d","d")
dummy1 <- c(1,0,1,1,0,0,1,1,0)
dummy2 <- c(0,0,0,0,1,1,1,1,1)
dummy3 <- c(1,0,0,1,1,0,0,1,1)
df <- data.frame(ID,dummy1,dummy2,dummy3)

  ID dummy1 dummy2 dummy3
1  a      1      0      1
2  b      0      0      0
3  b      1      0      0
4  c      1      0      1
5  c      0      1      1
6  c      0      1      0
7  d      1      1      0
8  d      1      1      1
9  d      0      1      1

I want to calculate the mean of each dummy variable within each ID group, for a set of multiple dummy variables.

It would be like using tapply, aggregate, or ave(x, y, FUN = mean), but applied to several columns at once while creating the new variables/columns in the same step. Unfortunately, I don't know the number of dummy variables in advance. The only thing I know is that the dummy variables start in column 2. My result would look like this:

ID     m_dummy1  m_dummy2  m_dummy3   m_dummy5...
a      1         0         1
b      0.5       0         0
c      0.33      0.66      0.66
d      0.66      1         0.66

or like this:

ID     m_dummy1  m_dummy2  m_dummy3   m_dummy5...
a ...  1         0         1
b ...  0.5       0         0
b ...  0.5       0         0
c ...  0.33      0.66      0.66
c ...  0.33      0.66      0.66
c ...  0.33      0.66      0.66
d ...  0.66      1         0.66    
d ...  0.66      1         0.66
d ...  0.66      1         0.66

In my scenario, I have an unknown number of dummies from 1 to x: I might have dummy2 only, or I might have dummy1 plus the fictional dummy5 and dummy6. The perfect solution would create an "m_dummy" column for every column from column 2 onwards. It should therefore also work if dummy3 was missing, or if there was an additional dummy4, e.g. `dummy4 <- c(1,0,0,0,0,0,0,1,0)`.

rmuc8

2 Answers


Here are a couple of data.table approaches that make it easy to do either of your alternatives:

library(data.table)

as.data.table(df)[, lapply(.SD, mean), by = ID]
#    ID    dummy1    dummy2    dummy3
# 1:  a 1.0000000 0.0000000 1.0000000
# 2:  b 0.5000000 0.0000000 0.0000000
# 3:  c 0.3333333 0.6666667 0.6666667
# 4:  d 0.6666667 1.0000000 0.6666667

as.data.table(df)[, names(df)[-1] := lapply(.SD, mean), by = ID][]
#    ID    dummy1    dummy2    dummy3
# 1:  a 1.0000000 0.0000000 1.0000000
# 2:  b 0.5000000 0.0000000 0.0000000
# 3:  b 0.5000000 0.0000000 0.0000000
# 4:  c 0.3333333 0.6666667 0.6666667
# 5:  c 0.3333333 0.6666667 0.6666667
# 6:  c 0.3333333 0.6666667 0.6666667
# 7:  d 0.6666667 1.0000000 0.6666667
# 8:  d 0.6666667 1.0000000 0.6666667
# 9:  d 0.6666667 1.0000000 0.6666667
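If the data ever gains columns that are not dummies, a possible variation (my assumption, not something asked in the question) is to restrict `.SD` to the relevant columns via `.SDcols`:

cols <- names(df)[-1]   # assumes the dummies are everything after the ID column
as.data.table(df)[, lapply(.SD, mean), by = ID, .SDcols = cols]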

The base R equivalents of the above would be:

aggregate(. ~ ID, df, mean)

and

df[-1] <- lapply(df[-1], function(x) ave(x, df[[1]], FUN = mean))
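If you also want the `m_` prefix from the desired output, one possible follow-up (just a sketch; `res` is an assumed name for the aggregated result) is to rename the columns afterwards:

res <- aggregate(. ~ ID, df, mean)
names(res)[-1] <- paste0("m_", names(res)[-1])   # dummy1 -> m_dummy1, etc.
res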
A5C1D2H2I1M1N2O1R2T1

You could try `summarise_each` or `mutate_each` from dplyr:

library(dplyr)
df %>% 
    group_by(ID) %>% 
    summarise_each(funs(mean), starts_with('dummy'))
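If you want the second layout instead (all rows kept, with the group means repeated within each ID), a minimal sketch using the `mutate_each` variant mentioned above would be:

df %>% 
    group_by(ID) %>% 
    mutate_each(funs(mean), starts_with('dummy'))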
akrun
  • What a quick and helpful answer, thanks a lot! Is there a way to specify the first dummy column (i.e. the second column of the data frame) with this solution, instead of `starts_with('dummy')`? – rmuc8 Feb 19 '15 at 16:11
  • @MarkusRehm Yes, there are several options. Please check `?select` – akrun Feb 19 '15 at 16:13
  • I see the special functions, but none of them refers to a column number. I also tried things like `starts_with(df[[2]]))`, but none of them work. I could solve this issue somehow with `one_of()` but it is not ideal. I have no previous experience with `?select`, so I'd really appreciate a hint from you. – rmuc8 Feb 19 '15 at 16:35
  • @MarkusRehm Suppose you want columns 2 and 3, `df %>% group_by(ID) %>% summarise_each(funs(mean), 2:3)` – akrun Feb 19 '15 at 16:36
  • it would be `2:"last column"`, but I won't know in advance how many columns there are – rmuc8 Feb 19 '15 at 16:38
  • @MarkusRehm Try `df %>% group_by(ID) %>% summarise_each(funs(mean), 2:ncol(df))` – akrun Feb 19 '15 at 16:39