2

I am trying to calculate the median (but that could be substituted by similar metrics) by group for multiple columns based on subsets defined by other columns. This is direct follow-on question from this previous post of mine. I have attempted to incorporate calculating the median via aggregate into the Map(function(x,y) dosomething, x, y) solution kindly provided by @Frank, but that didn't work. Let me illustrate:

Calculate median for A and B by groups GRP1 and GRP2

df <- data.frame(GRP1 = c("A","A","A","A","A","A","B","B","B","B","B","B"), GRP2 = c("A","A","A","B","B","B","A","A","A","B","B","B"), A = c(0,4,6,7,0,1,9,0,0,8,3,4), B = c(6,0,4,8,6,7,0,9,9,7,3,0))

med <- aggregate(.~GRP1+GRP2,df,FUN=median)

Simple. Now add columns defining which rows to be used for calculating the median, i.e. rows with NAs should be dropped, column a defines which rows to be used for calculating the median in column A, same for columns b and B:

a <- c(1,4,7,3,NA,3,7,NA,NA,4,8,1)
b <- c(5,NA,7,9,5,6,NA,8,1,7,2,9)
df1 <- cbind(df,a,b)

As mentioned above, I have tried combining Map and aggregate, but that didn't work. I assume that Map doesn't know what to do with GRP1 and GRP2.

med1 <- Map(function(x,y) aggregate(.~GRP1+GRP2,df1[!is.na(y)],FUN=median), x=df1[,3:4], y=df1[, 5:6])

This is the result I'm looking for:

  GRP1 GRP2 A B
1    A    A 4 5
2    B    A 9 9
3    A    B 4 7
4    B    B 4 3

Any help will be much appreciated!

tjebo
  • 21,977
  • 7
  • 58
  • 94
M.Teich
  • 575
  • 5
  • 22
  • 2
    Could do the same as Frank showed you in the previous post: `f <- function(x, y) median(x[!is.na(y)]) ; df1[, Map(f, .SD[, 1:2], .SD[, 3:4]), by = .(GRP1, GRP2)]` – David Arenburg Aug 29 '18 at 08:45
  • @DavidArenburg This would be a very concise answer, as I don't have to code for every column separately. Unfortunately, I get the following error message: `> df1[, Map(f, .SD[, 1:2], .SD[, 3:4]), by = .(GRP1, GRP2)] Error in `[.data.frame`(df1, , Map(f, .SD[, 1:2], .SD[, 3:4]), by = .(GRP1, : unused argument (by = .(GRP1, GRP2))` – M.Teich Aug 29 '18 at 08:57
  • You need to convert `df1` to a data.table ofcourse, i.e. `setDT(df1)` – David Arenburg Aug 29 '18 at 08:58
  • Of course - thanks! – M.Teich Aug 29 '18 at 09:00
  • @DavidArenburg At the risk of exposing my ignorance: why does `.SD[, 1:2] and .SD[, 3:4]` code for columns 3:4 and 5:6, respectively? – M.Teich Aug 29 '18 at 09:16
  • Because we used the first 2 columns in the `by` statement so `.SD` is unaware of them anymore. – David Arenburg Aug 29 '18 at 09:24

2 Answers2

2

Using data.table

library(data.table)
setDT(df1)

df1[, .(A = median(A[!is.na(a)]), B = median(B[!is.na(b)])), by = .(GRP1, GRP2)]

   GRP1 GRP2 A B
1:    A    A 4 5
2:    A    B 4 7
3:    B    A 9 9
4:    B    B 4 3

Same logic in dplyr

library(dplyr)
df1 %>%
  group_by(GRP1, GRP2) %>%
  summarise(A = median(A[!is.na(a)]), B = median(B[!is.na(b)]))

The original df1:

df1 <- data.frame(
  GRP1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  GRP2 = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"),
  A = c(0, 4, 6, 7, 0, 1, 9, 0, 0, 8, 3, 4),
  B = c(6, 0, 4, 8, 6, 7, 0, 9, 9, 7, 3, 0),
  a = c(1, 4, 7, 3, NA, 3, 7, NA, NA, 4, 8, 1),
  b = c(5, NA, 7, 9, 5, 6, NA, 8, 1, 7, 2, 9)
)
s_baldur
  • 29,441
  • 4
  • 36
  • 69
1

With dplyr:

library(dplyr)
df1 %>% 
  mutate(A = ifelse(is.na(a), NA, A),
         B = ifelse(is.na(b), NA, B)) %>% 
# I use this to put as NA the values we don't want to include
  group_by(GRP1, GRP2) %>% 
  summarise(A = median(A, na.rm = T),
            B = median(B, na.rm = T))

# A tibble: 4 x 4
# Groups:   GRP1 [?]
  GRP1  GRP2      A     B
  <fct> <fct> <dbl> <dbl>
1 A     A         4     5
2 A     B         4     7
3 B     A         9     9
4 B     B         4     3
RLave
  • 8,144
  • 3
  • 21
  • 37