Rowwise median for multiple columns using dplyr

Question

Given the following dataset, I want to compute, for each row, the median of the columns M1, M2 and M3. I am looking for a solution where the result is added to the data frame as a new column named 'Median'. The column names (M1:M3) should not be typed out directly (the original dataset has many more columns, not just 3).

# A tibble: 8 x 5
 I1    M1    M2    I2    M3
<int> <int> <int> <int> <int>
1     3     4     5     3     5
2     2     2     2     2     1
3     2     2     2     2     2
4     3     1     3     3     1
5     2     1     3     3     1
6     3     2     4     4     3
7     3     1     3     4     1
8     2     1     3     2     3

You can load the dataset using:

df = structure(list(I1 = c(3L, 2L, 2L, 3L, 2L, 3L, 3L, 2L), M1 = c(4L, 
2L, 2L, 1L, 1L, 2L, 1L, 1L), M2 = c(5L, 2L, 2L, 3L, 3L, 4L, 3L, 
3L), I2 = c(3L, 2L, 2L, 3L, 3L, 4L, 4L, 2L), M3 = c(5L, 1L, 2L, 
1L, 1L, 3L, 1L, 3L)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .Names = c("I1", "M1", "M2", "I2", 
"M3"))

I know that several similar questions have already been asked. However, most solutions posted use rowMeans or rowSums. I'm looking for a solution where:

1. no 'row-function' is used.
2. the solution is a simple dplyr solution.

The reason for (2) is that I am teaching the 'tidyverse' to total beginners.

If no row function can be used, the `gather` approach could be used. Is that fine? — akrun, Dec 12 '17 at 13:16
Total beginners should be taught `apply(df[, paste0("M", 1:3)], 1, median)` — talat, Dec 12 '17 at 13:20
@akrun Thanks, but I am surprised though that there is no simpler way to achieve this. I thought this to be a pretty simple task. — beginneR, Dec 12 '17 at 13:21
The reason for this is that your data structure is considered "untidy" — talat, Dec 12 '17 at 13:23
@docendodiscimus Nonetheless it is still a very common data structure in my opinion. — beginneR, Dec 12 '17 at 13:26
There's also a very simple solution in R, as I commented above. It seems like you're expecting the dplyr/tidyverse to replace all base R functions but that is not the case — talat, Dec 12 '17 at 13:27
I'm a BIG `dplyr` and `tidyverse` fan, but I'd have to agree with @docendodiscimus on that. I'd recommend -at least- baseR and `tidyverse` side by side, especially for those simple tasks. Much more useful for the students to spot similarity in results and differences in syntax. — AntoniosK, Dec 12 '17 at 13:37
If the beginners are not listening, try with some sound effects i.e `beepr::beep(7)` :-) — akrun, Dec 12 '17 at 13:41
@docendodiscimus This is indeed a simple solution but I was just interested in whether there is a simple dplyr solution as well. I was hoping to find one in order to be able to stay within the tidyverse as long as possible without having to teach both. — beginneR, Dec 12 '17 at 13:41
@beginneR, apropos of nothing, you may want to check out [cseducators.se]. It sounds like it might be a useful site for you. — Ben I., Dec 20 '17 at 14:19

akrun · Accepted Answer · 2017-12-12T13:32:12.370

5

We could use rowMedians from the matrixStats package

library(matrixStats)
library(dplyr)
df %>% 
    mutate(Median = rowMedians(as.matrix(.[grep('M\\d+', names(.))])))

Or if we need to use only tidyverse functions, convert it to 'long' format with gather, summarize by row and get the median of the 'value' column

library(tidyr)   # provides gather()
df %>% 
    # integer row id; a character id would sort "10" before "2"
    mutate(rn = row_number()) %>%
    gather(key, value, starts_with('M')) %>%
    group_by(rn) %>% 
    summarise(Median = median(value)) %>%
    select(-rn) %>%
    bind_cols(df, .)
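
As a hedged aside: in tidyr 1.0.0 and later, gather is superseded by pivot_longer. The following is a minimal sketch of the same long-format idea, assuming a recent tidyr; the helper column name rn is only illustrative and the data are the asker's df rebuilt as a tibble.

```r
library(dplyr)
library(tidyr)

# The asker's data, rebuilt as a plain tibble
df <- tibble(
  I1 = c(3L, 2L, 2L, 3L, 2L, 3L, 3L, 2L),
  M1 = c(4L, 2L, 2L, 1L, 1L, 2L, 1L, 1L),
  M2 = c(5L, 2L, 2L, 3L, 3L, 4L, 3L, 3L),
  I2 = c(3L, 2L, 2L, 3L, 3L, 4L, 4L, 2L),
  M3 = c(5L, 1L, 2L, 1L, 1L, 3L, 1L, 3L)
)

result <- df %>%
  mutate(rn = row_number()) %>%            # illustrative integer row id
  pivot_longer(starts_with("M"),
               names_to = "key", values_to = "value") %>%
  group_by(rn) %>%
  summarise(Median = median(value)) %>%    # one median per original row
  select(-rn) %>%
  bind_cols(df, .)                         # attach Median to the original columns
```

Because rn is an integer, the summarised rows come back in the original row order, so bind_cols lines them up with df.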

Or another option is rowwise() from dplyr (hope the row is not a problem)

df %>% 
   rowwise() %>% 
   mutate(Median =  median(c(!!! rlang::syms(grep('M', names(.), value=TRUE)))))

edited Dec 12 '17 at 13:32

answered Dec 12 '17 at 13:04


Thanks, but a solution without a `row...()` function would be even better for me. – beginneR Dec 12 '17 at 13:09
@beginneR I thought you want a similar function like `rowMeans` etc as it was mentioned in the post – akrun Dec 12 '17 at 13:11
@beginneR Otherwise, you can do the `gather` way i.e. `df %>% rownames_to_column('rn') %>% gather(key, value, starts_with('M')) %>% group_by(rn) %>% summarise(Median = median(value)) %>% ungroup %>% select(-rn) %>% bind_cols(df, .)` – akrun Dec 12 '17 at 13:14
@AntoniosK Thank you for the comment. I thought it was the reverse. corrected – akrun Dec 12 '17 at 13:33
I hope that OP is not going to actually teach these approaches to beginners. No offence to you akrun, but dplyr is simply not made for this – talat Dec 12 '17 at 13:33
@docendodiscimus Sure, it is a bit tricky. The rowwise is a bit more concise than the second one, though, the use of `syms` and `!!!` could scare them – akrun Dec 12 '17 at 13:35

Frank · Answer 2 · 2021-04-16T18:02:24.703

Given a dataframe df with some numeric values:

df <- structure(list(X0 = c(0.82046171427112, 0.836224720981912, 0.842547521493854, 
0.848014287631906, 0.850943494153631, 0.85425398956647, 0.85616876970771, 
0.856855792247478, 0.857471048654811, 0.857507363153284, 0.874487063791594, 
1.70684558846347, 1.95711031206168, 6.84386713155156), X1 = c(0.755674148966666, 
0.765242580861224, 0.774422478168495, 0.776953642833977, 0.778128315184819, 
0.778611604461183, 0.778624581647491, 0.778454002430202, 1.52708579075974, 
13.0356519295685, 18.0590093408357, 21.1371199340156, 32.4192814934364, 
33.2355314147089), X2 = c(0.772236670327724, 0.788112332251601, 
0.797695511542613, 0.804257521548174, 0.809815828400878, 0.816592605516508, 
0.819421106011397, 0.821734473885381, 0.822561946509595, 0.822334970491528, 
0.822404634095793, 2.66875340820162, 1.40412743557514, 6.33377768022403
), X3 = c(0.764363881671609, 0.788288196346034, 0.79927498357549, 
0.805446784334039, 0.810604881970155, 0.814634331592811, 0.817002594424753, 
0.818129844752095, 0.818572101954132, 0.818630700031836, 3.06323952591121, 
6.4477868357554, 11.4657041958038, 9.27821049066848)), class = "data.frame", row.names = c(NA, 
-14L))

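One can compute the row-wise median with base R's sapply (the pipe %>% comes from dplyr/magrittr) like so:

library(dplyr)  # provides the %>% pipe used below

df$median <- sapply(
    seq_len(nrow(df)),  # seq_len() is safe even for a zero-row data frame
    function(i) df[i, 1:4] %>% unlist %>% median
)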

Above I select columns manually with numeric range, but to satisfy the dplyr requirement you can use dplyr::select() to choose your columns:

df$median <- sapply(
    df %>% nrow %>% seq, 
    function(i) df[i, ] %>% 
        dplyr::select(X1, X2) %>% 
        unlist %>% median
)

I like this method because the same pattern works for any summary function: only the last call in the pipe changes.

For example, standard deviation:

df$sd <- sapply(
    df %>% nrow %>% seq, 
    function(i) df[i, ] %>% 
        dplyr::select(X1, X2) %>% 
        unlist %>% sd
)
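
A more compact variant of the same idea, echoing the apply suggestion from the comments on the question, might look like the sketch below. It assumes a small hypothetical data frame named toy (not the df above) and picks the columns once with dplyr::select before handing them to base R's apply.

```r
library(dplyr)

# Small hypothetical data frame, for illustration only
toy <- data.frame(X1 = c(1, 4, 7), X2 = c(2, 5, 9), X3 = c(3, 6, 20))

# Select the columns once, then apply any summary function across each row
cols <- toy %>% select(X1, X2, X3)
toy$median <- apply(cols, 1, median)  # MARGIN = 1 means "by row"
toy$sd     <- apply(cols, 1, sd)
```

Note that apply converts the selected columns to a matrix first, so this only makes sense when they share one type (here, all numeric).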

score 0 · Answer 3 · answered Apr 07 '23 at 21:30

dplyr now includes the c_across function that works with rowwise to enable the use of select helpers, like starts_with, ends_with, all_of and where(is.numeric). This makes it very useful for median as well as max, min or custom functions. Examples below use the df provided by the original asker.

To use a preselected character vector containing column names:

useCols <- paste0("M", 1:3)
newDf <- df %>%
    rowwise() %>%
    mutate(med = median(c_across(all_of(useCols))))

Or to select columns programmatically using column names, combine with starts_with, ends_with, contains, matches and num_range:

newDf <- df %>%
    rowwise() %>%
    mutate(med = median(c_across(starts_with("M"))))

Or to select columns based on content, combine with where:

newDf <- df %>%
    rowwise() %>%
    mutate(med = median(c_across(where(~is.numeric(.x) && max(.x) == 5))))
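
One detail worth knowing when teaching this: a data frame stays row-wise after rowwise(), so later verbs keep operating one row at a time until ungroup() is called. A minimal self-contained sketch, assuming dplyr 1.0.0 or later and the asker's df rebuilt as a tibble:

```r
library(dplyr)

# The asker's data, rebuilt as a plain tibble
df <- tibble(
  I1 = c(3L, 2L, 2L, 3L, 2L, 3L, 3L, 2L),
  M1 = c(4L, 2L, 2L, 1L, 1L, 2L, 1L, 1L),
  M2 = c(5L, 2L, 2L, 3L, 3L, 4L, 3L, 3L),
  I2 = c(3L, 2L, 2L, 3L, 3L, 4L, 4L, 2L),
  M3 = c(5L, 1L, 2L, 1L, 1L, 3L, 1L, 3L)
)

newDf <- df %>%
  rowwise() %>%
  mutate(Median = median(c_across(starts_with("M")))) %>%
  ungroup()  # drop the row-wise grouping so later verbs act on whole columns
```

For this data the new Median column is 5, 2, 2, 1, 1, 3, 1, 3.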
