Count number of values in row using dplyr

Question

This question should have a simple, elegant solution but I can't figure it out, so here it goes:

Let's say I have the following dataset and I want to count the number of 2s present in each row using dplyr.

set.seed(1)
ID <- LETTERS[1:5]
X1 <- sample(1:5, 5, replace = TRUE)
X2 <- sample(1:5, 5, replace = TRUE)
X3 <- sample(1:5, 5, replace = TRUE)

df <- data.frame(ID, X1, X2, X3)
library(dplyr)

Now, the following works:

df %>%
  rowwise %>%
  mutate(numtwos = sum(c(X1,X2,X3) == 2))

But how do I avoid typing out all of the column names?

I know this is probably easier to do without dplyr, but more generally I want to know how I can use dplyr's mutate with multiple columns without typing out all the column names.

evan.oman · Accepted Answer · 2016-06-09T17:40:20.740

15

Try rowSums:

> set.seed(1)
> ID <- LETTERS[1:5]
> X1 <- sample(1:5, 5, replace = TRUE)
> X2 <- sample(1:5, 5, replace = TRUE)
> X3 <- sample(1:5, 5, replace = TRUE)
> df <- data.frame(ID, X1, X2, X3)
> df
  ID X1 X2 X3
1  A  2  5  2
2  B  2  5  1
3  C  3  4  4
4  D  5  4  2
5  E  2  1  4
> rowSums(df == 2)
[1] 2 1 0 1 1

Alternatively, with dplyr:

> df %>% mutate(numtwos = rowSums(. == 2))
  ID X1 X2 X3 numtwos
1  A  2  5  2       2
2  B  2  5  1       1
3  C  3  4  4       0
4  D  5  4  2       1
5  E  2  1  4       1
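
This works because comparing a data frame with `==` yields a logical matrix, and rowSums() counts each TRUE as 1. A minimal sketch of the intermediate step follows. It builds the data frame explicitly from the printed values, which is an assumption made here because R 3.6.0 changed the default sampling algorithm, so set.seed(1) does not reproduce these numbers on newer versions:

```r
# Data copied from the printed output above rather than regenerated with sample().
df <- data.frame(
  ID = LETTERS[1:5],
  X1 = c(2, 2, 3, 5, 2),
  X2 = c(5, 5, 4, 4, 1),
  X3 = c(2, 1, 4, 2, 4)
)

# Comparing the value columns with 2 gives a logical matrix, one cell per value.
is_two <- df[-1] == 2
print(is_two)

# rowSums() treats TRUE as 1 and FALSE as 0, so each row sum is a count of 2s.
numtwos <- rowSums(is_two)
print(numtwos)
```

Dropping the ID column with `df[-1]` is optional here, since an ID never equals 2, but it makes the intent explicit.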

edited Jun 09 '16 at 17:40

answered Jun 09 '16 at 16:59


I mentioned that I specifically want to know how to do this with dplyr, even if it isn't the best solution. – C_Z_ Jun 09 '16 at 17:18
@C_Z_ see my most recent edit, I think it is the shortest `dplyr` solution – evan.oman Jun 09 '16 at 17:39
How exactly does `.` work? Is it like `.SD` in `data.table`? – ytk Jun 09 '16 at 18:50
I think `.` is just a way to reference the `df` you are mutating – evan.oman Jun 09 '16 at 19:00
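
To expand on the `.` question: in a magrittr pipe, `.` stands for the object on the left-hand side of `%>%`, so here it is the whole data frame being piped into mutate(). It comes from the pipe rather than from dplyr's verbs. A small sketch, with data values copied from the output above, showing that the piped and unpiped forms agree:

```r
library(dplyr)

# Data copied from the printed output in the answer.
df <- data.frame(
  ID = LETTERS[1:5],
  X1 = c(2, 2, 3, 5, 2),
  X2 = c(5, 5, 4, 4, 1),
  X3 = c(2, 1, 4, 2, 4)
)

# Inside the pipe, `.` refers to the left-hand side, i.e. df itself.
piped <- df %>% mutate(numtwos = rowSums(. == 2))

# Equivalent call without the pipe, naming df explicitly.
unpiped <- mutate(df, numtwos = rowSums(df == 2))

print(identical(piped, unpiped))
```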

Steven Beaupré · Answer 2 · 2016-06-09T18:24:16.183

Here's another alternative using purrr:

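library(purrr)
# by_row() was later moved out of purrr into the purrrlyr package;
# on current versions, load it with library(purrrlyr) instead.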

df %>%
  by_row(function(x) {
    sum(x[-1] == 2) },
    .to = "numtwos",
    .collate = "cols"
  )

Which gives:

#Source: local data frame [5 x 5]
#
#      ID    X1    X2    X3 numtwos
#  <fctr> <int> <int> <int>   <int>
#1      A     2     5     2       2
#2      B     2     5     1       1
#3      C     3     4     4       0
#4      D     5     4     2       1
#5      E     2     1     4       1

As mentioned in the NEWS, row-based functionals are still maturing in dplyr:

We are still figuring out what belongs in dplyr and what belongs in purrr. Expect much experimentation and many changes with these functions.
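
Since by_row() is no longer part of current purrr releases, a row-wise count can also be sketched with pmap_int(), which calls a function once per row with the columns passed as arguments. This is an alternative to the code above, not the approach used in this answer, and the data values are copied from the question's output:

```r
library(purrr)

# Data copied from the printed output in the thread.
df <- data.frame(
  ID = LETTERS[1:5],
  X1 = c(2, 2, 3, 5, 2),
  X2 = c(5, 5, 4, 4, 1),
  X3 = c(2, 1, 4, 2, 4)
)

# pmap_int() walks over the rows of the selected columns; each call receives
# that row's X1, X2 and X3 values through `...`.
df$numtwos <- pmap_int(df[-1], function(...) sum(c(...) == 2))
print(df)
```

Like by_row(), this evaluates an R function once per row, so it is unlikely to compete with rowSums() on large data.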

Benchmark

We can see how rowwise() and do() compare to purrr::by_row() for this type of problem and how they "perform" against rowSums() and the tidy data way:

largedf <-  df[rep(seq_len(nrow(df)), 10e3), ]

library(tidyr)  # needed for gather() in the "gopala" expression
library(microbenchmark)
microbenchmark(
  steven = largedf %>% 
    by_row(function(x) { 
      sum(x[-1] == 2) }, 
      .to = "numtwos", 
      .collate = "cols"),
  psidom = largedf %>% 
    rowwise %>% 
    do(data_frame(numtwos = sum(.[-1] == 2))) %>% 
    cbind(largedf, .),
  gopala = largedf %>% 
    gather(key, value, -ID) %>% 
    group_by(ID) %>% 
    summarise(numtwos = sum(value == 2)) %>% 
    inner_join(largedf, .),
  evan   = largedf %>% 
    mutate(numtwos = rowSums(. == 2)),
  times  = 10L,
  unit   = "relative"
)

Results:

#Unit: relative
#   expr         min          lq        mean      median         uq         max neval cld
# steven 1225.190659 1261.466936 1267.737126 1227.762573 1276.07977 1339.841636    10  b 
# psidom 3677.603240 3759.402212 3726.891458 3678.717170 3728.78828 3777.425492    10   c
# gopala    2.715005    2.684599    2.638425    2.612631    2.59827    2.572972    10 a  
#   evan    1.000000    1.000000    1.000000    1.000000    1.00000    1.000000    10 a

Purrrfect indeed ;) Although from recent experiments `by_row()` is painfully slow for large datasets. – Steven Beaupré Jun 09 '16 at 17:48
@StevenBeaupré cool comparison! Thanks for putting that together! – evan.oman Jun 09 '16 at 19:05

score 6 · Answer 3 · answered Mar 08 '19 at 14:34

Just wanted to add to the answer of @evan.oman in case you only want to sum rows for specific columns, not all of them. You can use the regular select and/or select_helpers functions. In this example, we don't want to include X1 in rowSums:

df %>% 
  mutate(numtwos = rowSums(select(., -X1) == 2))

  ID X1 X2 X3 numtwos
1  A  2  5  2       1
2  B  2  5  1       0
3  C  3  4  4       0
4  D  5  4  2       1
5  E  2  1  4       0
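
The same idea extends to selection helpers. A hedged sketch, assuming dplyr 1.1.0 or later for pick() (older versions can keep using select(., ...) as above). The column names `twos_all_x` and `twos_no_x1` are made up for illustration, and the data values are copied from the question's output:

```r
library(dplyr)

# Data copied from the printed output in the thread.
df <- data.frame(
  ID = LETTERS[1:5],
  X1 = c(2, 2, 3, 5, 2),
  X2 = c(5, 5, 4, 4, 1),
  X3 = c(2, 1, 4, 2, 4)
)

out <- df %>%
  mutate(
    # Helper-based selection: every column whose name starts with "X".
    twos_all_x = rowSums(select(., starts_with("X")) == 2),
    # pick() selects columns from the current data inside mutate().
    twos_no_x1 = rowSums(pick(X2:X3) == 2)
  )
print(out)
```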

score 2 · Answer 4 · answered Jun 09 '16 at 17:32

One approach is to use a combination of dplyr and tidyr to convert data into long format, and do the computation:

library(dplyr)
library(tidyr)
df %>%
  gather(key, value, -ID) %>%
  group_by(ID) %>%
  summarise(numtwos = sum(value == 2)) %>%
  inner_join(df, .)

Output is as follows:

  ID X1 X2 X3 numtwos
1  A  2  5  2       2
2  B  2  5  1       1
3  C  3  4  4       0
4  D  5  4  2       1
5  E  2  1  4       1
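
In current tidyr, gather() is superseded by pivot_longer(). A sketch of the same long-format approach, assuming tidyr 1.0.0 or later; the data values are copied from the question's output, and `by = "ID"` is spelled out to avoid the join message:

```r
library(dplyr)
library(tidyr)

# Data copied from the printed output in the thread.
df <- data.frame(
  ID = LETTERS[1:5],
  X1 = c(2, 2, 3, 5, 2),
  X2 = c(5, 5, 4, 4, 1),
  X3 = c(2, 1, 4, 2, 4)
)

# Reshape to one row per (ID, column) pair, then count the 2s per ID.
counts <- df %>%
  pivot_longer(-ID, names_to = "key", values_to = "value") %>%
  group_by(ID) %>%
  summarise(numtwos = sum(value == 2))

# Attach the counts back to the original wide data.
result <- inner_join(df, counts, by = "ID")
print(result)
```

This relies on ID being unique per row; with repeated IDs the counts would be pooled across the rows that share an ID.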

Psidom · Answer 5 · 2016-06-09T17:37:18.110

1

You can use do, but it returns only the new column rather than adding it to your original data frame, so you need to bind the result back yourself.

df %>%
    rowwise %>%
    do(numtwos = sum(.[-1] == 2)) %>% 
    data.frame

  numtwos
1       2
2       1
3       0
4       1
5       1

Add a cbind to bind the new column to the original data frame:

df %>%
     rowwise %>%
     do(numtwos = sum(.[-1] == 2)) %>% 
     data.frame %>% cbind(df, .)

  ID X1 X2 X3 numtwos
1  A  2  5  2       2
2  B  2  5  1       1
3  C  3  4  4       0
4  D  5  4  2       1
5  E  2  1  4       1
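
For readers on newer versions: dplyr 1.0.0 added c_across(), which lets the rowwise() attempt from the question select columns by range instead of listing them. A minimal sketch under that version assumption, with data values copied from the question's output:

```r
library(dplyr)

# Data copied from the printed output in the thread.
df <- data.frame(
  ID = LETTERS[1:5],
  X1 = c(2, 2, 3, 5, 2),
  X2 = c(5, 5, 4, 4, 1),
  X3 = c(2, 1, 4, 2, 4)
)

out <- df %>%
  rowwise() %>%
  # c_across() gathers the selected columns of the current row into one vector.
  mutate(numtwos = sum(c_across(X1:X3) == 2)) %>%
  # Drop the row-wise grouping so later verbs behave as usual.
  ungroup()
print(out)
```

Row-wise evaluation still runs once per row, so the rowSums() approach is likely to remain faster on large data.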

edited Jun 09 '16 at 17:37

answered Jun 09 '16 at 17:33


Thanks, I was hoping `dplyr` had a cleaner way to do this. Oh well! – C_Z_ Jun 09 '16 at 17:37
Rowwise operation is always kind of painful both in `dplyr` and `data.table` since the data is column-wise stored from my understanding. – Psidom Jun 09 '16 at 17:38
@Arun, Thanks for clarifying. That's what I am guessing too. – Psidom Jun 09 '16 at 17:50
