
I have data with several line IDs per time and with `-Inf` values, and I would like to use the R packages dplyr and tidyverse to calculate the average number of `-Inf` values per ID per time (i.e. the proportion of `-Inf` across columns).

This is my data:

dt <- data.frame(id = rep(1:3, each = 4), 
                 time = rep(1:4, times = 3), 
                 x = c(1, 2, 1, -Inf, 2, -Inf, 1, 1, 5, 1, 2, 1), 
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2)) 

In the real data I have more than 100 columns, but to simplify I included only x and y.

The expected result:

  id time   n
2  1    2 0.5
3  1    3 0.5
4  1    4 1.0
5  2    1 0.5
6  2    2 0.5
7  2    3 0.5

The idea is to use some specific columns to generate a vector according to a specific calculation function. After searching I found the `rowwise()` function, but it did not help. Here are my attempts:

dt %>%
  group_by(id,time) %>%
  summarise(n = across(x:y, ~mean(is.infinite(x) & x < 0, na.rm=TRUE)))

dt %>%
  group_by(id,time) %>% 
  rowwise() %>%
  summarise(n = across(everything(), ~mean(is.infinite(x) & x < 0, na.rm=TRUE)))

dt %>%
  rowwise() %>%
  summarise(n = across(everything(), ~mean(is.infinite(x) & x < 0, na.rm=TRUE)))
 
# same results:
`summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
# A tibble: 12 x 3
# Groups:   id [3]
      id  time   n$x    $y
   <int> <int> <dbl> <dbl>
 1     1     1     0     0
 2     1     2     0     0
 3     1     3     0     0
 4     1     4     1     1
 5     2     1     0     0
 6     2     2     1     1
 7     2     3     0     0
 8     2     4     0     0
 9     3     1     0     0
10     3     2     0     0
11     3     3     0     0
12     3     4     0     0

Could you help me to generate this vector n?

Ph.D.Student
  • I'm not sure that's possible from your data. Your 12 rows are 12 unique combinations of id and time. So for every subgroup there's one value, and it's either infinite (giving 1) or not (giving 0) – Andy Baxter Nov 14 '22 at 16:03
  • Apologies, I see now what you are aiming for - `x` and `y` are two observations and you're averaging over both? See if code below helps. – Andy Baxter Nov 14 '22 at 16:14

2 Answers


I think I understand better what you're aiming to do here. `across()` isn't needed (as it's more for modifying columns in place). Either `rowwise()` or `group_by()` would work:

library(dplyr)

dt <- data.frame(id = rep(1:3, each = 4), 
                 time = rep(1:4, times = 3), 
                 x = c(1, 2, 1, -Inf, 2, -Inf,1, 1, 5, 1, 2, 1), 
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2)) 

dt %>% 
  group_by(id, time) %>% 
  summarise(n = mean(c(is.infinite(x), is.infinite(y)))) %>% 
  filter(n != 0)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 × 3
#> # Groups:   id [2]
#>      id  time     n
#>   <int> <int> <dbl>
#> 1     1     2   0.5
#> 2     1     3   0.5
#> 3     1     4   1  
#> 4     2     1   0.5
#> 5     2     2   0.5
#> 6     2     3   0.5

Here's a possible way of doing the calculation across any number of columns after grouping (by making a quick function to check for negative infinite values):

library(dplyr)

dt <- data.frame(id = rep(1:3, each = 4), 
                 time = rep(1:4, times = 3), 
                 x = c(1, 2, 1, -Inf, 2, -Inf,1, 1, 5, 1, 2, 1), 
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2),
                 z = sample(c(1, 2, -Inf), 12, replace = TRUE)) 

is_minus_inf <- function(x) is.infinite(x) & x < 0

dt %>% 
  group_by(id, time) %>% 
  mutate(n = mean(is_minus_inf(c_across(everything()))))
#> # A tibble: 12 × 6
#> # Groups:   id, time [12]
#>       id  time     x     y     z     n
#>    <int> <int> <dbl> <dbl> <dbl> <dbl>
#>  1     1     1     1     2     2 0    
#>  2     1     2     2  -Inf  -Inf 0.667
#>  3     1     3     1  -Inf     2 0.333
#>  4     1     4  -Inf  -Inf     1 0.667
#>  5     2     1     2  -Inf     1 0.333
#>  6     2     2  -Inf     5     2 0.333
#>  7     2     3     1  -Inf  -Inf 0.667
#>  8     2     4     1     2     2 0    
#>  9     3     1     5     1     1 0    
#> 10     3     2     1     2     1 0    
#> 11     3     3     2     2     2 0    
#> 12     3     4     1     2  -Inf 0.333

(Or even simpler, use `mutate(n = mean(c_across(everything()) == -Inf, na.rm = TRUE))` and no new checking function is needed.)
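Spelled out as a runnable sketch (same `dt` as above), the `== -Inf` comparison makes the helper function unnecessary:

```r
library(dplyr)

dt <- data.frame(id = rep(1:3, each = 4),
                 time = rep(1:4, times = 3),
                 x = c(1, 2, 1, -Inf, 2, -Inf, 1, 1, 5, 1, 2, 1),
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2))

res <- dt %>%
  group_by(id, time) %>%
  # c_across() gathers the non-grouping columns within the current group;
  # comparing to -Inf gives logicals, and mean() turns them into a proportion
  mutate(n = mean(c_across(everything()) == -Inf, na.rm = TRUE)) %>%
  ungroup()

res$n
#> [1] 0.0 0.5 0.5 1.0 0.5 0.5 0.5 0.0 0.0 0.0 0.0 0.0
```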

Andy Baxter
  • Thank you @AndyBaxter for your help! Can we improve the code to cover more columns (100+)? In the example above I gave only the two columns `x` and `y` to simplify. – Ph.D.Student Nov 14 '22 at 19:08
  • Hello @Ph.D.Student - I've tried a simple way of doing all (non-grouped) columns by `c_across(everything())`. Added a `z` column as an example to show what's possible. You can replace `mutate` with `transmute` if you want to drop the original columns. Hope that helps! – Andy Baxter Nov 14 '22 at 19:24
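To illustrate the `transmute` suggestion from the comment above, here is a hedged sketch against the same `dt` (the random `z` column is omitted so the output is deterministic):

```r
library(dplyr)

dt <- data.frame(id = rep(1:3, each = 4),
                 time = rep(1:4, times = 3),
                 x = c(1, 2, 1, -Inf, 2, -Inf, 1, 1, 5, 1, 2, 1),
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2))

is_minus_inf <- function(v) is.infinite(v) & v < 0

res <- dt %>%
  group_by(id, time) %>%
  # transmute() works like mutate() but keeps only the grouping
  # columns plus the newly created n, dropping x and y
  transmute(n = mean(is_minus_inf(c_across(everything())))) %>%
  ungroup()

res$n
#> [1] 0.0 0.5 0.5 1.0 0.5 0.5 0.5 0.0 0.0 0.0 0.0 0.0
```

The result has only the `id`, `time`, and `n` columns, which matches the shape of the expected output in the question.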

How about this solution? It appears to give the desired output and is scalable.
First I "melt" the columns x and y, then summarise over id and time:

dt %>% 
   reshape2::melt(id.vars = c("id", "time")) %>% 
   group_by(id, time) %>% 
   summarise(count_neg_inf = mean(value == -Inf, na.rm = TRUE))
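A note on dependencies: reshape2 is retired, so if you prefer to stay within the tidyverse, the same melt-then-summarise idea can be sketched with `tidyr::pivot_longer()`:

```r
library(dplyr)
library(tidyr)

dt <- data.frame(id = rep(1:3, each = 4),
                 time = rep(1:4, times = 3),
                 x = c(1, 2, 1, -Inf, 2, -Inf, 1, 1, 5, 1, 2, 1),
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2))

res <- dt %>%
  # stack x and y into long format: one (id, time, variable, value) row per cell
  pivot_longer(cols = -c(id, time), names_to = "variable", values_to = "value") %>%
  group_by(id, time) %>%
  # proportion of -Inf values per id/time group
  summarise(count_neg_inf = mean(value == -Inf, na.rm = TRUE), .groups = "drop")

res$count_neg_inf
#> [1] 0.0 0.5 0.5 1.0 0.5 0.5 0.5 0.0 0.0 0.0 0.0 0.0
```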


SamuelR
  • Thank you for your help, I used your solution and modified it slightly: `summarise(count_neg_inf = mean(is.infinite(value) & value < 0, na.rm = TRUE)) %>% filter(count_neg_inf != 0)` – Ph.D.Student Nov 14 '22 at 16:45