8

I have a data frame that has similar structure to the following:

set.seed(123)  
df<-data_frame(SectionName = rep(letters[1:2], 50),
               TimeSpentSeconds = sample(0:360, 100, replace = TRUE),
               Correct = sample(0:1, 100, replace = TRUE))

I want summarise this data frame by taking all values of TimeSpentSeconds that fall into certain ranges (less than 30, between 30-60, between 60-90, ..., greater than 180), label the times as those ranges, group them by SectionName, and find the sum of the Correct column so that the resulting data frame looks (something) like this:

    TimeGroup             SectionName Correct
   <fct>                 <chr>         <int>
 1 LessThan30Secs        a                 2
 2 LessThan30Secs        b                 3
 3 30-60 Seconds         a                 4
 4 30-60 Seconds         b                 3
 5 60-90 Seconds         a                 2
 6 60-90 Seconds         b                 3
 7 90-120 Seconds        a                 4
 8 90-120 Seconds        b                 0
 9 120-150 Seconds       a                 4
10 120-150 Seconds       b                 0
11 150-180 Seconds       a                 1
12 150-180 Seconds       b                 2
13 GreaterThan180Seconds a                11
14 GreaterThan180Seconds b                11

I was able to successfully do this with the following if-else code where I mutated all of the times into a new column with the appropriate label, grouped, and summarised:

x <- c("LessThan30Secs", "30-60 Seconds", "60-90 Seconds","90-120 Seconds", 
           "120-150 Seconds", "150-180 Seconds", "GreaterThan180Seconds") 

df %>% 
mutate(TimeGroup = if_else(TimeSpentSeconds >= 0 & TimeSpentSeconds <= 30, "LessThan30Secs",
                if_else(TimeSpentSeconds > 30 & TimeSpentSeconds <= 60, "30-60 Seconds",
                if_else(TimeSpentSeconds > 60 & TimeSpentSeconds <= 90, "60-90 Seconds",
                if_else(TimeSpentSeconds > 90 & TimeSpentSeconds <= 120, "90-120 Seconds",
                if_else(TimeSpentSeconds > 120 & TimeSpentSeconds <= 150, "120-150 Seconds", 
                if_else(TimeSpentSeconds > 150 & TimeSpentSeconds <= 180, "150-180 Seconds",
                if_else(TimeSpentSeconds > 180, "GreaterThan180Seconds", "")))))))) %>%
    mutate(TimeGroup = factor(TimeGroup, levels = x)) %>%
    arrange(TimeGroup) %>%
    group_by(TimeGroup, SectionName) %>%
    summarise(Correct = sum(Correct))

But, there just has to be a better way to do this. I considered writing a function, but didn't get very far as I'm not great at function writing.

Does anyone have any ideas on a more elegant way to accomplish this same output through a dplyr method I didn't think of, writing a custom function maybe utilizing the purrr package at some point, or some other r function?

Carl
  • 4,232
  • 2
  • 12
  • 24
KyleF
  • 115
  • 1
  • 1
  • 9
  • 1
    Instead of using `if_else`, try with `cut` or `findInterval` i.e. `df %>% group_by(TimeGroup = cut(TimeSpentSeconds, breaks = c(seq(0, 180, by = 30), Inf)), SectionName) %>% summarise(Correct = sum(Correct))` – akrun Aug 26 '18 at 17:41
  • 2
    It is better to have a `set.seed` to make this reproducible – akrun Aug 26 '18 at 17:44
  • 1
    Though in this case `findInterval` is the better solution, if the intervals were not perfectly-consecutive then it would be more appropriate to use `dplyr::case_when`. – r2evans Aug 26 '18 at 19:06

3 Answers3

25

case_when() will do what you want. Its a tidy alternative to nested ifelse() statements.

library(dplyr)

mutate(df,
       TimeGroup = case_when(
         TimeSpentSeconds <= 30 ~ "30 Seconds or less",
         TimeSpentSeconds <= 60 ~ "31-60 Seconds",
         TimeSpentSeconds <= 90 ~ "61-90 Seconds",
         TimeSpentSeconds <= 120 ~ "91-120 Seconds",
         TimeSpentSeconds <= 150 ~ "121-150 Seconds", 
         TimeSpentSeconds <= 180 ~ "151-180 Seconds",
         TimeSpentSeconds > 180 ~ "Greater Than 180 Seconds",
         TRUE ~ NA_character_)
)

The last argument is a catch all for records that don't fit any of the criteria, such as if time is somehow less than 0 seconds.

Paul
  • 2,877
  • 1
  • 12
  • 28
  • Beware that all "formulas" (after the `~`) must be of the same type. Thanks! – Lionel Trebuchon Jun 11 '19 at 20:21
  • Thanks, I've fixed the TRUE case. – Paul Jun 12 '19 at 01:14
  • 1
    It does not need to compare every time, a statement that has been rejected. For example, in the second statement, you say `TimeSpentSeconds > 30`, but in order to check it, we have passed `TimeSpentSeconds <= 30`, so it must be bigger than 30. – LoukasPap Feb 23 '21 at 16:06
4

We can do this easily with cut (or findInterval) instead of multiple nested ifelse statements

lbls <- c('LessThan30secs', '30-60 Seconds', '60-90 Seconds', 
    '90-120 Seconds', '120-150 Seconds', '150-180 Seconds', 'GreaterThan180Seconds')
df %>% 
    group_by(TimeGroup = cut(TimeSpentSeconds, 
                 breaks = c(seq(0, 180, by = 30), Inf), labels = lbls), 
            SectionName) %>%
   summarise(Correct = sum(Correct)) %>%
   na.omit
akrun
  • 874,273
  • 37
  • 540
  • 662
  • This is great! Just one issue. When I place it in my actual data, I get different values for the two groups for "LessThan30secs". After investigating, I've found that this code does not include in the "LessThan30secs" group anything that has df$TimeSpentSeconds == 0. Is there a way to use cut to include the 0? – KyleF Sep 03 '18 at 21:25
1
``` r
library(tidyverse)
set.seed(123)
df<-data_frame(SectionName = rep(letters[1:2], 50),
               TimeSpentSeconds = sample(0:360, 100, replace = TRUE),
               Correct = sample(0:1, 100, replace = TRUE))

time_spent_range <- function(value, start, end, interval) {
  end <- end + (end%%interval) # make sure the end value is divisible by the interval
  bins_start <- seq(start, end - interval, by = interval)
  bins_end <- seq(start + interval, end, by = interval)
  bins_tibble <- tibble(bin_start = bins_start, 
                        bin_end = bins_end) %>% 
    mutate(in_bin = if_else((value > bin_start|(value == 0 & bin_start == 0)) 
                            & value <= bin_end,
                            1,
                            0)) %>% 
    filter(in_bin == 1)
  bin <- paste0(as.character(bins_tibble$bin_start[1]), 
                '-', 
                as.character(bins_tibble$bin_end[1]),
                ' Seconds')
  return(bin)
}

df %>% 
  mutate(TimeGroup = map_chr(TimeSpentSeconds, time_spent_range, start = 0, end = max(df$TimeSpentSeconds) , interval = 30))
#> # A tibble: 100 x 4
#>    SectionName TimeSpentSeconds Correct TimeGroup      
#>    <chr>                  <int>   <int> <chr>          
#>  1 a                        103       1 90-120 Seconds 
#>  2 b                        284       0 270-300 Seconds
#>  3 a                        147       0 120-150 Seconds
#>  4 b                        318       1 300-330 Seconds
#>  5 a                        339       0 330-360 Seconds
#>  6 b                         16       1 0-30 Seconds   
#>  7 a                        190       1 180-210 Seconds
#>  8 b                        322       1 300-330 Seconds
#>  9 a                        199       0 180-210 Seconds
#> 10 b                        164       0 150-180 Seconds
#> # ... with 90 more rows
```

Created on 2018-08-26 by the reprex package (v0.2.0).