Group persons based on year of birth in R

Question

I have the following dataset

df<- data.frame(x1=c(1,5,7,8,2,2,3,4,5,10),
birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994))

I want to group persons into 3-year intervals, so that persons born in 1992-1994 are in group 1, those born in 1995-1997 are in group 2, and so on. My real dataset is far larger, with over 10,000 entries. What is the most efficient way to do this?

You could just use a nested ifelse: `ifelse(dplyr::between(birthyear, 1992, 1994), 1, ifelse(dplyr::between(birthyear, 1995, 1997), 2, ifelse(...)))` - D.J, May 03 '22 at 09:31

score 10 · Accepted Answer · answered May 03 '22 at 09:33

I would simply use cut with breaks defined with seq:

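df$group <- cut(df$birthyear,
                seq(1992, 2022, 3),  # breaks: 1992, 1995, ..., 2022
                labels = FALSE,      # return integer group codes, not factor labels
                right = FALSE)       # left-closed intervals: [1992, 1995), ...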
df

Output:

#>    x1 birthyear group
#> 1   1      1992     1
#> 2   5      1994     1
#> 3   7      1993     1
#> 4   8      1992     1
#> 5   2      1995     2
#> 6   2      1999     3
#> 7   3      2000     3
#> 8   4      2001     4
#> 9   5      2000     3
#> 10 10      1994     1

Created on 2022-05-03 by the reprex package (v2.0.1)
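
If the range of years is not known in advance, the breaks could be derived from the data instead of being hard-coded. A minimal sketch, assuming 3-year bins that start at the earliest birth year; the names `width` and `breaks` are illustrative and not part of the answer above:

```r
birthyear <- c(1992, 1994, 1993, 1992, 1995, 1999, 2000, 2001, 2000, 1994)

width <- 3  # assumed bin width in years
# Extend the sequence one bin past the maximum so the latest year
# still falls inside a left-closed interval [lower, upper).
breaks <- seq(min(birthyear), max(birthyear) + width, by = width)

group <- cut(birthyear, breaks, labels = FALSE, right = FALSE)
group
#> [1] 1 1 1 1 2 3 3 4 3 1
```

Without the extra bin, a maximum year that coincides with the last break would get NA, because `right = FALSE` leaves the upper end of each interval open.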

score 1 · Answer 2 · answered May 03 '22 at 09:29

Here is a rather manual approach using case_when, where you define the span of years for each group. When using case_when, you define a condition, e.g. birthyear > 1991 & birthyear < 1995, and the outcome using a tilde ~, e.g. ~ 1.

library(dplyr)

df<- data.frame(x1=c(1,5,7,8,2,2,3,4,5,10),
                birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994))

df %>% 
  mutate(group = case_when(
    # each condition covers one 3-year span
    birthyear > 1991 & birthyear < 1995 ~ 1,
    birthyear > 1994 & birthyear < 1998 ~ 2,
    birthyear > 1997 & birthyear < 2001 ~ 3,
    birthyear > 2000 & birthyear < 2004 ~ 4
  ))

#>    x1 birthyear group
#> 1   1      1992     1
#> 2   5      1994     1
#> 3   7      1993     1
#> 4   8      1992     1
#> 5   2      1995     2
#> 6   2      1999     3
#> 7   3      2000     3
#> 8   4      2001     4
#> 9   5      2000     3
#> 10 10      1994     1

Created on 2022-05-03 by the reprex package (v0.3.0)

GKi · Answer 3 · 2022-05-04T04:35:12.163

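Using integer division %/% might be an efficient way. The offset 1989 is the first year minus the interval width (1992 - 3), so that 1992-1994 map to group 1.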

df$group <- (df$birthyear - 1989L) %/% 3L
df
#   x1 birthyear group
#1   1      1992     1
#2   5      1994     1
#3   7      1993     1
#4   8      1992     1
#5   2      1995     2
#6   2      1999     3
#7   3      2000     3
#8   4      2001     4
#9   5      2000     3
#10 10      1994     1

To start from the lowest birthyear:

(df$birthyear - min(df$birthyear) + 3L) %/% 3L
# [1] 1 1 1 1 2 3 3 4 3 1
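
The two variants above differ only in the offset that is subtracted. As a rough sketch, the same arithmetic could be wrapped in a small helper; `year_group`, `start`, and `width` are hypothetical names introduced here, not part of the answer:

```r
# Hypothetical helper: maps years to consecutive bins of `width` years.
year_group <- function(year, start = min(year), width = 3L) {
  # Years in [start, start + width) map to 1, the next `width` years to 2, ...
  (year - start) %/% width + 1L
}

birthyear <- c(1992, 1994, 1993, 1992, 1995, 1999, 2000, 2001, 2000, 1994)

year_group(birthyear)
#> [1] 1 1 1 1 2 3 3 4 3 1

# A different origin shifts the bins: 1990-1992, 1993-1995, ...
year_group(birthyear, start = 1990)
#> [1] 1 2 2 1 2 4 4 4 4 2
```

Years earlier than `start` would get group numbers of 0 or below, so a fixed `start` should not be later than the earliest year in the data.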

If the range should be limited, pmin and pmax can be used to clamp the years to fixed bounds.

(pmax(1989L, pmin(2023L, df$birthyear)) - 1989L) %/% 3L
# [1] 1 1 1 1 2 3 3 4 3 1

findInterval could also be used.

findInterval(df$birthyear, seq(1992, 2022, 3))
# [1] 1 1 1 1 2 3 3 4 3 1
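
One difference from cut that may matter on real data: with the same breaks, years outside the range do not become NA under findInterval. A small sketch with three made-up years (the vector `outside` is illustrative only):

```r
breaks <- seq(1992, 2022, 3)
outside <- c(1990, 2022, 2025)  # made-up years outside [1992, 2022)

# findInterval returns 0 below the first break and length(breaks) at or above the last
fi <- findInterval(outside, breaks)
fi
#> [1]  0 11 11

# cut with left-closed intervals returns NA for all three
ct <- cut(outside, breaks, labels = FALSE, right = FALSE)
ct
#> [1] NA NA NA
```

Which behaviour is preferable depends on whether out-of-range years should be flagged or silently assigned to an edge group.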

Benchmark:

set.seed(42)
x <- sample(1992:2021, 10001, TRUE)
bench::mark(
         "cut" = cut(x, seq(1992, 2022, 3), labels = F, right = F),
         "findInterval" = findInterval(x, seq(1992, 2022, 3)),
         "%/%pminMax" = (pmax(1989L, pmin(2023L, x)) - 1989L) %/% 3L,
         "%/%min" = (x - min(x) + 3L) %/% 3L,
         "%/%" = (x - 1989L) %/% 3L
         )
#  expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 cut             219µs  223.9µs     3875.   117.3KB     8.17  1898     4
#2 findInterval  143.2µs  148.9µs     6450.   117.3KB    13.6   2855     6
#3 %/%pminMax     75.2µs   77.7µs    12263.   117.4KB    27.3   5835    13
#4 %/%min         53.7µs   54.1µs    18153.    39.1KB    12.3   8852     6
#5 %/%            35.5µs   35.9µs    27166.    39.1KB    19.0   9993     7
