Is there a dplyr equivalent to data.table::rleid?

Question

data.table offers a nice convenience function, rleid(), for run-length encoding:

library(data.table)
DT = data.table(grp=rep(c("A", "B", "C", "A", "B"), c(2, 2, 3, 1, 2)), value=1:10)
rleid(DT$grp)
# [1] 1 1 2 2 3 3 3 4 5 5

I can mimic this in base R with:

df <- data.frame(DT)
rep(seq_along(rle(df$grp)$values), times = rle(df$grp)$lengths)
# [1] 1 1 2 2 3 3 3 4 5 5

Does anyone know of a dplyr equivalent? Or is the "best" way to get the rleid behavior with dplyr something like the following?

library(dplyr)

my_rleid = rep(seq_along(rle(df$grp)$values), times = rle(df$grp)$lengths)

df %>%
  mutate(rleid = my_rleid)

Dplyr is compatible with data.table. If, for some reason, you don't want to load data.table, I think your base solution is good. You could try filing a feature request with dplyr, but I'd say the odds of a good reception are no better than 50/50. — Frank, Nov 03 '15 at 20:05
`cumsum(c(1L, df$grp[-nrow(df)] != df$grp[-1]))` also for base — rawr, Apr 28 '16 at 20:11
Consider changing the accepted answer to this - https://stackoverflow.com/a/74428002/680068 - dplyr now has dedicated function: consecutive_id — zx8754, Feb 02 '23 at 12:59
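
As a rough illustration of the base R one-liner in rawr's comment, the sketch below compares each element with its predecessor and takes a cumulative sum of the changes, then checks the result against the rle()-based version from the question. The names `rle_based` and `cumsum_based` are made up for this example, and the vector is assumed to be non-empty and free of NAs.

```r
# Same grouping vector as in the question
grp <- rep(c("A", "B", "C", "A", "B"), c(2, 2, 3, 1, 2))

# rle()-based version: one id per run, repeated by the run length
runs <- rle(grp)
rle_based <- rep(seq_along(runs$values), times = runs$lengths)

# cumsum()-based version: start at 1, add 1 wherever an element
# differs from the one before it
cumsum_based <- cumsum(c(1L, grp[-length(grp)] != grp[-1]))

cumsum_based
# [1] 1 1 2 2 3 3 3 4 5 5
identical(rle_based, cumsum_based)
# [1] TRUE
```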

Jaap · Answer 1 · 2022-02-13T09:08:48.823

31

You can just do (when you have both data.table and dplyr loaded):

DT <- DT %>% mutate(rlid = rleid(grp))

this gives:

> DT
    grp value rlid
 1:   A     1    1
 2:   A     2    1
 3:   B     3    2
 4:   B     4    2
 5:   C     5    3
 6:   C     6    3
 7:   C     7    3
 8:   A     8    4
 9:   B     9    5
10:   B    10    5

When you don't want to load data.table separately you can also use (as mentioned by @DavidArenburg in the comments):

DT <- DT %>% mutate(rlid = data.table::rleid(grp))

And as @RichardScriven said in his comment you can just copy/steal it:

myrleid <- data.table::rleid

edited Feb 13 '22 at 09:08

answered Nov 03 '15 at 20:04

Agreed, but I'm looking to avoid the call to `data.table::rleid` if possible. – JasonAizkalns Nov 03 '15 at 20:05
@JasonAizkalns Why? If I may ask? – Jaap Nov 03 '15 at 20:06
To stay entirely in `dplyr`, `tidyr`, hadley-verse land. – JasonAizkalns Nov 03 '15 at 20:09
Steal it ... `myrleid <- data.table::rleid` – Rich Scriven Nov 03 '15 at 20:09
@RichardScriven that's likely what I'll resort to, but I'm seeing if anyone else has other ideas. Another reason is to stay in one "paradigm" for teaching/education purposes and avoid introducing too many packages to new users. – JasonAizkalns Nov 03 '15 at 20:11
@JasonAizkalns If you are only going to use the hadley-verse, then you will limit yourself very much imo. – Jaap Nov 03 '15 at 20:14
Not trying to start any debates/wars, I think I'll accept @Jaap before this gets out of hand... – JasonAizkalns Nov 03 '15 at 20:17
This works like a champ with shift() also, for which there's not a dplyr equivalent without a bunch of ugly code. – TheProletariat Aug 17 '17 at 15:19
@TheProletariat True, but it would look very similar to Alex's answer. – Jaap Aug 19 '17 at 11:51

Josh O'Brien · Answer 2 · 2015-11-03T22:23:02.350

26

If you want to use just base R and dplyr, a better way is to wrap your own one- or two-line version of rleid() in a function and then apply it whenever you need it.

library(dplyr)

myrleid <- function(x) {
    x <- rle(x)$lengths
    rep(seq_along(x), times=x)
}

## Try it out
DT <- DT %>% mutate(rlid = myrleid(grp))
DT
#   grp value rlid
# 1:   A     1    1
# 2:   A     2    1
# 3:   B     3    2
# 4:   B     4    2
# 5:   C     5    3
# 6:   C     6    3
# 7:   C     7    3
# 8:   A     8    4
# 9:   B     9    5
#10:   B    10    5

edited Nov 03 '15 at 22:23

answered Nov 03 '15 at 22:04

Small note: `rleid()` is designed to work with lists/data.frames/data.tables as well, e.g., `rleid(c(1,1,1,2,2,2), c(3,4,4,5,5,6))`. Nothing special about implementing it, but just to note the difference. – Arun Nov 03 '15 at 22:13
@Arun Should `data.table::rleid(mtcars)` work? (It doesn't, for me, though its help file would lead me to believe that it should...) – Josh O'Brien Nov 04 '15 at 00:47
Yes, but it's `rleidv(mtcars)` (the SE version). `rleid()` takes `...` as input -- so we'll have to provide each column separately.. (for interactive cases). – Arun Nov 04 '15 at 00:49
Watch out: with `NA` values this solution does not give the same result as `data.table::rleid`. Check out `x <- c(1,1,1,NA,NA,2,2); myrleid(x); data.table::rleid(x)`. `rle` treats each `NA` as its own run. – Edo Nov 17 '20 at 15:03
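
To make the NA difference from Edo's comment concrete, here is a base R sketch. `rleid_na` is a hypothetical helper, not part of any package. It assumes a single atomic vector and treats adjacent NAs as one run, which is how the thread describes the data.table::rleid() behavior.

```r
# rle()-based helper from the answer above: each NA becomes its own run
myrleid <- function(x) {
  x <- rle(x)$lengths
  rep(seq_along(x), times = x)
}

# Hypothetical NA-aware variant: two adjacent NAs count as "no change"
rleid_na <- function(x) {
  n <- length(x)
  if (n == 0L) return(integer(0))
  prev <- x[-n]
  curr <- x[-1L]
  # A change happens when exactly one side is NA,
  # or when both sides are non-missing and differ
  changed <- (is.na(prev) != is.na(curr)) |
    (!is.na(prev) & !is.na(curr) & prev != curr)
  cumsum(c(1L, changed))
}

x <- c(1, 1, 1, NA, NA, 2, 2)
myrleid(x)
# [1] 1 1 1 2 3 4 4
rleid_na(x)
# [1] 1 1 1 2 2 3 3
```

The explicit `is.na()` checks keep the comparison from ever returning NA, so `cumsum()` has nothing to propagate.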

Alex · Answer 3 · 2023-04-19T07:21:50.190

10

You can do it using the lag function from dplyr.

DT <-
    DT %>%
    mutate(rleid = (grp != lag(grp, 1, default = "asdf"))) %>%
    mutate(rleid = cumsum(rleid))

gives

> DT
    grp value rleid
 1:   A     1     1
 2:   A     2     1
 3:   B     3     2
 4:   B     4     2
 5:   C     5     3
 6:   C     6     3
 7:   C     7     3
 8:   A     8     4
 9:   B     9     5
10:   B    10     5

edited Apr 19 '23 at 07:21

answered Nov 03 '15 at 23:04

tmfmnk · Answer 4 · 2019-05-25T09:13:18.677

A simplification (involving no additional package) of the approach used by the OP could be:

DT %>%
 mutate(rleid = with(rle(grp), rep(seq_along(lengths), lengths)))

   grp value rleid
1    A     1     1
2    A     2     1
3    B     3     2
4    B     4     2
5    C     5     3
6    C     6     3
7    C     7     3
8    A     8     4
9    B     9     5
10   B    10     5

Or:

DT %>%
 mutate(rleid = rep(seq(ls <- rle(grp)$lengths), ls))

Ritchie Sacramento · Accepted Answer · 2023-02-02T13:27:57.070

9

As of v1.1.0, dplyr provides the function consecutive_id(), modeled after data.table::rleid(), with the same support for multiple vectors and the same treatment of NA values.

 library(dplyr)
 
 DT %>%
   mutate(id = consecutive_id(grp)) 

    grp value id
 1:   A     1  1
 2:   A     2  1
 3:   B     3  2
 4:   B     4  2
 5:   C     5  3
 6:   C     6  3
 7:   C     7  3
 8:   A     8  4
 9:   B     9  5
10:   B    10  5
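
The multi-vector support and NA treatment mentioned above can be checked with a short sketch. It assumes dplyr 1.1.0 or later is installed. The input vectors are borrowed from comments elsewhere in this thread, and the names `multi` and `with_na` are just for illustration.

```r
library(dplyr)  # consecutive_id() requires dplyr >= 1.1.0

# Two vectors: the id increments whenever the combination of values changes
multi <- consecutive_id(c(1, 1, 1, 2, 2, 2), c(3, 4, 4, 5, 5, 6))
multi
# [1] 1 2 2 3 3 4

# Adjacent NAs are treated as a single run
with_na <- consecutive_id(c(1, 1, 1, NA, NA, 2, 2))
with_na
# [1] 1 1 1 2 2 3 3
```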

edited Feb 02 '23 at 13:27

answered Nov 14 '22 at 07:10

Just checked: it is available with dplyr_1.1.0. – zx8754 Feb 02 '23 at 13:01

score 1 · Answer 6 · answered Aug 27 '22 at 20:15

There are a lot of very good solutions here, but I would like to note that some do not give the same result as data.table::rleid() when the data has NAs. Keep in mind that data.table::rleid() increments every time the value changes, including changes to or from NA.

Data:

library(data.table)
library(dplyr)

# Data
DT2 = data.table(grp=rep(c("A", "B", NA, "C", "A", NA, "B", NA), c(2, 2, 2, 3, 1, 1, 2, 1)), value=1:14)
df <- data.frame(DT2)

# data.table rleid
DT2[, rleid := rleid(DT2$grp)]
DT2
#>      grp value rleid
#>  1:    A     1     1
#>  2:    A     2     1
#>  3:    B     3     2
#>  4:    B     4     2
#>  5: <NA>     5     3
#>  6: <NA>     6     3
#>  7:    C     7     4
#>  8:    C     8     4
#>  9:    C     9     4
#> 10:    A    10     5
#> 11: <NA>    11     6
#> 12:    B    12     7
#> 13:    B    13     7
#> 14: <NA>    14     8

For example, Alex's solution is perfect for the OP's data but doesn't give the same result as data.table::rleid() when dealing with NAs. The comparison with NA yields NA, which cumsum() then propagates to every later row:

# Alex's solution
df %>% 
  mutate(rleid = (grp != lag(grp, 1, default = "asdf"))) %>%
  mutate(rleid = cumsum(rleid))
#>     grp value rleid
#> 1     A     1     1
#> 2     A     2     1
#> 3     B     3     2
#> 4     B     4     2
#> 5  <NA>     5    NA
#> 6  <NA>     6    NA
#> 7     C     7    NA
#> 8     C     8    NA
#> 9     C     9    NA
#> 10    A    10    NA
#> 11 <NA>    11    NA
#> 12    B    12    NA
#> 13    B    13    NA
#> 14 <NA>    14    NA

Here is an easy-to-read tidyverse equivalent to data.table::rleid() (although it is slower):

# like rleid()
df %>% 
  mutate(
    rleid = cumsum(
      ifelse(is.na(grp), "DEFAULT", grp) != lag(ifelse(is.na(grp), "DEFAULT", grp), default = "DEFAULT")
    )
  )
#>     grp value rleid
#> 1     A     1     1
#> 2     A     2     1
#> 3     B     3     2
#> 4     B     4     2
#> 5  <NA>     5     3
#> 6  <NA>     6     3
#> 7     C     7     4
#> 8     C     8     4
#> 9     C     9     4
#> 10    A    10     5
#> 11 <NA>    11     6
#> 12    B    12     7
#> 13    B    13     7
#> 14 <NA>    14     8

Here is a similar easy-to-read tidyverse version that behaves like data.table::rleid() but ignores NAs, so an NA row keeps the id of the preceding run:

# like rleid() but ignoring NAs
df %>% 
  mutate(
    rleid = cumsum(
      (!is.na(grp)) & (grp != lag(ifelse(is.na(grp), "DEFAULT", grp), default = "DEFAULT"))
    )
  )
#>     grp value rleid
#> 1     A     1     1
#> 2     A     2     1
#> 3     B     3     2
#> 4     B     4     2
#> 5  <NA>     5     2
#> 6  <NA>     6     2
#> 7     C     7     3
#> 8     C     8     3
#> 9     C     9     3
#> 10    A    10     4
#> 11 <NA>    11     4
#> 12    B    12     5
#> 13    B    13     5
#> 14 <NA>    14     5

Created on 2022-08-27 with reprex v2.0.2
