I have a data_frame where a character variable x changes over time. I want to count the number of times it changes, and fill a new vector with this running count.

df <- data_frame(
  x = c("a", "a", "b", "b", "c", "b"),
  wanted = c(1, 1, 2, 2, 3, 4)
)
  x wanted
1 a      1
2 a      1
3 b      2
4 b      2
5 c      3
6 b      4

This is similar to, but different from rle(df$x), which would return

Run Length Encoding
  lengths: int [1:4] 2 2 1 1
  values : chr [1:4] "a" "b" "c" "b"

I could try to rep() that output. I have also tried this, which is awfully close but not quite right, for reasons I can't immediately figure out:

 df %>% mutate( 
   try_1 = cumsum(ifelse(x == lead(x) | is.na(lead(x)), 1, 0)) 
   )
Source: local data frame [6 x 3]

  x wanted try_1
1 a      1     1
2 a      1     1
3 b      2     2
4 b      2     2
5 c      3     2
6 b      4     3

It seems like there should be a function that does this directly, that I just haven't found in my experience.

gregmacfarlane

2 Answers

6

Try this dplyr code:

df %>%
  mutate(try_1 = cumsum(ifelse(x != lag(x) | is.na(lag(x)), 1, 0)))

  x wanted try_1
1 a      1     1
2 a      1     1
3 b      2     2
4 b      2     2
5 c      3     3
6 b      4     4

Yours was saying: increment the count if a value is the same as the following row's value, or if the following row's value is NA.

This says: increment the count if the value on this row differs from the one on the previous row, or if there is no previous row (e.g., row 1).
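To see the difference between the two conditions, here is a minimal base R sketch that computes both on a plain vector. The `prev_x` and `next_x` vectors are hand-rolled stand-ins for `lag(x)` and `lead(x)`, and all variable names are made up for illustration.

```r
x <- c("a", "a", "b", "b", "c", "b")

# Hand-rolled stand-ins for lag(x) and lead(x) on a plain vector
prev_x <- c(NA, head(x, -1))  # previous row's value, NA on row 1
next_x <- c(tail(x, -1), NA)  # next row's value, NA on the last row

# Question's condition: same as the next row, or there is no next row
same_as_next <- x == next_x | is.na(next_x)

# This answer's condition: differs from the previous row, or there is no previous row
starts_run <- x != prev_x | is.na(prev_x)

lead_count <- cumsum(same_as_next)  # 1 1 2 2 2 3 (the question's try_1)
lag_count  <- cumsum(starts_run)    # 1 1 2 2 3 4 (the wanted column)
```

The `lead` version adds one whenever a row matches its successor, so it counts rows that continue a run rather than rows that start one.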

Sam Firke
  • You created df with data.frame(); the original post created it with data_frame(). Looks like this works on a character variable, not a factor. – Sam Firke Mar 31 '15 at 19:38
  • Yep if I use the `data.frame` from your post but add in `df$x <- as.character(df$x)` it works for me. – Sam Firke Mar 31 '15 at 19:40
  • You could shorten the code to `mutate(df, try_1 = cumsum(x!=lag(x)|is.na(lag(x))))`, since `cumsum` coerces `TRUE/FALSE` to the numeric values `1/0` – akrun Mar 31 '15 at 19:41
  • The `default` option in `lag` can also be used: `mutate(df, try_1 = cumsum(x!=lag(x, default=1)))` – akrun Mar 31 '15 at 20:13
4

You can try

library(data.table) #data.table_1.9.5
setDT(df)[, wanted := rleid(x)][]
#    x wanted
#1: a      1
#2: a      1
#3: b      2
#4: b      2
#5: c      3
#6: b      4

Or a base R option would be

inverse.rle(within.list(rle(as.character(df$x)),
                        values <- seq_along(values)))
#[1] 1 1 2 2 3 4

data

df <- data.frame(x=c("a", "a", "b", "b", "c", "b"))
akrun
  • Ah. What I actually did, to avoid bringing in `data.table`, was to copy the code `rleid=function(x){r=rle(x);rep(1:length(r$lengths),r$lengths)}` from [this comment](http://stackoverflow.com/questions/29122618/r-split-data-frame-using-a-column-that-represents-and-on-off-switch#comment46471954_29122728) – gregmacfarlane Mar 31 '15 at 18:55
  • @gregmacfarlane That works too, but `rleid` is a nice wrapper function – akrun Mar 31 '15 at 18:58
  • Yeah, it is a nice function. In fact, it's a lot like what I was thinking in terms of wrangling with the `rle()` output. But for my purposes just including that function in my script is easier than bringing in `data.table` and worrying about name conflicts. – gregmacfarlane Mar 31 '15 at 19:01
  • `rleid()` also works with more than 1 column (and is written in C for speed). @gregmacfarlane, what name conflicts are you worrying about? – Arun Mar 31 '15 at 19:10
  • `last()`, mostly. I realize that I can be explicit or load `dplyr` afterwards. But for our application it's easier to just not use `data.table`. – gregmacfarlane Mar 31 '15 at 19:54
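
For anyone who wants the dependency-free route discussed in the comments, the quoted one-liner can be written out roughly as below. This is only a sketch: the name `rleid_base` is made up here, and the `as.character()` call is an added assumption so that factor columns are accepted as well (`rle()` itself rejects factors).

```r
# Hypothetical helper; mirrors the rle() + rep() one-liner quoted in the comments
rleid_base <- function(x) {
  r <- rle(as.character(x))             # run lengths of consecutive equal values
  rep(seq_along(r$lengths), r$lengths)  # repeat each run's index by its length
}

df <- data.frame(x = c("a", "a", "b", "b", "c", "b"),
                 stringsAsFactors = FALSE)
df$wanted <- rleid_base(df$x)
df$wanted
# [1] 1 1 2 2 3 4
```

Unlike `data.table::rleid()`, this sketch handles a single vector only.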