Labeling contiguous chunks of observations without a for loop

Question

I have a standard 'can-I-avoid-a-loop' problem, but cannot find a solution.

I answered this question by @splaisan but I had to resort to some ugly contortions in the middle section, with a for and multiple if tests. I simulate a simpler version here in the hope that someone can give a better answer...

THE PROBLEM

Given a data structure like this:

df <- read.table(text = 'type
a
a
a
b
b
c
c
c
c
d
e', header = TRUE)

I want to identify contiguous chunks of the same type and label them in groups. The first chunk should be labelled 0, the next 1, and so on. There is an indefinite number of chunks, and each chunk may be as short as only one member.

type    label
   a    0
   a    0
   a    0
   b    1
   b    1
   c    2
   c    2
   c    2
   c    2
   d    3
   e    4

MY SOLUTION

I had to resort to a for loop to do this, here is the code:

label <- 0
df$label <- label

# LOOP through the label column and increment the label
# whenever a new type is found
for (i in 2:length(df$type)) {
    if (df$type[i-1] != df$type[i]) { label <- label + 1 }
    df$label[i] <- label
}

MY QUESTION

Can anyone do this without the loop and conditionals?

See `?rle`, the most useful R function no one can ever find. — joran, May 15 '12 at 22:38
Thanks @joran, I can see how that would help! I will explore it for a while. My first efforts are working but it is still inelegant. I will post an answer if I manage a passable one. — daedalus, May 15 '12 at 22:50
Just feed the lengths component from `rle` into the times argument in `rep`. — joran, May 15 '12 at 22:54
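
As a rough sketch of what that hint amounts to, the snippet below shows what `rle` returns for the example data and how its `lengths` component feeds into `rep`. The vector name `type` is only illustrative, and the data is rebuilt as a plain character vector rather than read with `read.table`.

```r
# Example data from the question, rebuilt as a character vector (assumption:
# the column holds plain strings rather than a factor).
type <- c("a", "a", "a", "b", "b", "c", "c", "c", "c", "d", "e")

r <- rle(type)
r$values   # "a" "b" "c" "d" "e"
r$lengths  # 3 2 4 1 1

# One label per run (0, 1, 2, ...), each repeated by the length of its run.
label <- rep(seq_along(r$lengths) - 1, times = r$lengths)
label      # 0 0 0 1 1 2 2 2 2 3 4
```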

score 6 · Accepted Answer · answered May 15 '12 at 22:58

Using rle

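# as.character() works whether df$type is a factor or a character column
# (as.numeric() on a character column would give NA)
r <- rle(as.character(df$type))
df$label <- rep(seq_along(r$lengths) - 1, times = r$lengths)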

Without rle, using cumsum over a logical vector (TRUE wherever the type differs from the previous row), which is coerced to numeric:

df$label <- c(0,cumsum(df$type[-1] != df$type[-length(df$type)]))

Both give:

> df
   type label
1     a     0
2     a     0
3     a     0
4     b     1
5     b     1
6     c     2
7     c     2
8     c     2
9     c     2
10    d     3
11    e     4
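
To see why the cumsum version works, here is a small step-by-step sketch on a vector where a value comes back in a later chunk. The helper name `label_runs` and the test vector `x` are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper wrapping the cumsum approach from this answer.
label_runs <- function(x) {
  # TRUE at every position whose value differs from the one before it
  changed <- x[-1] != x[-length(x)]
  # Running count of changes; the first element always gets label 0
  c(0, cumsum(changed))
}

x <- c("a", "a", "b", "a", "a", "c")

x[-1] != x[-length(x)]  # FALSE TRUE TRUE FALSE TRUE
label_runs(x)           # 0 0 1 2 2 3

# The rle-based version gives the same labels
r <- rle(x)
rep(seq_along(r$lengths) - 1, times = r$lengths)  # 0 0 1 2 2 3
```

Note that the second run of "a" gets its own label (2) rather than reusing 0, which is what labelling contiguous chunks requires.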

score 3 · Answer 2 · answered May 15 '12 at 23:05

My crack at it:

as.numeric(df[, 1])-1

Tyler Rinker


Oh this is really the same as Joran's, he beat me by a few seconds. You'd have to convert to a factor as he states if type isn't already so. – Tyler Rinker May 15 '12 at 23:09
Yes, you all have me torn as to where to place the green tick! I leave it with Brian given that he had the first complete working version. Thanks to all, though, much appreciated. – daedalus May 15 '12 at 23:11
I think Brian's solution is more generalizable. Don't tick me as Joran beat me and his is essentially the same as mine but better. – Tyler Rinker May 15 '12 at 23:14

score 2 · Answer 3 · answered May 15 '12 at 23:05

This just occurred to me as well, you can simply convert to a factor, then back to integers and subtract one:

as.integer(as.factor(df$type))-1

If type is already a factor, you can skip that step.

joran


...assuming that any single value of `df$type` doesn't appear in more than one chunk and that they appear in alphabetical order. – Brian Diggs May 15 '12 at 23:12
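
A quick sketch of that caveat, using a made-up vector `x` in which "b" appears in two separate chunks and the values are not in alphabetical order:

```r
# Illustrative data (not from the question): "b" occurs in two chunks
x <- c("b", "b", "a", "a", "b")

# Factor-based labels follow the sorted factor levels, not the chunk order
by_factor <- as.integer(as.factor(x)) - 1
by_factor  # 1 1 0 0 1

# Run-based labels (cumsum approach) number the chunks in order of appearance
by_runs <- c(0, cumsum(x[-1] != x[-length(x)]))
by_runs    # 0 0 1 1 2
```

The two agree only when every value forms a single chunk and the chunks happen to appear in sorted order, as in the question's example.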
