I have a data frame with several columns. I need to regroup the sequence in col2 so that every time the label changes from a to b or from b to a, a new group with a new label starts. The expected result is shown in the Desired column.

testdf <- data.frame(mydate = seq(as.Date('2012-01-01'), 
                                  as.Date('2012-01-10'), by = 'day'),
                     col1 = 1:10,
                     col2 = c("a","a","b","b","a","b","a","b","a","a"),
                     Desired= c(1,1,2,2,3,4,5,6,7,7))

       mydate col1 col2 Desired
1  2012-01-01    1    a       1
2  2012-01-02    2    a       1
3  2012-01-03    3    b       2
4  2012-01-04    4    b       2
5  2012-01-05    5    a       3
6  2012-01-06    6    b       4
7  2012-01-07    7    a       5
8  2012-01-08    8    b       6
9  2012-01-09    9    a       7
10 2012-01-10   10    a       7
Is there a way to solve this without for loops? The dataset has more than 1 million rows.
user227710
  • I think this is a duplicate question, but here's one way: `r <- rle(as.character(testdf$col2)); r$values <- seq_along(r$values); inverse.rle(r)` There is also a nice function for this, `rleid`, in the `data.table` package. – Frank Jul 09 '15 at 16:28
  • General advice: with that many records, you should consider using data tables instead of dataframes (for code elegance and computational efficiency), [see this](http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf) – shekeine Jul 09 '15 at 16:35
  • As @Frank mentioned, just `library(data.table) ; rleid(testdf$col2)` should do (with the devel version) – David Arenburg Jul 09 '15 at 18:49
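
The `rle` approach from the comments can be sketched as a small standalone script. This is only an illustration in base R; the variable name `run_id` is made up here, and the `rleid` cross-check runs only if `data.table` happens to be installed.

```r
# Labels from the question's col2 column
col2 <- c("a", "a", "b", "b", "a", "b", "a", "b", "a", "a")

# Run-length encode the labels, then replace each run's value with its index
r <- rle(col2)
r$values <- seq_along(r$values)

# Expand the runs back to one group id per row
run_id <- inverse.rle(r)
print(run_id)

# Optional cross-check against data.table::rleid, skipped if the package is absent
if (requireNamespace("data.table", quietly = TRUE)) {
  stopifnot(all(data.table::rleid(col2) == run_id))
}
```

Both `rle` and `inverse.rle` are vectorised, so no explicit loop over the rows is involved.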

2 Answers

You could try this:

# factor() makes this work whether col2 is stored as character or as factor
output <- c(0, cumsum(diff(as.numeric(factor(testdf$col2))) != 0)) + 1
#> output
#[1] 1 1 2 2 3 4 5 6 7 7
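
A variant of the same idea compares each element with its predecessor directly, which avoids the numeric conversion. The helper below is a sketch, not part of the answer: the name `run_id` is hypothetical, and it assumes the vector contains no `NA` values (an `NA` would propagate through `cumsum`).

```r
# Hypothetical helper: assign a new group id every time the value changes.
# Assumes x has no NA values.
run_id <- function(x) {
  n <- length(x)
  if (n == 0L) return(integer(0))
  # TRUE marks the first element and every position that differs from the
  # previous one; the cumulative sum of those marks is the group id.
  cumsum(c(TRUE, x[-1L] != x[-n]))
}

col2 <- c("a", "a", "b", "b", "a", "b", "a", "b", "a", "a")
groups <- run_id(col2)
print(groups)
```

The comparison is vectorised over the whole column, so it should scale to the million-row case without a loop.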
RHertel

This is a more in vogue way of doing this.

testdf %>% group_by(col2) %>% mutate(first = cumsum(as.numeric(col2)))
daniel
  • It may be "en vogue", but are you sure that this produces the desired output? If I remove the target column with `testdf <- testdf[,-4]` and use, according to your command sequence, `p <- testdf %>% group_by(col2) %>% mutate(first = cumsum(as.numeric(col2)))`, then this yields on my computer a result for `p` that does not much resemble the desired output. – RHertel Jul 09 '15 at 17:05
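
For a `dplyr` pipeline that starts a new group at each label change, one option is to compare `col2` with its lagged value instead of grouping by `col2`. This is a minimal sketch, assuming `dplyr` is installed; the column name `grp` is made up for illustration.

```r
library(dplyr)

testdf <- data.frame(
  col1 = 1:10,
  col2 = c("a", "a", "b", "b", "a", "b", "a", "b", "a", "a")
)

result <- testdf %>%
  # lag() shifts col2 down by one row; the default fills the first row with
  # the first label so that row is not counted as a change.
  mutate(grp = cumsum(col2 != lag(col2, default = first(col2))) + 1)

print(result)
```

Grouping by `col2` itself cannot work here, because it merges all rows labelled `a` into one group regardless of where they occur in the sequence.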