consecutive group number with a threshold in R

Question

This problem is very similar to Consecutive group number in R, but I think it is a different and much harder problem.

I am currently working with car data. We recorded the speed of the car every 5 minutes, and the data contain a lot of zero values. I want to add a new column in which every run of k or more consecutive zero speeds is numbered 0, while the other sections are numbered consecutively (starting from 1). Let's take some sample data as an example:

sample <- data.frame(
  id = 1:15, 
  speed = c(50, 0, 0, 0, 50, 40, 0, 0, 25, 30, 50, 0, 30, 50, 40))

For this sample data, say k equals 2; my desired result would then look like this:

    id speed number
1   1    50      1
2   2     0      0
3   3     0      0
4   4     0      0
5   5    50      2
6   6    40      2
7   7     0      0
8   8     0      0
9   9    25      3
10 10    30      3
11 11    50      3
12 12     0      3  <- here is the difference
13 13    30      3
14 14    50      3
15 15    40      3

There are more than 1 million rows in my data, so I hope the solution is acceptably fast.

The reason for setting a threshold k is that some drivers leave their GPS on even after they lock the car and go to sleep. On other occasions, where the run of zeros is shorter than k, they only stopped at a traffic light. I want to focus on the long stops and ignore the short ones.

Hope my question makes sense to you. Thank you.

You can adapt one of the answers from that question, for ex. `r <- rle(x !=0 | (x==0 & lag(x)>0 & lead(x)>0)) ; r$values[r$values] <- cumsum(r$values[r$values]) ; inverse.rle(r)` — Lamia, Aug 07 '17 at 22:45
@Lamia Would you mind expanding on that in an answer? Is `lead` from `dplyr` or `data.table` (Or somewhere else?) — Luke C, Aug 07 '17 at 22:55
@LukeC I slightly modified one of the answers to the question the OP mentioned and this related [question](https://stackoverflow.com/questions/27077228/consecutive-value-after-column-value-change-in-r). Yes, `lead/lag` are from the `dplyr` package. — Lamia, Aug 07 '17 at 23:08
@Lamia Great, thank you. I did `x <- sample$speed` and don't quite get the values in OP's `$number` column (although it's close). I'll keep fiddling and reread those linked questions to see if I'm missing something- thanks for your response. — Luke C, Aug 07 '17 at 23:13
@LukeC there is a discrepancy between the data in the dataframe and in the example shown (4th value being 0 or 30). I'll edit the question to remove the error. — Lamia, Aug 07 '17 at 23:49
@Lamia Thanks. Your solution works perfectly for this example in which k equals 2. However, this solution seems not to work as k scales up. — Miao Cai, Aug 08 '17 at 00:38

Lamia · Answer 1 · 2017-08-08T01:16:23.903

2

You can do this, inspired by user20650's comment to this question:

numbering <- function(v, k) {
  ## First, replace stretches of fewer than k consecutive 0s by 1s,
  ## so that short stops are treated as part of the surrounding drive
  r <- rle(v)
  r$values[r$values == 0 & r$lengths < k] <- 1
  v2 <- inverse.rle(r)

  ## Then number the consecutive stretches of non-zero values;
  ## the remaining (long) stretches of 0s stay at 0
  r2 <- rle(v2 != 0)
  r2$values[r2$values] <- cumsum(r2$values[r2$values])
  inverse.rle(r2)
}

numbering(sample$speed,2)
[1] 1 0 0 0 2 2 0 0 3 3 3 3 3 3 3

numbering(sample$speed,3)
[1] 1 0 0 0 2 2 2 2 2 2 2 2 2 2 2
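
To get the column the question asks for, the result can be assigned back to the data frame. The following is a minimal sketch rather than a benchmarked solution: `numbering()` is restated from the answer so the block runs on its own, and the `data.table` part is an assumption that only runs if that package is installed.

```r
# Same logic as the answer's function, restated so this block is self-contained.
numbering <- function(v, k) {
  r <- rle(v)
  r$values[r$values == 0 & r$lengths < k] <- 1  # short stops count as driving
  r2 <- rle(inverse.rle(r) != 0)
  r2$values[r2$values] <- cumsum(r2$values[r2$values])
  inverse.rle(r2)
}

sample <- data.frame(
  id = 1:15,
  speed = c(50, 0, 0, 0, 50, 40, 0, 0, 25, 30, 50, 0, 30, 50, 40))

# Base R: store the result as a new column.
sample$number <- numbering(sample$speed, 2)

# Optional (assumption: data.table is installed): add the column by reference.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- as.data.table(sample[, c("id", "speed")])
  dt[, number := numbering(speed, 2)]
}
```

Both `rle()` and `inverse.rle()` are vectorised, so the function makes only a few passes over the data; whether that is fast enough for millions of rows is best checked with `system.time()` on the real data. If the real data holds several vehicles, a `by =` grouping on a vehicle column (hypothetical here, since the sample has none) would restart the numbering for each vehicle.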

edited Aug 08 '17 at 01:16

answered Aug 08 '17 at 01:10

Lamia


Thanks for your answer. This works perfectly for this problem. One more question: is this answer fast enough when the data has millions of rows? I know that data.table is a very fast package for dealing with big data, so can I write this function within a data.table? Thanks – Miao Cai Aug 08 '17 at 22:30
You'll have to try and see.. :) Indeed, the data.table is known for its speed in handling large datasets, unfortunately I'm not very familiar with it, so can't say how you could adapt this using data.table. – Lamia Aug 08 '17 at 23:20
Thanks for your answer. I will accept it if no better answer can be provided. – Miao Cai Aug 09 '17 at 01:31
