Create a grouping variable for consecutive values within a range, after each crossing of a threshold

Question

The following problem:

I have a measurement series with temperatures (see temp.tbl).

temp.tbl <- data.frame(temp = c(13, 14, 13, 11, 10, 12,
                        16, 18, 16, 13, 10, 11,
                        12, 14, 12, 10, 8, 7, 5)
               )

Each time the temperature in the temp column falls below a certain threshold (here: 12), I want to check how many subsequent temperature values remain within a temperature range (here: 9 - 15):

The first value in the temp column that falls below the threshold (12) and all subsequent values (within the range 9 - 15) will receive the value 1 in a new column (grp).
The value that leaves the range for the first time above or below it also gets the value 1 in the grp column. In other words, these values are "combined" and assigned to group 1.
If the temperature falls below the threshold value (12) at a later time for the second time (after having left the range before), this value and the temperatures within the range get the value 2 in the grp column. These values are assigned to group 2.
Each subsequent fall below the threshold and the subsequent "range values" have been "grouped" according to their occurrence (group 3, group 4, etc.).
All other rows receive the value 0 in the grp column.

The solution would look something like this (see temp_solution.tbl):

temp_solution.tbl <- data.frame(temp = c(13, 14, 13, 11, 10, 12, 16, 18, 16, 13, 10, 11, 12, 14, 12, 10, 8, 7, 5),
                        id = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0),
                        onOff = c("off", "off", "off", "on", "on", "on", "on", "off", "off", "off", "on",
                         "on", "on", "on", "on", "on", "on", "off", "off"),
                        startEnd = c("off", "off", "off", "start", "on", "on", "end", "off", "off", "off", "start",
                            "on", "on", "on", "on", "on", "end", "off", "off")
                        )

temp_solution.tbl

    temp    id onOff startEnd  
 1    13     0   off      off     
 2    14     0   off      off     
 3    13     0   off      off     
 4    11     1    on    start # temp below threshold   
 5    10     1    on       on # temp within limits     
 6    12     1    on       on # temp within limits     
 7    16     1    on      end # first temp outside limits  
 8    18     0   off      off     
 9    16     0   off      off     
10    13     0   off      off     
11    10     2    on    start # temp below threshold   
12    11     2    on       on # temp within limits      
13    12     2    on       on      
14    14     2    on       on      
15    12     2    on       on      
16    10     2    on       on      
17     8     2    on      end # first temp outside limits    
18     7     0   off      off     
19     5     0   off      off

The values in column temp of rows 1 - 3 are all above the threshold (12) and get the value 0 in column grp.
The value in column temp in row 4 falls below the threshold for the first time: a 1 is inserted in column grp.
The temperature values in rows 5 - 6 are within the range and are assigned to group 1. The value in row 7 leaves the range for the first time (it exceeds the range) and also receives a 1 in the grp column.
Rows 8 - 10 are above the threshold value and do not fall below it: grp = 0.
Line 11 falls below the threshold value for the second time: grp = 2.
Lines 12 - 16 are within the range after falling below the threshold value: grp = 2.
Line 17 leaves the range for the first time (downwards) and also receives grp = 2.
Lines 18 - 19: grp = 0

The columns onOff and startEnd are only for illustration.

The starting point can be determined e.g. with: temp < 12 & lag(temp > 12). The range again e.g. with: between(temp, 9, 15).

temp.tbl %>% 
mutate(
    start   = temp < 12 & lag(temp > 12),
    range  = between(temp, 9, 15)
    )

But if a start point was "triggered" I have problems to connect this start point with the following range condition to get the values to a coherent "measurement series" (switch is ON). In other words, if the value in the start column changes from FALSE to TRUE, the switch is set to ON. As long as the subsequent values in the range column are TRUE, the switch remains ON; if the value in range changes to FALSE, the switch flips to OF. All subsequent ONs form a group and are numbered in ascending order.

[Create counter of consecutive runs of a certain value](https://stackoverflow.com/questions/27077228/create-counter-of-consecutive-runs-of-a-certain-value/27077300#27077300) — Henrik, Sep 17 '22 at 12:15
@Henrik Thanks for the link. But I have some problems to combine _cumsum_ with a threshold (12) and a consecutive range (15 - 9). — Mark Black, Sep 18 '22 at 05:43
Hi @Mark Black! Yes, I realize that your setting is more complex than in the link. However, I thought that some parts may (hopefully) apply here as well. I don't have time right now to look closer on your question, but it seems like an interesting problem. And not the least: thanks for posting a small, carefully annotated example! Cheers — Henrik, Sep 18 '22 at 08:34
Hello @Henrik, Thanks for your reply. Yes, your link seems to me to offer certain partial answers, but as you mentioned, it is a bit more complex. Thanks also for the compliment: if I already "steal" valuable time from other people, I try to describe my questions somehow as comprehensive and precise as possible...but I don't always succeed as hoped :-) All the best — Mark Black, Sep 18 '22 at 09:38

Henrik · Accepted Answer · 2022-09-18T15:58:14.400

Here I use data.table functions, but it should be straightforward to translate it to base or dplyr.

Check if the current value is below threshold (temp < 12) and (&) the preceeding value is above threshold (shift(temp) > 12). Create a grouping variable ('id') by using cumsum on the logical result.

Check if values are between the lower and upper limits ('rng').

Within each group (by = id):

To include also the first value that is outside the limits, check if current 'rng' is FALSE (!rng) and the preceeding value is TRUE (shift(rng)).

To handles cases where values have been outside the limits and then return back within limits (e.g. row 10), use cummin on the logical 'rng'. Thus, once a value has become FALSE, it stays FALSE. Coerce result to logical.

Check if either (|) of the two tests is TRUE and multiply with 'id'. The FALSE rows will become zero.

library(data.table)
d = as.data.table(tmp.tbl)

d[ , id := cumsum(temp < 12 & shift(temp) > 12)]
d[ , rng := between(temp, lower = 9, upper = 15)]
d[ , id := id * ((!rng & shift(rng)) | as.logical(cummin(rng))), by = id]

     temp    id    rng
    <num> <int> <lgcl>
 1:    13     0   TRUE
 2:    14     0   TRUE
 3:    13     0   TRUE
 4:    11     1   TRUE
 5:    10     1   TRUE
 6:    12     1   TRUE
 7:    16     1  FALSE
 8:    18     0  FALSE
 9:    16     0  FALSE
10:    13     0   TRUE
11:    10     2   TRUE
12:    11     2   TRUE
13:    12     2   TRUE
14:    14     2   TRUE
15:    12     2   TRUE
16:    10     2   TRUE
17:     8     2  FALSE
18:     7     0  FALSE
19:     5     0  FALSE

Hi @Henrik, your suggested solution works perfectly for the example, thanks a lot! But now I almost have a guilty conscience because you worked on a solution despite lack of time. — Mark Black, Sep 18 '22 at 15:46

Create a grouping variable for consecutive values within a range, after each crossing of a threshold

1 Answers1