R- Find date when cumulative sum is above threshold

Question

I am working with thermal requirements for crop growth. I have a table which contains cumulative temperature for a 6-month time period. Sample seen below:

         date      temp   cum_temp
 1: 2020-03-01  9.339748   9.339748
 2: 2020-03-02 23.860849  33.200597
 3: 2020-03-03 12.860331  46.060928
 4: 2020-03-04 26.607505  72.668432
 5: 2020-03-05 28.273551 100.941984
 6: 2020-03-06  2.321138 103.263122
 7: 2020-03-07 16.315059 119.578181
 8: 2020-03-08 26.880152 146.458334
 9: 2020-03-09 16.991615 163.449949
10: 2020-03-10 14.241827 177.691776
11: 2020-03-11 28.748167 206.439943
12: 2020-03-12 14.146691 220.586634
13: 2020-03-13 20.649548 241.236182
14: 2020-03-14 17.606369 258.842551
15: 2020-03-15  3.984816 262.827367

Then, I also have a table with a list of crop growth stages and their thermal requirements (i.e. the thermal thresholds needed to reach each stage):

 growth_stage thermal_req
1:           VE         120
2:           V2         200
3:           V3         350
4:        V5-V6         475
5:        V7-V9         610
6:           R2        1660
7:           R4        1925
8:           R5        2450
9:           R6        2700

Based on those tables, I need two outcomes:

Looking at the temperature table, update the thermal requirements table with the date when the thermal requirement was reached. For this example it would look like this:

  growth_stage thermal_req date_reached
1:           VE         120   2020-03-08
2:           V2         200   2020-03-11
3:           V3         350   2020-03-21
4:        V5-V6         475   2020-03-26
5:        V7-V9         610   2020-04-03
6:           R2        1660   2020-06-14
7:           R4        1925   2020-06-30
8:           R5        2450   2020-08-06
9:           R6        2700   2020-08-23

Update the original temperature table to include a new column "growth stage" based on the thermal requirements shown in the second table. It would look like this:

         date      temp   cum_temp  growth_stage
 1: 2020-03-01  9.339748   9.339748 NA
 2: 2020-03-02 23.860849  33.200597 NA
 3: 2020-03-03 12.860331  46.060928 NA
 4: 2020-03-04 26.607505  72.668432 NA
 5: 2020-03-05 28.273551 100.941984 NA
 6: 2020-03-06  2.321138 103.263122 NA
 7: 2020-03-07 16.315059 119.578181 NA
 8: 2020-03-08 26.880152 146.458334 VE
 9: 2020-03-09 16.991615 163.449949 VE
10: 2020-03-10 14.241827 177.691776 VE
11: 2020-03-11 28.748167 206.439943 V2
12: 2020-03-12 14.146691 220.586634 V2
13: 2020-03-13 20.649548 241.236182 V2
14: 2020-03-14 17.606369 258.842551 V2
15: 2020-03-15  3.984816 262.827367 V2
16: 2020-03-16 27.094924 289.922291 V2
17: 2020-03-17  8.136544 298.058835 V2
18: 2020-03-18  2.219726 300.278562 V2
19: 2020-03-19 10.509701 310.788263 V2
20: 2020-03-20 28.680606 339.468868 V2
21: 2020-03-21 26.796640 366.265509 V3
22: 2020-03-22 21.091299 387.356807 V3
23: 2020-03-23 19.574698 406.931505 V3
24: 2020-03-24 29.833824 436.765328 V3
25: 2020-03-25 20.015468 456.780797 V3
26: 2020-03-26 21.547384 478.328180 V5-V6
27: 2020-03-27 16.777915 495.106095 V5-V6
28: 2020-03-28 18.230119 513.336214 V5-V6
29: 2020-03-29  9.385632 522.721846 V5-V6
30: 2020-03-30  5.266296 527.988142 V5-V6
31: 2020-03-31 28.927703 556.915844 V5-V6
32: 2020-04-01 27.166672 584.082517 V5-V6
33: 2020-04-02 21.030453 605.112970 V5-V6
34: 2020-04-03 24.068555 629.181525 V5-V6
35: 2020-04-04  1.713797 630.895322 V5-V6
36: 2020-04-05 14.856083 645.751405 V5-V6
37: 2020-04-06 22.995327 668.746732 V5-V6
38: 2020-04-07  7.275830 676.022562 V5-V6
39: 2020-04-08 10.227249 686.249811 V5-V6
40: 2020-04-09  7.717148 693.966959 V5-V6
          date      temp   cum_temp growth_stage

What is the best way to achieve these outcomes?

Data used to reproduce this problem:

# load required packages
library(data.table)

# generate data
dates <- seq(as.Date("2020-03-01"), as.Date("2020-08-31"), by="days")
set.seed(123); temps <- runif(length(dates), min=1, max=30)
dat <- data.table(date=dates,
                  temp=temps)
# cumulative sum
dat$cum_temp <- cumsum(dat$temp)

# table with growth stage thermal requirements 
sum_req <- data.table(growth_stage=c("VE","V2","V3","V5-V6","V7-V9","R2","R4","R5","R6"),
                      thermal_req=c(120,200,350,475,610,1660,1925,2450,2700))

Shouldn't be the `growth_stage` in `dat` starting in line 34 `V7-V9` instead of the given `V5-V6` ? — GKi, Aug 30 '23 at 06:48

LMc · Answer 1 · 2023-08-29T18:54:57.170

Base R has a cut function to divide values into intervals and optionally provide those intervals with a label. This will help accomplish the first task easily. Then you can simply filter that data frame to get the requested "date reached" data frame:

library(dplyr)

updated <- dat |>
  mutate(growth_stage = cut(cum_temp, 
                            breaks = c(sum_req$thermal_req, Inf), 
                            labels = sum_req$growth_stage, 
                            right = F))

#           date      temp   cum_temp growth_stage
#  1  2020-03-01  9.339748   9.339748         <NA>
#  2  2020-03-02 23.860849  33.200597         <NA>
#  3  2020-03-03 12.860331  46.060928         <NA>
#  4  2020-03-04 26.607505  72.668432         <NA>
#  5  2020-03-05 28.273551 100.941984         <NA>
#  6  2020-03-06  2.321138 103.263122         <NA>
#  7  2020-03-07 16.315059 119.578181         <NA>
#  8  2020-03-08 26.880152 146.458334          VE
#  9  2020-03-09 16.991615 163.449949          VE
# 10  2020-03-10 14.241827 177.691776          VE
# 11  2020-03-11 28.748167 206.439943          V2
# 12  2020-03-12 14.146691 220.586634          V2
# 13  2020-03-13 20.649548 241.236182          V2
# 14  2020-03-14 17.606369 258.842551          V2
# 15  2020-03-15  3.984816 262.827367          V2

updated |>
  slice_min(date, n = 1, by = growth_stage) |>
  rename(date_reached = date) |>
  filter(!is.na(growth_stage))

#   date_reached     temp cum_temp growth_stage
# 1   2020-03-08 26.88015 146.4583           VE
# 2   2020-03-11 28.74817 206.4399           V2

If you need the minimum requirements then you could easily join your sum_req data frame by growth_stage.

score 3 · Accepted Answer · answered Aug 29 '23 at 19:29

data.table

Up front, all working code in one block, no output:

dat[, nextdate := shift(date, type="lead", fill=last(date))]
sum_req[dat, date_reached := i.nextdate, on=.(thermal_req > cum_temp)]
dat[, nextdate := NULL] # cleanup
dat[sum_req, growth_stage := i.growth_stage, on = .(date == date_reached), roll = TRUE
  ][, growth_stage := zoo::na.locf(growth_stage, na.rm = FALSE)]

Walk-through

First, I'll add a nextdate to dat so that the inherent internal (rolling?) join finds the correct date. Then we'll join and assign this to sum_req:

dat[, nextdate := shift(date, type="lead", fill=last(date))]
sum_req[dat, date_reached := i.nextdate, on=.(thermal_req > cum_temp)]
#    growth_stage thermal_req date_reached
#          <char>       <num>       <Date>
# 1:           VE         120   2020-03-08
# 2:           V2         200   2020-03-11
# 3:           V3         350   2020-03-21
# 4:        V5-V6         475   2020-03-26
# 5:        V7-V9         610   2020-04-03
# 6:           R2        1660   2020-06-14
# 7:           R4        1925   2020-06-30
# 8:           R5        2450   2020-08-06
# 9:           R6        2700   2020-08-23
dat[, nextdate := NULL] # cleanup

Then I'll do a rolling join,

dat[sum_req, growth_stage := i.growth_stage, on = .(date == date_reached), roll = TRUE]
#            date     temp   cum_temp growth_stage
#          <Date>    <num>      <num>       <char>
#   1: 2020-03-01  9.33975    9.33975         <NA>
#   2: 2020-03-02 23.86085   33.20060         <NA>
#   3: 2020-03-03 12.86033   46.06093         <NA>
#   4: 2020-03-04 26.60750   72.66843         <NA>
#   5: 2020-03-05 28.27355  100.94198         <NA>
#   6: 2020-03-06  2.32114  103.26312         <NA>
#   7: 2020-03-07 16.31506  119.57818         <NA>
#   8: 2020-03-08 26.88015  146.45833           VE
#   9: 2020-03-09 16.99162  163.44995         <NA>
#  10: 2020-03-10 14.24183  177.69178         <NA>
#  ---                                            
# 175: 2020-08-22 20.38025 2699.72879         <NA>
# 176: 2020-08-23 18.92252 2718.65131           R6
# 177: 2020-08-24 11.79490 2730.44621         <NA>
# 178: 2020-08-25 16.36523 2746.81145         <NA>
# 179: 2020-08-26 26.36579 2773.17724         <NA>
# 180: 2020-08-27 17.87075 2791.04799         <NA>
# 181: 2020-08-28 25.35327 2816.40126         <NA>
# 182: 2020-08-29 10.06100 2826.46225         <NA>
# 183: 2020-08-30 21.54042 2848.00267         <NA>
# 184: 2020-08-31  8.68552 2856.68819         <NA>

and then LOCF the new growth_stage. Here I'm using zoo::na.locf since data.table::nafill requires numeric/integer.

dat[, growth_stage := zoo::na.locf(growth_stage, na.rm = FALSE)]
#            date     temp   cum_temp growth_stage
#          <Date>    <num>      <num>       <char>
#   1: 2020-03-01  9.33975    9.33975         <NA>
#   2: 2020-03-02 23.86085   33.20060         <NA>
#   3: 2020-03-03 12.86033   46.06093         <NA>
#   4: 2020-03-04 26.60750   72.66843         <NA>
#   5: 2020-03-05 28.27355  100.94198         <NA>
#   6: 2020-03-06  2.32114  103.26312         <NA>
#   7: 2020-03-07 16.31506  119.57818         <NA>
#   8: 2020-03-08 26.88015  146.45833           VE
#   9: 2020-03-09 16.99162  163.44995           VE
#  10: 2020-03-10 14.24183  177.69178           VE
#  ---                                            
# 175: 2020-08-22 20.38025 2699.72879           R5
# 176: 2020-08-23 18.92252 2718.65131           R6
# 177: 2020-08-24 11.79490 2730.44621           R6
# 178: 2020-08-25 16.36523 2746.81145           R6
# 179: 2020-08-26 26.36579 2773.17724           R6
# 180: 2020-08-27 17.87075 2791.04799           R6
# 181: 2020-08-28 25.35327 2816.40126           R6
# 182: 2020-08-29 10.06100 2826.46225           R6
# 183: 2020-08-30 21.54042 2848.00267           R6
# 184: 2020-08-31  8.68552 2856.68819           R6

For easy viewing, I'll show the change-points:

dat[, .SD[1,], by = .(growth_stage)]
#     growth_stage       date     temp   cum_temp
#           <char>     <Date>    <num>      <num>
#  1:         <NA> 2020-03-01  9.33975    9.33975
#  2:           VE 2020-03-08 26.88015  146.45833
#  3:           V2 2020-03-11 28.74817  206.43994
#  4:           V3 2020-03-21 26.79664  366.26551
#  5:        V5-V6 2020-03-26 21.54738  478.32818
#  6:        V7-V9 2020-04-03 24.06856  629.18152
#  7:           R2 2020-06-14 26.82016 1660.54649
#  8:           R4 2020-06-30 10.27480 1927.39382
#  9:           R5 2020-08-06 13.89827 2459.84312
# 10:           R6 2020-08-23 18.92252 2718.65131

(I think your sample output missed the transition to V7-V9?)

mweylandt · Answer 3 · 2023-08-29T18:24:51.753

Probably better ways of doing this, but if the dates are in order and the temperatures always increase, it could be as simple as:

dat$growth_stage <- NA
sum_req$date_reached <- NA

    for (i in 1:nrow(sum_req)) {
  
      indices <- which(dat$cum_temp >= sum_req$thermal_req[i])
      dat$growth_stage[indices] <- sum_req$growth_stage[i]

      first_occurrence <-  indices[1]
      sum_req$date_reached[i] <- dat$date[first_occurrence]
    }

sum_req$date_reached <- as.Date(sum_req$date_reached, origin = "1970-01-01")

Find which rows of the dataset have a temperature above the threshold, and then write the corresponding growth stage to the column. In the beginning that's a lot; as we get to different growth stages this column gets overwritten.

Again, works here, might not generalize well

score 1 · Answer 4 · answered Aug 30 '23 at 06:46

An option would be to use findInterval.

sum_req$date_reached <-
  dat$date[1+findInterval(sum_req$thermal_req, dat$cum_temp)]

dat$growth_stage <-
  c(NA, sum_req$growth_stage)[1+findInterval(dat$cum_temp, sum_req$thermal_req)]

sum_req
#   growth_stage thermal_req date_reached
#1:           VE         120   2020-03-08
#2:           V2         200   2020-03-11
#3:           V3         350   2020-03-21
#4:        V5-V6         475   2020-03-26
#5:        V7-V9         610   2020-04-03
#6:           R2        1660   2020-06-14
#7:           R4        1925   2020-06-30
#8:           R5        2450   2020-08-06
#9:           R6        2700   2020-08-23

head(dat, 40)
#          date      temp   cum_temp growth_stage
# 1: 2020-03-01  9.339748   9.339748         <NA>
# 2: 2020-03-02 23.860849  33.200597         <NA>
# 3: 2020-03-03 12.860331  46.060928         <NA>
# 4: 2020-03-04 26.607505  72.668432         <NA>
# 5: 2020-03-05 28.273551 100.941984         <NA>
# 6: 2020-03-06  2.321138 103.263122         <NA>
# 7: 2020-03-07 16.315059 119.578181         <NA>
# 8: 2020-03-08 26.880152 146.458334           VE
# 9: 2020-03-09 16.991615 163.449949           VE
#10: 2020-03-10 14.241827 177.691776           VE
#11: 2020-03-11 28.748167 206.439943           V2
#12: 2020-03-12 14.146691 220.586634           V2
#13: 2020-03-13 20.649548 241.236182           V2
#14: 2020-03-14 17.606369 258.842551           V2
#15: 2020-03-15  3.984816 262.827367           V2
#16: 2020-03-16 27.094924 289.922291           V2
#17: 2020-03-17  8.136544 298.058835           V2
#18: 2020-03-18  2.219726 300.278562           V2
#19: 2020-03-19 10.509701 310.788263           V2
#20: 2020-03-20 28.680606 339.468868           V2
#21: 2020-03-21 26.796640 366.265509           V3
#22: 2020-03-22 21.091299 387.356807           V3
#23: 2020-03-23 19.574698 406.931505           V3
#24: 2020-03-24 29.833824 436.765328           V3
#25: 2020-03-25 20.015468 456.780797           V3
#26: 2020-03-26 21.547384 478.328180        V5-V6
#27: 2020-03-27 16.777915 495.106095        V5-V6
#28: 2020-03-28 18.230119 513.336214        V5-V6
#29: 2020-03-29  9.385632 522.721846        V5-V6
#30: 2020-03-30  5.266296 527.988142        V5-V6
#31: 2020-03-31 28.927703 556.915844        V5-V6
#32: 2020-04-01 27.166672 584.082517        V5-V6
#33: 2020-04-02 21.030453 605.112970        V5-V6
#34: 2020-04-03 24.068555 629.181525        V7-V9
#35: 2020-04-04  1.713797 630.895322        V7-V9
#36: 2020-04-05 14.856083 645.751405        V7-V9
#37: 2020-04-06 22.995327 668.746732        V7-V9
#38: 2020-04-07  7.275830 676.022562        V7-V9
#39: 2020-04-08 10.227249 686.249811        V7-V9
#40: 2020-04-09  7.717148 693.966959        V7-V9

R- Find date when cumulative sum is above threshold

4 Answers4

data.table

Walk-through