I need to count events with two specific conditions and aggregate by year. My data example is below:

year <- c(rep(1981,20))
k1 <- c(rep(NA,5),rep("COLD",4),rep(NA,4),"COLD",NA,"COLD",rep(NA,4))
k2 <- c(rep(NA,10),rep("COLD",2),rep(NA,8))
k3 <- c(rep(NA,3),"COLD",rep(NA,16))
k4 <- c(rep(NA,3),rep("COLD",5),rep(NA,2),rep("COLD",5),NA,rep("COLD",4))
k5 <- c(rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,8))

df <- data.frame(year,k1,k2,k3,k4,k5)

I use rle, which I found easy to apply. My code counts the events with at least 5 consecutive "COLD" records, separately for each year. But I need to add another condition: two separate events (each with 5 or more consecutive "COLD" records) must be separated by at least 3 NA records (three gaps). If there are fewer than 3 NA records between them, they are the same event. My code:

rle_col = function(k_col, num = 5){
    k_col[is.na(k_col)] = "NA" # convert NAs
    r = rle(k_col) # run length encoding
    which_cold = r$values == "COLD"
    sum(r$lengths[which_cold] >= num)
}

result <- aggregate(df[2:6],by = list(df$year), rle_col)
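
As a small illustration of the run-length encoding step (a sketch only; `runs` and `cold_runs` are names introduced here), this is what rle() gives for k4 from the example data. It encodes is.na(k4) rather than k4 itself, because rle() treats every NA as its own run:

```r
# k4 from the example data: NA x3, COLD x5, NA x2, COLD x5, NA x1, COLD x4
k4 <- c(rep(NA, 3), rep("COLD", 5), rep(NA, 2), rep("COLD", 5), NA, rep("COLD", 4))

# Encode "is this record missing?" so that consecutive NAs form one run
runs <- rle(is.na(k4))

runs$values   # TRUE = gap (NA), FALSE = "COLD"
runs$lengths  # length of each run, in order of appearance

# Lengths of the COLD runs only
cold_runs <- runs$lengths[!runs$values]
cold_runs
```

The gaps between the three COLD runs here are 2 and 1 records long, which is why the second condition should merge them into a single event.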

I tried the code below, but unfortunately it doesn't work as I expected. Any suggestions? Thanks!

rle_col = function(k_col, num = 5, numm = 3){
    k_col[is.na(k_col)] = "NA" # convert NAs
    r = rle(k_col) # run length encoding
    which_cold = r$values == "COLD"
    which_gap = r$values == "NA"
    sum(r$lengths[which_cold] >= num & r$lengths[which_gap] >= numm)
}
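
One likely reason this attempt misbehaves, shown as a small sketch: `r$lengths[which_cold]` and `r$lengths[which_gap]` are two separate vectors, so `&` pairs the i-th COLD run with the i-th gap by position (recycling the shorter vector) rather than with the gaps next to it. The vector `x` and the names `cold_len` and `gap_len` are made up for illustration and are not taken from the data above.

```r
# Hypothetical vector: two COLD runs of 5 separated by a single NA.
# Under the rule in the question this should be one event.
x <- c(rep("COLD", 5), NA, rep("COLD", 5))
x[is.na(x)] <- "NA"
r <- rle(x)

cold_len <- r$lengths[r$values == "COLD"]  # one entry per COLD run
gap_len  <- r$lengths[r$values == "NA"]    # one entry per gap

# The vectors have different lengths, so the single gap is recycled
# against both COLD runs and neither run is counted
cold_len >= 5 & gap_len >= 3
sum(cold_len >= 5 & gap_len >= 3)
```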

The result I want should look like this:

     year    k1    k2    k3    k4    k5
     <dbl> <int> <int> <int> <int> <int>
     1981     0     0     0     1     0
Indrute
  • Please show your expected output as well in the post so that it is easier to crosscheck – akrun Oct 28 '21 at 18:08
  • The 'year' column actually shows a date with hour:minute:second, so it is confusing when you group by 'year', i.e. each group has only a single row based on the input. – akrun Oct 28 '21 at 18:25
  • My apologies. I edited the data example and added the expected output. – Indrute Oct 28 '21 at 18:49

1 Answer

We may use tidyverse

library(dplyr)
df %>% 
    group_by(year) %>% 
    summarise(across(starts_with('k'), rle_col))
# A tibble: 1 × 6
   year    k1    k2    k3    k4    k5
  <dbl> <int> <int> <int> <int> <int>
1  1981     0     0     0     1     0

where rle_col is

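rle_col <- function(k_col, num = 5) {
    # lag() and lead() are the dplyr versions (library(dplyr) is loaded above)
    with(rle(is.na(k_col)), {
        i1 <- values
        # mark NA runs shorter than 3 as too short to separate two events
        i1[values & lengths < 3] <- 'Invalid'
        # count non-NA runs of at least `num` whose neighbouring NA run
        # on at least one side is not marked 'Invalid'
        sum(!values & lengths >= num &
            (lag(i1) != "Invalid" | lead(i1) != "Invalid"), na.rm = TRUE)
    })
}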
akrun
  • Unfortunately, this answer does not solve my problem, which is to include the second condition of at least 3 NA. This can be checked with new data, where k4 should be 1 instead of 2: `k1 <- c(rep(NA,5),rep("COLD",4),rep(NA,4),"COLD",NA,"COLD",rep(NA,4)); k2 <- c(rep(NA,10),rep("COLD",2),rep(NA,8)); k3 <- c(rep(NA,3),"COLD",rep(NA,16)); k4 <- c(rep(NA,3),rep("COLD",5),NA,rep("COLD",5),rep(NA,2),rep("COLD",4)); k5 <- c(rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,8))` – Indrute Oct 28 '21 at 18:01
  • @Indrute can you show your expected output? I was only using your function to do this. – akrun Oct 28 '21 at 18:06
  • @Indrute can you try the updated code with `rle_col`? – akrun Oct 28 '21 at 20:11
  • the new rle_col code gives me all columns with 0. There is something wrong there... – Indrute Oct 28 '21 at 21:05
  • @Indrute I tested on the data you showed in the post. It gives 1 though for k4 – akrun Oct 28 '21 at 21:07
  • @Indrute I double checked. It still gives the same result. Maybe you need to check the data you used. – akrun Oct 28 '21 at 21:24
  • Yes, I found the problem: I should convert the "NA" strings into true NAs. Now this code gives me an answer, but it works well for some years and gives a wrong answer for others. I will check it tomorrow and add a comment. I guess that the problem is when k <- c(rep("COLD",5),rep("NA",3),rep("COLD",5),rep("NA", 3), rep("COLD",5)). – Indrute Oct 28 '21 at 21:45
  • @Indrute you shouldn't use `"NA"` to create a missing value. It should be written without quotes, i.e. `NA`. The `is.na` in the code only picks up `NA`, not `"NA"`. – akrun Oct 28 '21 at 21:47
  • Thank you, @akrun! I checked the code for some years and it seemed to work great. But then I found a year where it gives a wrong result: counting manually I get 3 events, but the code only gives me 2. I am adding the data here: `k <- c("COLD", rep(NA, 3), rep("COLD",7),NA, rep("COLD", 5), NA, rep("COLD",4), rep(NA, 3), "COLD", NA, NA, rep("COLD",2), NA, rep("COLD", 15), rep(NA, 4), "COLD", rep(NA, 5), rep("COLD",2), rep(NA, 7), rep("COLD",4), rep(NA,3), rep("COLD",5),NA, NA, "COLD", NA, rep("COLD", 2), rep(NA, 7)); year <- rep(1991, 90); df <- data.frame(year, k)` – Indrute Oct 29 '21 at 09:29
  • The problem is in this part: `"COLD", NA, NA, rep("COLD",2), NA, rep("COLD", 15), rep(NA, 4)`. Even with a sequence like `"COLD", NA, NA, "COLD", NA, NA, rep("COLD", 15), rep(NA, 4)`, this code doesn't count `rep("COLD", 15)` as an event. Only with a sequence like `"COLD", rep(NA, 5), rep("COLD", 15), rep(NA, 4)` would it be OK. I hope I explained it clearly enough. – Indrute Oct 29 '21 at 14:02
  • @Indrute i believe you need the `lead` as well. Can you try the update in my code i.e. `v1 <- c("COLD", NA, NA, rep("COLD",2), NA, rep("COLD", 15), rep(NA, 4)); rle_col(v1) [1] 1` – akrun Oct 29 '21 at 16:20
  • Thanks a lot, @akrun. I added `lead` as the main condition and edited the `lag` condition. Now it works perfectly, except for the beginning and the end of the group, but I think it can be corrected easily. My new `rle` code: `rle_col <- function(k_col, num = 5) { with(rle(is.na(k_col)), { i1 <- values; i1[values & lengths < 2] <- 'Invalid'; sum(!values & lengths >= 5 & (lead(i1) != "Invalid" & lag(i1)>=1), na.rm = TRUE) }) }` – Indrute Oct 29 '21 at 21:31
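
Reading the thread as a whole, one way to express the rule directly is to treat only NA runs of at least three records as separators, and to count the blocks between separators that contain a long enough COLD run. The following is a minimal base R sketch and not the code from the answer above. The names `count_events` and `min_gap` are illustrative, and it assumes that every non-NA record is "COLD" and that a merged event still needs at least one run of `num` consecutive COLD records.

```r
# Example data from the question
year <- rep(1981, 20)
k1 <- c(rep(NA,5),rep("COLD",4),rep(NA,4),"COLD",NA,"COLD",rep(NA,4))
k2 <- c(rep(NA,10),rep("COLD",2),rep(NA,8))
k3 <- c(rep(NA,3),"COLD",rep(NA,16))
k4 <- c(rep(NA,3),rep("COLD",5),rep(NA,2),rep("COLD",5),NA,rep("COLD",4))
k5 <- c(rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,8))
df <- data.frame(year,k1,k2,k3,k4,k5)

# Hypothetical helper: count events, where NA runs shorter than `min_gap`
# do not split an event
count_events <- function(k_col, num = 5, min_gap = 3) {
    r <- rle(is.na(k_col))
    # only NA runs of at least `min_gap` records separate two events
    is_sep <- r$values & r$lengths >= min_gap
    # block label of each run; the label increases at every separator
    block <- cumsum(is_sep)
    # non-NA runs that are long enough to make their block an event
    long_cold <- !r$values & r$lengths >= num
    # number of distinct blocks containing at least one such run
    length(unique(block[long_cold]))
}

# Same aggregation as in the question, with real NA values (not "NA" strings)
result <- aggregate(df[2:6], by = list(year = df$year), FUN = count_events)
result  # k4 is 1, all other columns are 0
```

Because the start and the end of the series also close a block, this version needs no special handling at the boundaries of a group.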