0

The data is in the following format, where i have to group_by it using Date. For convenience i have shown it as numbers.

Msg <- c("Errors","Errors", "Start","Stop","Start","Stop","Errors","Errors","Start","Stop",
         "Stop" ,"Start","Errors","Start","Stop","Start" ,"Stop" ,
         "Errors", "Start","Errors","Stop", "Start", "LostControl","LostControl", "Errors",
         "Failed", "Stop", "Start","Failed","Stop","Stop","Start","Stop","Start","Error","Start",
         "Failed", "Stop")
Date <- c(11,11,11,11,11,11,11,12,12,12,12,12,12,14,14,14,14, 19,19,19,19,
        20,20,20,20,20,20,21,21,21,21,22,22,22,22,22,22,22)
data<- data.frame(Msg,Date)

I need to count the number of Failed in each START-STOP cycle, summarized by Date.
The data has three types of Messages. Errors and Failed are two type of Failure msgs, whereas LostControl is not a Failure. The condition is that a Failed msg shall not be preceded by a LostControl msg in that START-STOP cycle. If it is preceded by Errors only, it is Failure. Also, If only a "Errors" msg is found, it is also not counted as a Failure.

Edit: In the Msg vector, a START_STOP cycle is from extreme start to extreme stop iff two Starts or stops are found. If a START does not have a STOP follwing, it is ignored.

Edit one row added as - (Msg =Stop, Date=20)

Ram
  • 69
  • 7

1 Answers1

1

We can modify that function I wrote in your post yesterday.

between_valid_anchors <- function(x, bgn = "Start", end = "Stop") {
  are_anchors <- x %in% c(bgn, end)
  xid <- seq_along(x)
  id <- xid[are_anchors]
  x <- x[are_anchors]
  start_pos <- id[which(x == bgn & c("", head(x, -1L)) %in% c("", end))]
  stop_pos <- id[which(x == end & c(tail(x, -1L), "") %in% c("", bgn))]
  if (length(start_pos) < 1L || length(stop_pos) < 1L)
    return(logical(length(xid)))
  xid %in% unlist(mapply(`:`, start_pos, stop_pos))
}

Then just

library(dplyr)

data %>% 
  group_by(Date) %>% 
  filter(between_valid_anchors(Msg)) %>% 
  summarise(Msg = sum(Msg %in% c("Err", "Errors", "Failed")))

Output

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 6 x 2
   Date   Msg
  <dbl> <int>
1    11     0
2    12     0
3    14     0
4    19     1
5    21     1
6    22     2

Update

You can add one more filter to select only the messages of interest (i.e. Start, Stop, Failed, LostControl). Then, just sum all Msg == "Failed" but not lag(Msg) == "LostControl"

library(dplyr)

data %>% 
  group_by(Date) %>% 
  filter(between_valid_anchors(Msg)) %>% 
  filter(Msg %in% c("Start", "Stop", "Failed", "LostControl")) %>% 
  summarise(Msg = sum(Msg == "Failed" & lag(Msg, default = "") != "LostControl"))

Output

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 7 x 2
   Date   Msg
  <dbl> <int>
1    11     0
2    12     0
3    14     0
4    19     0
5    20     0
6    21     1
7    22     1
ekoam
  • 8,744
  • 1
  • 9
  • 22
  • Thanks :) There was a slight change though, in the last *summarise*, since i wanted to count only the *Failed*. So i modified it to ` %>% summarise(Msg = sum(Msg %in% c( "Failed"))) ` . Thanks again. – Ram Nov 02 '20 at 16:57
  • There is till a slight issue. If a *Failed* msg appears after *LostControl* msg appear, it is not counted as a failure. That is, order of appearance is important. Is there any way to take that into account? I have added One row into the Dataframe to show this case (Date 20 - last row is added) – Ram Nov 03 '20 at 03:48