2

I have a data set that amoungst other columns has date, sequence and low columns, please see df below. The sequence from 1-to-9 is considered one block or one full cycle in the sequence column The data set has several such complete block/cycles and partially completed ones, eg: 1-to-4

This is what I'm trying to solve:

  1. Remove the partly completed cycles and then group the full cycles (see df1)
  2. For each block/cycle (i.e. sequence from 1-to-9), I want to find the low of the block along with the day the low was happened.
  3. If there are two lows with same value but on different dates, then it should only output the latest date (see the 3rd block in output)

    library(lubridate)
    library(tidyverse)
    ### Sample data
    df <- data.frame(stringsAsFactors=FALSE,
    date = c("1/01/2019", "2/01/2019", "3/01/2019", "4/01/2019",
    "5/01/2019", "6/01/2019", "7/01/2019", "8/01/2019",
    "9/01/2019", "10/01/2019", "11/01/2019", "12/01/2019", "13/01/2019",
    "14/01/2019", "15/01/2019", "16/01/2019", "17/01/2019", "18/01/2019",
    "19/01/2019", "20/01/2019", "21/01/2019", "22/01/2019",
    "23/01/2019", "24/01/2019", "25/01/2019", "26/01/2019", "27/01/2019",
    "28/01/2019", "29/01/2019", "30/01/2019", "31/01/2019",
    "1/02/2019", "2/02/2019", "3/02/2019", "4/02/2019"),
    sequence = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8,
    9, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    low = c(96, 81, 43, 18, 43, 65, 48, 90, 69, 50, 41, 73, 1, 1, 7, 49,
    16, 79, 2, 74, 8, 88, 56, 57, 66, 29, 79, 51, 52, 47, 42, 9,
    41, 9, 50)) %>% mutate(date = dmy(date))
    

    Data grouped by cycle/block

    df1 <- data.frame(stringsAsFactors=FALSE,
        date = c("1/01/2019", "2/01/2019", "3/01/2019", "4/01/2019",
                 "5/01/2019", "6/01/2019", "7/01/2019", "8/01/2019",
                 "9/01/2019", "14/01/2019", "15/01/2019", "16/01/2019", "17/01/2019",
                 "18/01/2019", "19/01/2019", "20/01/2019", "21/01/2019", "22/01/2019",
                 "27/01/2019", "28/01/2019", "29/01/2019", "30/01/2019",
                 "31/01/2019", "1/02/2019", "2/02/2019", "3/02/2019", "4/02/2019"),
    sequence = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3,
                 4, 5, 6, 7, 8, 9),
         low = c(96, 81, 43, 18, 43, 65, 48, 90, 69, 1, 7, 49, 16, 79, 2, 74,
                 8, 88, 79, 51, 52, 47, 42, 9, 41, 9, 50),
       group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
                 3, 3, 3, 3, 3, 3)) %>% mutate(date = dmy(date))
    

Final output I am after

  df_final <- data.frame(stringsAsFactors=FALSE,
         date = c("4/01/2019", "14/01/2019", "3/02/2019"),
          low = c(18, 1, 9)) %>% mutate(date = dmy(date))

Any ideas?
Ps. I ran into some problems formatting this question, hence the untidiness.

jay.sf
  • 60,139
  • 8
  • 53
  • 110
cephalopod
  • 1,826
  • 22
  • 31

3 Answers3

2

We create the grouping variable by taking the cumulative sum where sequence is 1, then filter only the groups with 9 elements, and slice the rows where the 'low' is minimum after arrangeing the 'date' in descending order to take care of cases where there are ties for the 'low'est value

df %>% 
   group_by(group = cumsum(sequence == 1)) %>% 
   filter(n() == 9) %>% 
   select(date, low) %>%
   arrange(desc(date)) %>%
   slice(which.min(low)) %>%
   ungroup %>%
   select(-group)
# A tibble: 3 x 2
#  date         low
#  <date>     <dbl>
#1 2019-01-04    18
#2 2019-01-14     1
#3 2019-02-03     9

Or similar option with data.table

library(data.table)
setDT(df)[, .SD[.N == 9], .(group = cumsum(sequence == 1))
          ][order(-date), .SD[which.min(low)], group]
akrun
  • 874,273
  • 37
  • 540
  • 662
2

Another dplyr possibility could be:

df %>%
 group_by(group = cumsum(sequence == 1), rleid = with(rle(group), rep(seq_along(lengths), lengths))) %>%
 filter(all(c(1:9) %in% sequence)) %>%
 slice(which.min(rank(low, ties.method = "last"))) %>%
 ungroup() %>%
 select(-group, -rleid)

  date       sequence   low
  <date>        <dbl> <dbl>
1 2019-01-04        4    18
2 2019-01-14        1     1
3 2019-02-03        8     9

Here it, first, creates a cumulative sum of "sequence" == 1 and a rleid()-like variable based on the cumulative sum and then performs the grouping by the two. Second, it removes the cases where a sequence does not contain all of the nine values. Finally, it returns the minimum value per group, in the case of ties returning the last minimum value (you can modify it by the argument ties.method).

tmfmnk
  • 38,881
  • 4
  • 47
  • 67
  • 2
    I went with `Akrun` as his code is easier for me to understand. Wish I could upvote your answer more than once though! – cephalopod Jun 24 '19 at 06:08
1

This is also possible in base R. Might be a little mapsy, though.

w <- which(df$sequence == 1)
w <- w[sapply(w, function(x) df$sequence[x + 8] == 9 & sum(df$sequence[x:(x + 8)]) == 45)]
do.call(rbind, Map(function(x) x[which.min(x$low), ], 
                   Map(function(s) df[s, ], Map(seq, w, l=9))))
#          date sequence low
# 4  2019-01-04        4  18
# 14 2019-01-14        1   1
# 32 2019-02-01        6   9

The trick is to find the completed sequences and group them in a list, then rbind the which.min of each group. The sum(.) == 45 check should take account if there actually isn't a false sequence.

Data

df <- structure(list(date = structure(c(17897, 17898, 17899, 17900, 
17901, 17902, 17903, 17904, 17905, 17906, 17907, 17908, 17909, 
17910, 17911, 17912, 17913, 17914, 17915, 17916, 17917, 17918, 
17919, 17920, 17921, 17922, 17923, 17924, 17925, 17926, 17927, 
17928, 17929, 17930, 17931), class = "Date"), sequence = c(1, 
2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9), low = c(96, 81, 43, 18, 
43, 65, 48, 90, 69, 50, 41, 73, 1, 1, 7, 49, 16, 79, 2, 74, 8, 
88, 56, 57, 66, 29, 79, 51, 52, 47, 42, 9, 41, 9, 50)), row.names = c(NA, 
-35L), class = "data.frame")
jay.sf
  • 60,139
  • 8
  • 53
  • 110