Grouping by sequence and then finding minimum value in a column

Question

I have a data set that amoungst other columns has date, sequence and low columns, please see df below. The sequence from 1-to-9 is considered one block or one full cycle in the sequence column The data set has several such complete block/cycles and partially completed ones, eg: 1-to-4

This is what I'm trying to solve:

Remove the partly completed cycles and then group the full cycles (see df1)
For each block/cycle (i.e. sequence from 1-to-9), I want to find the low of the block along with the day the low was happened.

If there are two lows with same value but on different dates, then it should only output the latest date (see the 3rd block in output)

library(lubridate)
library(tidyverse)
### Sample data
df <- data.frame(stringsAsFactors=FALSE,
date = c("1/01/2019", "2/01/2019", "3/01/2019", "4/01/2019",
"5/01/2019", "6/01/2019", "7/01/2019", "8/01/2019",
"9/01/2019", "10/01/2019", "11/01/2019", "12/01/2019", "13/01/2019",
"14/01/2019", "15/01/2019", "16/01/2019", "17/01/2019", "18/01/2019",
"19/01/2019", "20/01/2019", "21/01/2019", "22/01/2019",
"23/01/2019", "24/01/2019", "25/01/2019", "26/01/2019", "27/01/2019",
"28/01/2019", "29/01/2019", "30/01/2019", "31/01/2019",
"1/02/2019", "2/02/2019", "3/02/2019", "4/02/2019"),
sequence = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8,
9, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9),
low = c(96, 81, 43, 18, 43, 65, 48, 90, 69, 50, 41, 73, 1, 1, 7, 49,
16, 79, 2, 74, 8, 88, 56, 57, 66, 29, 79, 51, 52, 47, 42, 9,
41, 9, 50)) %>% mutate(date = dmy(date))

Data grouped by cycle/block

df1 <- data.frame(stringsAsFactors=FALSE,
    date = c("1/01/2019", "2/01/2019", "3/01/2019", "4/01/2019",
             "5/01/2019", "6/01/2019", "7/01/2019", "8/01/2019",
             "9/01/2019", "14/01/2019", "15/01/2019", "16/01/2019", "17/01/2019",
             "18/01/2019", "19/01/2019", "20/01/2019", "21/01/2019", "22/01/2019",
             "27/01/2019", "28/01/2019", "29/01/2019", "30/01/2019",
             "31/01/2019", "1/02/2019", "2/02/2019", "3/02/2019", "4/02/2019"),
sequence = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3,
             4, 5, 6, 7, 8, 9),
     low = c(96, 81, 43, 18, 43, 65, 48, 90, 69, 1, 7, 49, 16, 79, 2, 74,
             8, 88, 79, 51, 52, 47, 42, 9, 41, 9, 50),
   group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
             3, 3, 3, 3, 3, 3)) %>% mutate(date = dmy(date))

Final output I am after

  df_final <- data.frame(stringsAsFactors=FALSE,
         date = c("4/01/2019", "14/01/2019", "3/02/2019"),
          low = c(18, 1, 9)) %>% mutate(date = dmy(date))

Any ideas?
Ps. I ran into some problems formatting this question, hence the untidiness.

akrun · Accepted Answer · 2019-06-24T06:14:49.193

We create the grouping variable by taking the cumulative sum where sequence is 1, then filter only the groups with 9 elements, and slice the rows where the 'low' is minimum after arrangeing the 'date' in descending order to take care of cases where there are ties for the 'low'est value

df %>% 
   group_by(group = cumsum(sequence == 1)) %>% 
   filter(n() == 9) %>% 
   select(date, low) %>%
   arrange(desc(date)) %>%
   slice(which.min(low)) %>%
   ungroup %>%
   select(-group)
# A tibble: 3 x 2
#  date         low
#  <date>     <dbl>
#1 2019-01-04    18
#2 2019-01-14     1
#3 2019-02-03     9

Or similar option with data.table

library(data.table)
setDT(df)[, .SD[.N == 9], .(group = cumsum(sequence == 1))
          ][order(-date), .SD[which.min(low)], group]

@cephalopod Can you check the output – akrun Jun 24 '19 at 06:03 — akrun, Jun 24 '19 at 06:03
the last value 9 should have date of `2019-02-03` – cephalopod Jun 24 '19 at 06:04 — cephalopod, Jun 24 '19 at 06:04

tmfmnk · Answer 2 · 2019-06-24T06:13:29.940

Another dplyr possibility could be:

df %>%
 group_by(group = cumsum(sequence == 1), rleid = with(rle(group), rep(seq_along(lengths), lengths))) %>%
 filter(all(c(1:9) %in% sequence)) %>%
 slice(which.min(rank(low, ties.method = "last"))) %>%
 ungroup() %>%
 select(-group, -rleid)

  date       sequence   low
  <date>        <dbl> <dbl>
1 2019-01-04        4    18
2 2019-01-14        1     1
3 2019-02-03        8     9

Here it, first, creates a cumulative sum of "sequence" == 1 and a rleid()-like variable based on the cumulative sum and then performs the grouping by the two. Second, it removes the cases where a sequence does not contain all of the nine values. Finally, it returns the minimum value per group, in the case of ties returning the last minimum value (you can modify it by the argument ties.method).

I went with `Akrun` as his code is easier for me to understand. Wish I could upvote your answer more than once though! — cephalopod, Jun 24 '19 at 06:08

score 1 · Answer 3 · answered Jun 24 '19 at 10:01

This is also possible in base R. Might be a little mapsy, though.

w <- which(df$sequence == 1)
w <- w[sapply(w, function(x) df$sequence[x + 8] == 9 & sum(df$sequence[x:(x + 8)]) == 45)]
do.call(rbind, Map(function(x) x[which.min(x$low), ], 
                   Map(function(s) df[s, ], Map(seq, w, l=9))))
#          date sequence low
# 4  2019-01-04        4  18
# 14 2019-01-14        1   1
# 32 2019-02-01        6   9

The trick is to find the completed sequences and group them in a list, then rbind the which.min of each group. The sum(.) == 45 check should take account if there actually isn't a false sequence.

Data

df <- structure(list(date = structure(c(17897, 17898, 17899, 17900, 
17901, 17902, 17903, 17904, 17905, 17906, 17907, 17908, 17909, 
17910, 17911, 17912, 17913, 17914, 17915, 17916, 17917, 17918, 
17919, 17920, 17921, 17922, 17923, 17924, 17925, 17926, 17927, 
17928, 17929, 17930, 17931), class = "Date"), sequence = c(1, 
2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9), low = c(96, 81, 43, 18, 
43, 65, 48, 90, 69, 50, 41, 73, 1, 1, 7, 49, 16, 79, 2, 74, 8, 
88, 56, 57, 66, 29, 79, 51, 52, 47, 42, 9, 41, 9, 50)), row.names = c(NA, 
-35L), class = "data.frame")

Grouping by sequence and then finding minimum value in a column

Data grouped by cycle/block

3 Answers3