How to subset N rows above a selected point in a 'tidy' dataframe

Question

I've tried doing this using a combo of Dplyr filter and lag, but it isn't working. This stack overflow answer works for individual inputs, but not for an input dataframe as far as I can tell.

I have a dataframe of stock names, prices and dates. I would like to be able to input a second dataframe of stock names and dates and return those observations from the first dataframe, plus the N above, let's say N=1 for example. The observation dates are uneven so relying on it being N days beforehand won't work.

So for example if I have this data:

stock.data <- data.frame(
   stock_name = c("Walmart","Walmart","Walmart","Target","Target","Target"),
   price = c(100,101,102,201,202,203), 
   date = as.Date(c("2012-01-01", "2012-03-01", "2012-04-01", "2012-01-01",
      "2012-03-01","2012-04-01"))
   )

And in the other data frame, I have

other <- data.frame(
  stock_name = c("Walmart", "Target"), 
  date = as.Date(c("2012-03-01", "2012-04-01"))
)
N <- 1

I would like to get the rows with prices of 100, 101, 202 and 203.

Hopefully this makes sense and I'm happy to answer further questions.

For Walmart, shouldn't it be only the first row, i.e. all cases before 2012-03-01, which is only the first row? — deschen, Dec 09 '20 at 19:13
I think the typo is in the other direction - for `Target` the 203 row's date is in the other data frame, so I think the result should have the 202 and 203 rows, but not the 201 row. — Gregor Thomas, Dec 09 '20 at 19:16
@GregorThomas interpretation is correct, apologies have edited to be clearer — Cornelius, Dec 10 '20 at 01:58

score 1 · Answer 1 · answered Dec 09 '20 at 19:18

If you are looking for N = 1 or N = 2, I would do it like this:

library(dplyr)
stock.data %>% left_join(mutate(other, in_other = TRUE)) %>%
  filter(in_other | lead(in_other, 1))
# Joining, by = c("stock_name", "date")
#   stock_name price       date in_other
# 1    Walmart   100 2012-01-01       NA
# 2    Walmart   101 2012-03-01     TRUE
# 3     Target   202 2012-03-01       NA
# 4     Target   203 2012-04-01     TRUE

But this solution doesn't scale well to additional values of N.

Using this "other" data frame:

other = tribble(
  ~stock_name, ~date,
  "Walmart", "2012-03-01",
  "Target", "2012-04-01"
) %>% mutate(date = as.Date(date))

score 1 · Accepted Answer · answered Dec 09 '20 at 19:22

I'll use a function I wrote in a different answer, https://stackoverflow.com/a/58716950/3358272, called leadlag. The premise for that function is similar to lead or lag (in dplyr-speak) but it has a cumulative effect.

Up front: I'm assuming that this "N prior" is per-group (per stock_name), not generally throughout all stock names.

For this data, I'll add a unique id to each row and find the rows to keep:

stock.data$rn <- seq_len(nrow(stock.data))
rownums <- merge(stock.data, other_data)$rn

From there, let's lead/lag the filtering:

stock.data %>%
  group_by(stock_name) %>%
  filter(leadlag(rn %in% rownums, bef=1, aft=0)) %>%
  ungroup()
# # A tibble: 4 x 4
#   stock_name price date          rn
#   <chr>      <dbl> <date>     <int>
# 1 Walmart      100 2012-01-01     1
# 2 Walmart      101 2012-03-01     2
# 3 Target       202 2012-03-01     5
# 4 Target       203 2012-04-01     6

and if you wanted N=2 before, then

stock.data %>%
  group_by(stock_name) %>%
  filter(leadlag(rn %in% rownums, bef=2, aft=0)) %>%
  ungroup()
# # A tibble: 5 x 4
#   stock_name price date          rn
#   <chr>      <dbl> <date>     <int>
# 1 Walmart      100 2012-01-01     1
# 2 Walmart      101 2012-03-01     2
# 3 Target       201 2012-01-01     4
# 4 Target       202 2012-03-01     5
# 5 Target       203 2012-04-01     6

Data

stock.data <- data.frame(
  stock_name = c("Walmart","Walmart","Walmart","Target","Target","Target"),
  price = c(100,101,102,201,202,203), 
  date = as.Date(c("2012-01-01", "2012-03-01", "2012-04-01", "2012-01-01",
                   "2012-03-01","2012-04-01"))
)
other_data <- data.frame(
  stock_name = c("Walmart", "Target"),
  date = as.Date(c("2012-03-01", "2012-04-01"))
)

A copy of the leadlag function defined in the other answer:

#' Lead/Lag a logical
#'
#' @param lgl logical vector
#' @param bef integer, number of elements to lead by
#' @param aft integer, number of elements to lag by
#' @return logical, same length as 'lgl'
#' @export
leadlag <- function(lgl, bef = 1, aft = 1) {
  n <- length(lgl)
  bef <- min(n, max(0, bef))
  aft <- min(n, max(0, aft))
  befx <- if (bef > 0) sapply(seq_len(bef), function(b) c(tail(lgl, n = -b), rep(FALSE, b)))
  aftx <- if (aft > 0) sapply(seq_len(aft), function(a) c(rep(FALSE, a), head(lgl, n = -a)))
  rowSums(cbind(befx, lgl, aftx), na.rm = TRUE) > 0
}

How to subset N rows above a selected point in a 'tidy' dataframe

2 Answers2