1

I've tried doing this using a combo of Dplyr filter and lag, but it isn't working. This stack overflow answer works for individual inputs, but not for an input dataframe as far as I can tell.

I have a dataframe of stock names, prices and dates. I would like to be able to input a second dataframe of stock names and dates and return those observations from the first dataframe, plus the N above, let's say N=1 for example. The observation dates are uneven so relying on it being N days beforehand won't work.

So for example if I have this data:

stock.data <- data.frame(
   stock_name = c("Walmart","Walmart","Walmart","Target","Target","Target"),
   price = c(100,101,102,201,202,203), 
   date = as.Date(c("2012-01-01", "2012-03-01", "2012-04-01", "2012-01-01",
      "2012-03-01","2012-04-01"))
   )

And in the other data frame, I have

other <- data.frame(
  stock_name = c("Walmart", "Target"), 
  date = as.Date(c("2012-03-01", "2012-04-01"))
)
N <- 1

I would like to get the rows with prices of 100, 101, 202 and 203.

Hopefully this makes sense and I'm happy to answer further questions.

Cornelius
  • 13
  • 3
  • For Walmart, shouldn't it be only the first row, i.e. all cases before 2012-03-01, which is only the first row? – deschen Dec 09 '20 at 19:13
  • I think the typo is in the other direction - for `Target` the 203 row's date is in the other data frame, so I think the result should have the 202 and 203 rows, but not the 201 row. – Gregor Thomas Dec 09 '20 at 19:16
  • @GregorThomas interpretation is correct, apologies have edited to be clearer – Cornelius Dec 10 '20 at 01:58

2 Answers2

1

If you are looking for N = 1 or N = 2, I would do it like this:

library(dplyr)
stock.data %>% left_join(mutate(other, in_other = TRUE)) %>%
  filter(in_other | lead(in_other, 1))
# Joining, by = c("stock_name", "date")
#   stock_name price       date in_other
# 1    Walmart   100 2012-01-01       NA
# 2    Walmart   101 2012-03-01     TRUE
# 3     Target   202 2012-03-01       NA
# 4     Target   203 2012-04-01     TRUE

But this solution doesn't scale well to additional values of N.


Using this "other" data frame:

other = tribble(
  ~stock_name, ~date,
  "Walmart", "2012-03-01",
  "Target", "2012-04-01"
) %>% mutate(date = as.Date(date))
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
1

I'll use a function I wrote in a different answer, https://stackoverflow.com/a/58716950/3358272, called leadlag. The premise for that function is similar to lead or lag (in dplyr-speak) but it has a cumulative effect.

Up front: I'm assuming that this "N prior" is per-group (per stock_name), not generally throughout all stock names.

For this data, I'll add a unique id to each row and find the rows to keep:

stock.data$rn <- seq_len(nrow(stock.data))
rownums <- merge(stock.data, other_data)$rn

From there, let's lead/lag the filtering:

stock.data %>%
  group_by(stock_name) %>%
  filter(leadlag(rn %in% rownums, bef=1, aft=0)) %>%
  ungroup()
# # A tibble: 4 x 4
#   stock_name price date          rn
#   <chr>      <dbl> <date>     <int>
# 1 Walmart      100 2012-01-01     1
# 2 Walmart      101 2012-03-01     2
# 3 Target       202 2012-03-01     5
# 4 Target       203 2012-04-01     6

and if you wanted N=2 before, then

stock.data %>%
  group_by(stock_name) %>%
  filter(leadlag(rn %in% rownums, bef=2, aft=0)) %>%
  ungroup()
# # A tibble: 5 x 4
#   stock_name price date          rn
#   <chr>      <dbl> <date>     <int>
# 1 Walmart      100 2012-01-01     1
# 2 Walmart      101 2012-03-01     2
# 3 Target       201 2012-01-01     4
# 4 Target       202 2012-03-01     5
# 5 Target       203 2012-04-01     6

Data

stock.data <- data.frame(
  stock_name = c("Walmart","Walmart","Walmart","Target","Target","Target"),
  price = c(100,101,102,201,202,203), 
  date = as.Date(c("2012-01-01", "2012-03-01", "2012-04-01", "2012-01-01",
                   "2012-03-01","2012-04-01"))
)
other_data <- data.frame(
  stock_name = c("Walmart", "Target"),
  date = as.Date(c("2012-03-01", "2012-04-01"))
)

A copy of the leadlag function defined in the other answer:

#' Lead/Lag a logical
#'
#' @param lgl logical vector
#' @param bef integer, number of elements to lead by
#' @param aft integer, number of elements to lag by
#' @return logical, same length as 'lgl'
#' @export
leadlag <- function(lgl, bef = 1, aft = 1) {
  n <- length(lgl)
  bef <- min(n, max(0, bef))
  aft <- min(n, max(0, aft))
  befx <- if (bef > 0) sapply(seq_len(bef), function(b) c(tail(lgl, n = -b), rep(FALSE, b)))
  aftx <- if (aft > 0) sapply(seq_len(aft), function(a) c(rep(FALSE, a), head(lgl, n = -a)))
  rowSums(cbind(befx, lgl, aftx), na.rm = TRUE) > 0
}
r2evans
  • 141,215
  • 6
  • 77
  • 149