Create a data frame with lagged values as numeric vector using dplyr

Question

My data looks as follows

data <- data.frame(
  A = c(10,20,30,40,50,60,70,80,90,100),
  B = c(110,120,130,140,150,160,170,180,190,200)
)

I wish to create a new column where row-wise I take each value from column A and concatenate it with the corresponding value from B as well as the lagged 3 values from column B and stores it in a numeric vector.

The final data will look as follows:

    new_data <- data.frame(
  A = c(10,20,30,40,50,60,70,80,90,100),
  B = c(110,120,130,140,150,160,170,180,190,200),
  c = I(list(c(10,110,NA,NA,NA),c(20,120,110,NA,NA),c(30,130,120,110,NA),c(40,140,130,120,110),c(50,150,140,130,120),c(60,160,150,140,130),c(70,170,160,150,140),c(80,180,170,160,150),c(90,190,180,170,160),c(100,200,190,180,170))
))

In my real data, I need to capture up to 200 lagged values so manually typing list(A, lag(B),lag(B,2),lag(B,3).... is not a road I want to go down.

I've tried using purrr::map to achieve my results but I can't get the lagged values into my vector

data %>%
  rowwise() %>% 
  mutate(z = list(c(A,B,map_vec(1:3, ~lag(B,.x, default = NA)))))

The ultimate goal is to calculate the percentage_rank of each value of column A relative to the current row + 3 previous values from column B to identify whether each point of A is an outlier i.e. > 75th percentile.

My approach is to create a numeric vector of values in column C that contains the current value of A and the 4 values from B, apply dplyr::percent_rank to each vector and pluck the percent rank for each index = 1 value (column A value) to get me the percent rank of value A.

Any guidance you can provide would be greatly appreciated.

Andre Wildberg · Answer 1 · 2023-06-21T14:34:15.890

3

EDIT: new data and variable lag-length

Set <- 3 # For e.g. all rows use "Set <- (nrow(data)-1)"

data %>% 
  mutate(setNames(data.frame(sapply(1:!!Set, \(x) 
    lag(B, x))), paste0("b", 1:!!Set))) %>% 
  rowwise() %>% 
  mutate(C = list(c_across(everything()))) %>% 
  select(-matches("b\\d+")) %>% 
  data.frame()
     A   B                       C
1   10 110     10, 110, NA, NA, NA
2   20 120    20, 120, 110, NA, NA
3   30 130   30, 130, 120, 110, NA
4   40 140  40, 140, 130, 120, 110
5   50 150  50, 150, 140, 130, 120
6   60 160  60, 160, 150, 140, 130
7   70 170  70, 170, 160, 150, 140
8   80 180  80, 180, 170, 160, 150
9   90 190  90, 190, 180, 170, 160
10 100 200 100, 200, 190, 180, 170

Using a pre-rowwise lag variable

library(dplyr)

df %>% 
  mutate(b = lag(B)) %>% 
  rowwise() %>% 
  mutate(C = list(c(A, B, b)), b = NULL) %>% 
  data.frame()
   A   B           C
1 10  60  10, 60, NA
2 20  70  20, 70, 60
3 30  80  30, 80, 70
4 40  90  40, 90, 80
5 50 100 50, 100, 90

edited Jun 21 '23 at 14:34

answered Jun 20 '23 at 22:34

Andre Wildberg

12,344
3
12
29

thanks for your help and answer, I'm very grateful. In reality I need to create column C with up to 200 lagged values and so that changes the approach used. I tried creating the reprex to give the overview but never figured your elegant solution would break my logic so easily. Apologies for any wasted time. – TheGoat Jun 21 '23 at 07:43
@TheGoat No worries. So do you need 3 lagged values like in the new example or ultimately a variable amount of lagged values (e.g. number of rows)? – Andre Wildberg Jun 21 '23 at 13:21
ideally I'd like to set the number of lags as a parameter to determine the number of rows to include in the lookback period. Thanks – TheGoat Jun 21 '23 at 13:35
@TheGoat See edit, now handles variable amount of lags. – Andre Wildberg Jun 21 '23 at 14:36
Wildberg thanks again for looking at this. I've tried the code above but getting the following error: "Error in quos(..., .ignore_empty = "all") : object 'Set' not found". I've updated all my packages just in case but I'm still getting the same error, any suggestions? Thanks – TheGoat Jun 21 '23 at 15:13
1

@TheGoat Did you forget the first line `Set <- 3`? – Andre Wildberg Jun 21 '23 at 15:28

score 0 · Answer 2 · answered Jun 21 '23 at 18:51

With base R, you can try embed like below

L <- 3
transform(
    data,
    C = asplit(cbind(A, B, embed(c(rep(NA, L), head(B, -1)), L)), 1)
)

and you will obtain

     A   B                       C
1   10 110     10, 110, NA, NA, NA
2   20 120    20, 120, 110, NA, NA
3   30 130   30, 130, 120, 110, NA
4   40 140  40, 140, 130, 120, 110
5   50 150  50, 150, 140, 130, 120
6   60 160  60, 160, 150, 140, 130
7   70 170  70, 170, 160, 150, 140
8   80 180  80, 180, 170, 160, 150
9   90 190  90, 190, 180, 170, 160
10 100 200 100, 200, 190, 180, 170

Create a data frame with lagged values as numeric vector using dplyr

2 Answers2

Linked