0

I'm trying to create several new columns by calculating the previous n values of 2 columns.

Example using Auto. I made a column that calculates the average Horsepower per 1000lbs for the previous two model-years:

library(ISLR)
library(dplyr)    
LaggedAuto <- Auto %>% 
                  arrange(year) %>% 
                  group_by(name) %>% 
                  mutate(L2HPbyWT = 1000*((lag(horsepower)+lag(horsepower,2))/(lag(weight)+lag(weight,2))))

So the (weight, HP) of the 80,81,82 Corollas were (2265, 75), (2350, 75), (2245,70). The new column on the Corolla '82 row would therefore be 1000*(75+75)/(2265+2350) = 32.50. So it's more or less the average of HP/Weight, weighted by Weight.

In my own dataset there's about 12 columns I want to do this for with a lag of 5+. This means if I continue with the current technique, I'll be doing a lot of typing (lag(,1) + lag(,2) + ... + lag(,5)) for 11 more variables making it hard to update/tweak. Is there a way to perform this sort of lagged weighted calc on multiple variables for differing lags in an easy way?

I have a feeling I want to use zoo::rollaplyr based on a previous question of mine and another question I found (Mutate multiple / consecutive columns (with dplyr or base R)) but I'm having issues determining what the function would be. To keep it to this example, could you help me figure out how to also calculate the average of the last two instances of MPG, displacement, and acceleration by 1000lbs weight?

CoolGuyHasChillDay
  • 659
  • 1
  • 6
  • 21

1 Answers1

0

Firstly, please indicate the source of the data so we may use it. I found it under library(ISLR).

So, I hate the stats::lag function. Adding the lagged values from the lag function is faulty. I'll show you what I mean:

lag(Auto$horsepower[1:8], 1)
#[1] 130 165 150 150 140 198 220 215
#attr(,"tsp")
#[1] 0 7 1

lag(Auto$horsepower[1:8], 2)
#[1] 130 165 150 150 140 198 220 215
#attr(,"tsp")
#[1] -1  6  1

lag(Auto$horsepower[1:8], 1) + lag(Auto$horsepower[1:8], 2)
#[1] 260 330 300 300 280 396 440 430
#attr(,"tsp")
#[1] 0 7 1

This doesn't work. It gives you the means to subset your data correctly but is not really usable. What you really want is something like:

shift(Auto$horsepower[1:8], 1:2)
#     [,1] [,2]
#[1,]    0    0
#[2,]  130    0
#[3,]  165  130
#[4,]  150  165
#[5,]  150  150
#[6,]  140  150
#[7,]  198  140

rowSums(shift(Auto$horsepower[1:8], 1:2))
#[1]   0 130 295 315 300 290 338 418

This would be the perfect vectorized function to make someone's life easier for both leads and lags. So I wrote it below:

  shift <- function(x, i = 1, NA2zero = TRUE, naming = NULL){
  Z <- ifelse(NA2zero, 0, NA)
  L <- sapply(i, function(i){ "if"(i > 0, 
                                     c(rep(Z, max(abs(i))), 
                                       x[-c((length(x)+1-i):length(x))]),
                                     "if"(i < 0, 
                                             c(x[-c(1:abs(i))], 
                                               rep(Z, max(abs(i))) ),
                                             x))
              })
  "if"(is.null(naming), 
       colnames(L) <- paste0(deparse(substitute(x)),".",i),
       colnames(L) <- paste0(naming,".",i))
  return(L)
}

Now you can easily go ahead and fix your code with:

L2HPbyWT = 1000*((rowSums(shift(Auto$horsepower, i = 1:2)) / rowSums(shift(Auto$weight, i = 1:2))))

I even added a fun little naming ability:

head(shift(Auto$horsepower, 0:2, naming = "HP"),3)
     HP.0 HP.1 HP.2
[1,]  130    0    0
[2,]  165  130    0
[3,]  150  165  130

EDIT: it appears you never needed a lag function afterall! Now I can dive into your question. I never got into dplyr so this will be in base so forgive me. It appears you are lagging different years of cars. So if we take a look at the cars: dim(Auto) #[1] 392 9 # pretty big, a lot of rows.

# split them into groups by type of car
eachAuto <- split(Auto, Auto$name)
table(sapply(eachAuto, nrow))
#  0   1   2   3   4   5 # lengths
#  3 245  34  12   7   3 # counts

Now we see that there are 3 cars with 0 rows (there are some factor levels that have no data), 245 cars with only 1 row, 12 cars with 3 rows, and so on.

Using the name column seems to be a mistake here... unless we restrict the cars that we can use? How about:

MAXLAG <- 2
Autos_subset <- eachAuto[sapply(eachAuto, nrow) > (MAXLAG-1)]
newAuto <- lapply(Autos_subset, function(x) {
  x$L2HPbyWT <- 1000*((rowSums(shift(x$horsepower, i = 1:MAXLAG)) / rowSums(shift(x$weight, i = 1:MAXLAG))))
  x
  })

length(newAuto) # 56 car names in the list 

Now you are accurately only using the cars that are available for lagging. Let me know if I am missing the mark because I may very well be.

Evan Friedland
  • 3,062
  • 1
  • 11
  • 25
  • Thanks for the help! Sorry about `Auto`, I thought it was in base, didn't realize it was in ISLR. Also, the `lag` I'm using is out of dpylr (didn't even know there was a stats::lag), so `lag(Auto$horsepower[1:8], 1) + lag(Auto$horsepower[1:8], 2)` does result in `NA NA 295 315 300 290 338 418` as desired. I will still check out your code! – CoolGuyHasChillDay May 31 '17 at 16:27
  • Nice right on, thanks! Did you happen to read my previous comment? The `lag` I'm using is through `dplyr` not `stats` so it works in the same way yours does, but with my code I'm able to segment by model and year. Re: "not exactly sure ... the point of lagging over non-time series data", the time series is the model of the car (2 digit number in `year`). My original code is doing exactly what it should, I just need to do it for 15+ columns. Doing this manually is fine, but my original question was asking whether it could be repeated over several columns. – CoolGuyHasChillDay May 31 '17 at 18:20
  • Please see my edits. I'm gaining a small understanding but it is looking strange – Evan Friedland May 31 '17 at 18:57