2

I'm working with NHL player performance data, and have a data frame with the following variables (among others). war_82 is a measure of player value over a full 82 game season. The data spans 11 seasons, from 2007-2008 to 2017-2018.

 first_name last_name season    war_82
   <chr>      <chr>     <chr>      <dbl>
 1 5EBASTIAN  AHO       2017-2018 -0.560
 2 AARON      DELL      2016-2017  7.50 
 3 AARON      DELL      2017-2018  1.61 
 4 AARON      DOWNEY    2007-2008 -0.560
 5 AARON      EKBLAD    2014-2015  0.350
 6 AARON      EKBLAD    2015-2016 -0.350
 7 AARON      EKBLAD    2016-2017 -1.39 
 8 AARON      EKBLAD    2017-2018 -0.320
 9 AARON      JOHNSON   2007-2008 -1.42 
10 AARON      JOHNSON   2008-2009 -1.19 

I'd like to reduce the season-to-season variability of the war_82 metric, and create a new variable that's a weighted war_82. Ideally I'd look at 3 seasons of data, and have season n (the current season) be the most heavily weighted, and seasons n-1 and n-2 (the two preceding seasons) be less heavily weighted as recency decreases. Let's say weights of 0.5, 0.3, and 0.2 for argument's sake.

UPDATE FOR CLARITY: I'm hoping to calculate a weighted moving average. For example; Sidney Crosby's 20172018_weighted_war would be be determined by 2017-2018, 2016-2017, and 2015-2016. His 20162017_weighted_war would be be determined by 2016-2017, 2015-2016, and 2014-2015. So on and so forth.

I have two main questions:

1) What method would you recommend for this? I've looked at weighted.mean(), but some players have played more than others, so I'm not sure how to specify the "w" (weights) argument. For example, Sidney Crosby played during all 11 seasons in my data-set, but many players only played during 1 or 2 seasons. I don't really want to throw out data for players who have played fewer than 3 seasons.

2) How would you determine the weights for each season? The simplest method is the one I've mentioned above, which was sort of inspired by the Marcel method (https://www.beyondtheboxscore.com/2016/2/22/11079186/projections-marcel-pecota-zips-steamer-explained-guide-math-is-fun). I suppose you could also determine how well seasons n-1 and n-2 predict season n, and use those as your weights?

How would you approach this problem? Any and all guidance is greatly appreciated!

Luke Steer
  • 107
  • 1
  • 8

3 Answers3

2

I have a similar answer to JasonAizkalns, but it's different enough that I think it may be worth posting.

You can fiddle with the weights for the seasons.

EDIT: Added 'rolling average'

data <- readr::read_table("
first_name last_name season    war_82
5EBASTIAN  AHO       2017-2018 -0.560
AARON      DELL      2016-2017  7.50 
AARON      DELL      2017-2018  1.61 
AARON      DOWNEY    2007-2008 -0.560
AARON      EKBLAD    2014-2015  0.350
AARON      EKBLAD    2015-2016 -0.350
AARON      EKBLAD    2016-2017 -1.39 
AARON      EKBLAD    2017-2018 -0.320
AARON      JOHNSON   2007-2008 -1.42 
AARON      JOHNSON   2008-2009 -1.19")

weigth_war <- function(last3_war) {
    player_season <- as.numeric(stringr::str_split_fixed(last3_war, " ", 3))
    if (is.na(player_season[2]))
        player_season[1]
    else if (is.na(player_season[3]))
        weighted.mean(player_season[1:2], c(0.3, 0.7))
    else
        weighted.mean(player_season, c(0.2, 0.3, 0.5))
}

library(tidyverse)
data %>%
    mutate(name = paste(first_name, last_name)) %>%
    group_by(name) %>%
    arrange(name, season) %>%
    mutate(last3_war = paste(war_82, lag(war_82), lag(war_82, 2))) %>%
    ungroup() %>%
    rowwise() %>%
    mutate(weighted_war_82 = weigth_war(last3_war)) %>%
    select(name, season, war_82, weighted_war_82)
csgroen
  • 2,511
  • 11
  • 28
  • 1
    I like this better than mine +1 – JasonAizkalns Feb 14 '19 at 19:30
  • I think, however, based on your arrange statement, you may want to flip the weights -- i.e., `c(0.3, 0.7)` and `c(0.2, 0.3, 0.5)` -- regardless, the OP will want to pay extra attention to the ordering and test. – JasonAizkalns Feb 14 '19 at 19:40
  • Hey @csgroen, thanks very much, this is super helpful and almost exactly solves my problem. I don't think I properly explained this, but I'm hoping to calculate a weighted moving average. For example; Sidney Crosby's `20172018_weighted_war` would be be determined by 2017-2018, 2016-2017, and 2015-2016. His `20162017_weighted_war` would be be determined by 2016-2017, 2015-2016, and 2014-2015. So on and so forth. – Luke Steer Feb 14 '19 at 19:46
  • @JasonAizkalns, you're right, I've edited to fix the weights. Luke, I see. Yes, this would only calculate for the latest season in the way it's coded now. I'm not immediately sure of how to modify it to calculate a rolling average. You wouldn't be able to slice and summarize, but mutate could be used with some modifications to the `weight_war` function. – csgroen Feb 14 '19 at 20:49
  • 1
    @LukeSteer, it's not the most elegant code in the world, but I think it works. I had to use this 'last3war' ugly solution for lag because it wasn't behaving how I wanted when passing directly. I also had to do it `rowwise()` to make it easier to write the `weigth_war` function. Hopefully this solves your problem! – csgroen Feb 15 '19 at 12:46
  • 1
    @csgroen - FWIW, I think `tibbletime::rollify` might have the potential to use your original function and avoid the `paste` / `str_split_fixed` clutter. Regardless, if you haven't seen that package, it's got some cool functionality worth exploring. I also think `rowwise` may eventually go away -- I've been trying to break that habit. – JasonAizkalns Feb 15 '19 at 14:22
1

I would recommend sticking to one question per post. A brute-force approach to your first question would be to explicitly express the weights based on the number of seasons:

library(tidyverse)

df <- tribble(
  ~player, ~season, ~y,
  "dell", 2017, 1,
  "dell", 2018, 5,
  "johnson", 2016, 2,
  "johnson", 2017, 4,
  "johnson", 2018, 5,
  "downey", 2014, 3,
  "downey", 2015, 5
)

df %>%
  group_by(player) %>%
  arrange(player, season) %>%
  add_count(player, name = "num_seasons") %>%
  mutate(
    wtd = case_when(
      num_seasons == 1 ~ sum(                                           1.000 * nth(y, -1) ),
      num_seasons == 2 ~ sum(                      0.375 * nth(y, -2) + 0.625 * nth(y, -1) ),
      num_seasons == 3 ~ sum( 0.200 * nth(y, -3) + 0.300 * nth(y, -2) + 0.500 * nth(y, -1) )
    )
  )
#> # A tibble: 7 x 5
#> # Groups:   player [3]
#>   player  season     y num_seasons   wtd
#>   <chr>    <dbl> <dbl>       <int> <dbl>
#> 1 dell      2017     1           2  3.5 
#> 2 dell      2018     5           2  3.5 
#> 3 downey    2014     3           2  4.25
#> 4 downey    2015     5           2  4.25
#> 5 johnson   2016     2           3  4.1 
#> 6 johnson   2017     4           3  4.1 
#> 7 johnson   2018     5           3  4.1
JasonAizkalns
  • 20,243
  • 8
  • 57
  • 116
  • Hey @JasonAizkalns, thanks very much; I also prefer csgroen's solution, but I really appreciate your feedback. I don't think I properly explained this, but I'm hoping to calculate a weighted moving average. For example; Sidney Crosby's 20172018_weighted_war would be be determined by 2017-2018, 2016-2017, and 2015-2016. His 20162017_weighted_war would be be determined by 2016-2017, 2015-2016, and 2014-2015. So on and so forth. – Luke Steer Feb 14 '19 at 20:01
  • I think you're going to have to use `lag` or `nth` -- in which case, an uglier solution (such as mine) might be necessary. You may want to re-ask the question and create a fully reproducible example that hits at the core issues. Consider keeping things simple like my `df` variable -- just the basics illustrating the issues that come up in the problem. Once you understand the approach/attack, I think it's easy to adopt to your specific question. Having all the "extras" upfront makes it more difficult to follow. Hope that makes sense? – JasonAizkalns Feb 14 '19 at 21:51
  • In your update, you mention `Sidney Crosby` but there's no `Sidney Crosby` in your provided example data. You might also have an issue with "gaps" -- what if a player participated in 2014, 2015, and 2017 -- i.e., they missed 2016 due to injury? Again, maybe a quick-n-dirty example illustrating the edge cases and your expected results might get you more traction. At that point, you may just want to close this question and ask a new one. – JasonAizkalns Feb 14 '19 at 21:54
0

You could use weighted.mean() as you suggest and weight by the number of seasons played out of 11 (1 season -> 0.091, 2 -> 0.18, etc).

TJ87
  • 404
  • 1
  • 3
  • 13