
I am working with tennis ranking data in R that tracks the evolution of rankings over time for all players on the ATP tour.

An example of the data I am using can be found here, giving the rankings data from the 2000s: https://github.com/JeffSackmann/tennis_atp/blob/master/atp_rankings_00s.csv

To clean up the data:

library(readr)
library(dplyr)

rankings <- read_csv("data/atp/atp_rankings_00s.csv") %>%
  mutate(rankingDate = lubridate::ymd(ranking_date)) %>%  # parse dates stored as integers like 20000110
  select(-ranking_date)

Now, suppose I wish to trace the time evolution of each player over the entire decade, and calculate their mean ranking during this period. Then I can write:

rankings %>%
  group_by(player) %>%
  summarise(meanRanking = mean(rank, na.rm = TRUE))

However, suppose I want something more: I want to slice this data along the time axis and calculate the mean ranking within each slice. For example, with something like start = 01-01-2000, end = 01-01-2008, skip = 2 years, I would get mean rankings over 2-year time windows for the period from 1 Jan 2000 to 1 Jan 2008. How would one code such 'time slicing' in R?

ap21

3 Answers


Here's how I would get two-year slices of the data:

rankings %>%
  mutate(floor = 2 * floor(lubridate::year(rankingDate) / 2),  # round year down to the nearest even year
         slice = paste(floor, floor + 1, sep = '-')) %>%       # label each window, e.g. "2000-2001"
  select(-floor) %>%
  group_by(player, slice) %>%
  summarize(average_ranking = mean(rank)) 
#> # A tibble: 13,537 x 3
#> # Groups:   player [5,711]
#>    player slice     average_ranking
#>     <int> <chr>               <dbl>
#>  1 100149 2000-2001           1332 
#>  2 100149 2004-2005           1052 
#>  3 100183 2004-2005           1477 
#>  4 100236 2000-2001           1106.
#>  5 100236 2002-2003            780 
#>  6 100236 2004-2005            779.
#>  7 100428 2000-2001            939.
#>  8 100428 2002-2003            976 
#>  9 100474 2000-2001           1298.
#> 10 100474 2002-2003            981 
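
As a more parameterizable variant of the same idea (a sketch, assuming the cleaned rankings from the question), lubridate::floor_date() accepts multi-unit periods such as "2 years", so the window width and the start/end bounds can all be passed as values:

rankings %>%
  filter(rankingDate >= as.Date("2000-01-01"),
         rankingDate <  as.Date("2008-01-01")) %>%                   # restrict to the requested span
  mutate(slice = lubridate::floor_date(rankingDate, "2 years")) %>%  # window start date as the label
  group_by(player, slice) %>%
  summarize(average_ranking = mean(rank, na.rm = TRUE), .groups = "drop")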

We can use these two-year averages to follow the trajectories of players who averaged inside the top 5 in any two-year period, and watch the stars rising into and dropping out of contention.

rankings %>%
  mutate(floor = 2 * floor(lubridate::year(rankingDate) / 2),
         slice = paste(floor, floor + 1, sep = '-')) %>%
  select(-floor) %>%
  group_by(player, slice) %>%
  summarize(average_ranking = mean(rank)) %>%
  filter(any(average_ranking < 5)) %>%
  ggplot(aes(slice, average_ranking, group = player, color = factor(player))) + 
  geom_line(size = 1) +
  coord_cartesian(ylim = c(0, 100)) +
  theme_light(base_size = 16) +
  theme(legend.position = 'none')

[Figure: line plot of average_ranking by two-year slice, one line per player who averaged inside the top 5 in any window; y-axis truncated at 100]

Allan Cameron

Your data are reasonably large, and data.table can help quite a bit with speed. Here is a fast approach built around a flexible function f(s, e, p, u): you pass a start date (s), an end date (e), an integer period p (e.g. 2 for 2 years), and a time unit u, which takes the values "y", "m", or "d" for years, months, or days respectively.

library(data.table)
library(lubridate)  # for years(), months(), days() and %m+%

f <- function(s, e, p, u = c("y", "m", "d")) {
  u  <- match.arg(u)
  uf <- list("y" = years, "m" = months, "d" = days)  # lubridate period constructors
  # one row per window: start s, end e = s + p units, and a window index 'period'
  data.table(s = seq(as.Date(s), as.Date(e), by = paste(p, u)))[
    , `:=`(e = s %m+% uf[[u]](p), period = 1:.N)]
}
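
As a quick illustration (not part of the original answer; print formatting approximate), the two-year call used below yields one row per window:

f("2000-01-01", "2008-01-01", 2)
            s          e period
1: 2000-01-01 2002-01-01      1
2: 2002-01-01 2004-01-01      2
3: 2004-01-01 2006-01-01      3
4: 2006-01-01 2008-01-01      4
5: 2008-01-01 2010-01-01      5

Note that seq() includes the end date as a window start, so the final window extends past 2008-01-01.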

Then you just apply the function to rankings in a non-equi join:

setDT(rankings)  # convert the data frame to a data.table by reference
rankings[f("2000-01-01", "2008-01-01", 2), on = .(rankingDate >= s, rankingDate <= e)] %>% 
  .[, .(ranking = mean(rank, na.rm = TRUE)), by = .(player, period)]

Output:

       player period     ranking
    1: 101736      1    2.769231
    2: 102338      1    5.211538
    3: 101948      1    4.730769
    4: 103017      1   23.259615
    5: 102856      1    2.538462
   ---                          
13543: 105996      5 1780.500000
13544: 105050      5 1665.333333
13545: 105757      5 1781.000000
13546: 121555      5 1328.500000
13547: 106018      5 1508.000000

For 3-month windows, just call f with p = 3 and u = "m":

rankings[f("2000-01-01", "2008-01-01", 3, "m"), on = .(rankingDate >= s, rankingDate <= e)] %>% 
  .[, .(ranking = mean(rank, na.rm = TRUE)), by = .(player, period)]

Output (for 3-month windows):

       player period     ranking
    1: 101736      1    1.000000
    2: 102338      1    2.666667
    3: 101948      1    2.333333
    4: 103017      1    4.000000
    5: 102856      1    5.500000
   ---                          
62491: 105517     33 1502.000000
62492: 104405     33 1588.000000
62493: 104576     33 1588.000000
62494: 105500     33 1679.000000
62495: 108698     33 1844.000000
langtang
  • Thank you, speed is indeed of concern to me. Can you please explain what you mean by a `non-equi join`? I have never heard this term or seen this syntax before. – ap21 May 18 '22 at 12:37
  • It's a join that uses non-equality operations, similar to SQL `select * from a join b on b.val …` – langtang May 18 '22 at 12:40
  • In my solution above, I'm joining `dt` with the result from `f`, using a non-equi join (i.e. `ranking_date` >= the `s` returned from `f()` and `ranking_date` <= the `e` returned from `f()`). – langtang May 18 '22 at 12:46
  • I am sorry for the delay. I tried running the code today, and I keep getting errors. If I run your code as is (I don't know what `dt` is), then I get the error: ```Error in .(ranking_date >= s, ranking_date <= e) : could not find function "."``` – ap21 May 24 '22 at 22:41
  • If I replace `dt` by `rankings`, which is the variable for my dataframe, I get a different error: ```Must subset rows with a valid subscript vector. Subscript `f("2000-01-01", "2008-01-01", 3, "m")` has the wrong type `data.table<``` Please advise. – ap21 May 24 '22 at 22:43
  • Try `setDT(rankings)` to set your rankings data frame as a data.table; see the edit. – langtang May 24 '22 at 22:55
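
For readers unfamiliar with the term, here is a minimal standalone sketch of a non-equi join (toy tables with hypothetical names; not part of the original answer):

library(data.table)

# Toy data: four dated events and two labelled windows
events  <- data.table(id  = 1:4,
                      day = as.Date(c("2000-01-05", "2000-03-10",
                                      "2000-06-01", "2000-09-15")))
windows <- data.table(ws     = as.Date(c("2000-01-01", "2000-07-01")),
                      we     = as.Date(c("2000-06-30", "2000-12-31")),
                      period = 1:2)

# The join condition uses inequalities rather than equality: each window row
# pulls in every event whose day falls inside [ws, we]
events[windows, on = .(day >= ws, day <= we)]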

Try the rolling functions from the zoo package:

library(zoo)
set.seed(2137)
my_data <- rnorm(100)
zoo::rollmean(my_data, 10)  # mean over a sliding window of 10 observations

Depending on the data types and the time frequency of your data, it could look like this:

rankings %>%
  group_by(player) %>%
  reframe(  # reframe() permits more than one row per group (dplyr >= 1.1)
    meanRanking = zoo::rollapply(rank, width = 2, FUN = mean, na.rm = TRUE)
  )
SlavicDoomer
  • I don't think the OP is describing a rolling mean, but a mean within each 2-year window. Though a rolling mean might be a better option. – Allan Cameron May 17 '22 at 20:43