I have a data.table of about 1.5M rows and hundreds of columns holding dated horse racing results. It is to be used for a predictive model, but first some feature engineering is necessary: for every race I need the strike rate of various entities as a prior record, calculated only from the days before that race.
"Strike rate" can be defined in various ways, but a simple one is the ratio of wins to times run for any given horse, trainer, jockey etc. Of course this must take into account all previous runs and wins, but not include the results from "today" since this would be nonsense for building a model.
A simplified data structure, adapted from some examples online, will suffice to explain.
Generate data as follows:
library(data.table)

n <- 90
dt <- data.table(
  date = rep(seq(as.Date('2010-01-01'), as.Date('2015-01-01'), by = 'year'), n/6),
  finish = rep(1:5, n/5),  # finish positions 1..5, repeated to fill n rows
  trainer = sort(rep(letters[1:5], n/5))
)
Imagine on these dates each trainer has a runner whose finish position in a race is represented by "finish". For a new date in the sequence (but not in this data), the ratio of times won so far could be calculated thus:
dt[order(trainer, date), .(strike_rate = sum(finish==1)/.N), by=trainer]
However, the resulting strike_rate variable shown for each trainer would only be valid for a new date in the sequence that is not in this dataset, say '2015-01-02', or our out of sample set.
To build the model, we need strike rates in line for each day and each trainer (and many other entities, but let's stick with trainer for now).
I've played around with the shift function and data.table constructs but cannot get them to work for this particular problem. In a loop context it works fine, though it is incredibly slow.
To illustrate the required output, this example code (though I am sure it is not elegant!) works fine:
# order dates from most recent to oldest so that the loop works backwards in time
dt <- dt[order(-date)]
# unique dates as character: looping over a Date vector directly drops the Date class
dates <- as.character(unique(dt$date))
for (d in dates) {
  # unique trainers running on this date
  trainers <- unique(dt$trainer[dt$date == d])
  for (t in trainers) {
    # all of this trainer's runs strictly before this date
    trainer_past_form <- dt[trainer == t & date < d]
    strike_rate <- sum(trainer_past_form$finish == 1) / nrow(trainer_past_form)
    # save this strike rate for this day and this trainer
    dt$strike_rate[dt$trainer == t & dt$date == d] <- strike_rate
  }
}
And gives the desired output:
date finish trainer strike_rate
1: 2015-01-01 1 a 0.2000000
2: 2015-01-01 2 a 0.2000000
3: 2015-01-01 3 a 0.2000000
4: 2015-01-01 4 b 0.2000000
5: 2015-01-01 5 b 0.2000000
6: 2015-01-01 1 b 0.2000000
7: 2015-01-01 2 c 0.2000000
8: 2015-01-01 3 c 0.2000000
9: 2015-01-01 4 c 0.2000000
10: 2015-01-01 5 d 0.2000000
11: 2015-01-01 1 d 0.2000000
12: 2015-01-01 2 d 0.2000000
13: 2015-01-01 3 e 0.2000000
14: 2015-01-01 4 e 0.2000000
15: 2015-01-01 5 e 0.2000000
16: 2014-01-01 5 a 0.1666667
17: 2014-01-01 1 a 0.1666667
18: 2014-01-01 2 a 0.1666667
19: 2014-01-01 3 b 0.2500000
20: 2014-01-01 4 b 0.2500000
21: 2014-01-01 5 b 0.2500000
22: 2014-01-01 1 c 0.1666667
23: 2014-01-01 2 c 0.1666667
24: 2014-01-01 3 c 0.1666667
25: 2014-01-01 4 d 0.1666667
26: 2014-01-01 5 d 0.1666667
27: 2014-01-01 1 d 0.1666667
28: 2014-01-01 2 e 0.2500000
29: 2014-01-01 3 e 0.2500000
30: 2014-01-01 4 e 0.2500000
31: 2013-01-01 4 a 0.1111111
32: 2013-01-01 5 a 0.1111111
33: 2013-01-01 1 a 0.1111111
34: 2013-01-01 2 b 0.3333333
35: 2013-01-01 3 b 0.3333333
36: 2013-01-01 4 b 0.3333333
37: 2013-01-01 5 c 0.1111111
38: 2013-01-01 1 c 0.1111111
39: 2013-01-01 2 c 0.1111111
40: 2013-01-01 3 d 0.2222222
41: 2013-01-01 4 d 0.2222222
42: 2013-01-01 5 d 0.2222222
43: 2013-01-01 1 e 0.2222222
44: 2013-01-01 2 e 0.2222222
45: 2013-01-01 3 e 0.2222222
46: 2012-01-01 3 a 0.1666667
47: 2012-01-01 4 a 0.1666667
48: 2012-01-01 5 a 0.1666667
49: 2012-01-01 1 b 0.3333333
50: 2012-01-01 2 b 0.3333333
51: 2012-01-01 3 b 0.3333333
52: 2012-01-01 4 c 0.0000000
53: 2012-01-01 5 c 0.0000000
54: 2012-01-01 1 c 0.0000000
55: 2012-01-01 2 d 0.3333333
56: 2012-01-01 3 d 0.3333333
57: 2012-01-01 4 d 0.3333333
58: 2012-01-01 5 e 0.1666667
59: 2012-01-01 1 e 0.1666667
60: 2012-01-01 2 e 0.1666667
61: 2011-01-01 2 a 0.3333333
62: 2011-01-01 3 a 0.3333333
63: 2011-01-01 4 a 0.3333333
64: 2011-01-01 5 b 0.3333333
65: 2011-01-01 1 b 0.3333333
66: 2011-01-01 2 b 0.3333333
67: 2011-01-01 3 c 0.0000000
68: 2011-01-01 4 c 0.0000000
69: 2011-01-01 5 c 0.0000000
70: 2011-01-01 1 d 0.3333333
71: 2011-01-01 2 d 0.3333333
72: 2011-01-01 3 d 0.3333333
73: 2011-01-01 4 e 0.0000000
74: 2011-01-01 5 e 0.0000000
75: 2011-01-01 1 e 0.0000000
76: 2010-01-01 1 a NaN
77: 2010-01-01 2 a NaN
78: 2010-01-01 3 a NaN
79: 2010-01-01 4 b NaN
80: 2010-01-01 5 b NaN
81: 2010-01-01 1 b NaN
82: 2010-01-01 2 c NaN
83: 2010-01-01 3 c NaN
84: 2010-01-01 4 c NaN
85: 2010-01-01 5 d NaN
86: 2010-01-01 1 d NaN
87: 2010-01-01 2 d NaN
88: 2010-01-01 3 e NaN
89: 2010-01-01 4 e NaN
90: 2010-01-01 5 e NaN
Any help on doing this "properly" in data.table would be much appreciated. As can be seen, I have started to use the library but am hitting a roadblock on this type of problem. I understand the logic of the loop, but it's just not efficient on 1.5M rows with lots of this type of calculation to do across all variables.
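
For reference, the shape of a non-loop version might look something like the sketch below. It is only a sketch under assumptions: it rebuilds the toy data above, and the names daily, prior_runs and prior_wins are made up for illustration. The idea is to collapse to one row per trainer and day, take running totals per trainer, subtract that day's own runs and wins so "today" is excluded, and then join the result back onto the full table.

```r
library(data.table)

# Same toy data as above (recycling of finish written out explicitly).
n <- 90
dt <- data.table(
  date    = rep(seq(as.Date("2010-01-01"), as.Date("2015-01-01"), by = "year"), n / 6),
  finish  = rep(1:5, n / 5),
  trainer = sort(rep(letters[1:5], n / 5))
)

# 1. One row per trainer and day: runs and wins on that day.
#    keyby sorts by trainer, then date ascending. (daily is a hypothetical name.)
daily <- dt[, .(runs = .N, wins = sum(finish == 1)), keyby = .(trainer, date)]

# 2. Running totals per trainer minus the current day's own values
#    give the record strictly before that day.
daily[, `:=`(prior_runs = cumsum(runs) - runs,
             prior_wins = cumsum(wins) - wins), by = trainer]

# 0/0 on a trainer's first day yields NaN, matching the loop's output.
daily[, strike_rate := prior_wins / prior_runs]

# 3. Update join: copy the per-day strike rate onto every matching row of dt.
dt[daily, strike_rate := i.strike_rate, on = .(trainer, date)]

# Show in the same order as the loop output (most recent date first).
print(dt[order(-date)])
```

On the toy data this should reproduce the strike_rate column above. Presumably the same pattern would apply to jockey, horse and so on by swapping the grouping column, but I have not checked it on the real 1.5M-row table.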