I (new to PyMC3) want to extend the model proposed in the PyMC3 example A Hierarchical model for Rugby prediction by making the latent variables for attack and defence strength autoregressive. I am unsure about how to use 2-dimensional data and the shape parameter of the GaussianRandomWalk
class (explained below).
Edit 1: I did not find explicit documentation of multi-dimensional usage, but I found this comment by fonnesbeck among the PyMC3 github issues:
[...] I think most people would expect a vector of variables, which implies that the first dimension is the number of variable elements and the remaining dimension(s) the size of each variable.
As defined below, I use the time index as my 1st dimension. I tried to switch axis, which yields the same result. So, my current model is:
with pm.Model() as model:
home = pm.Normal('home', 0, .0001)
intercept = pm.Normal('intercept', 0, .0001)
tau_att = pm.Exponential('tau_att', 1./.02)
tau_def = pm.Exponential('tau_def', 1./.02)
atts = pm.GaussianRandomWalk('atts', tau_att**-2, shape=[T, num_teams])
defs = pm.GaussianRandomWalk('defs', tau_def**-2, shape=[T, num_teams])
home_theta = tt.exp(intercept + home + atts[:, home_team] + defs[:, away_team])
away_theta = tt.exp(intercept + atts[:, away_team] + defs[:, home_team])
home_points = pm.Poisson('home_points', mu=home_theta, observed=observed_home_goals)
away_points = pm.Poisson('away_points', mu=away_theta, observed=observed_away_goals)
The input is a two dimensional array, with rows being the time steps and columns containing the home or away goals for all teams that played in that time step. Assuming the following mock data
home_score away_score home_team away_team i_home i_away t
0 1 0 Arsenal Liverpool 0 1 1
1 1 1 Liverpool Burnley 1 2 1
2 2 4 Burnley Arsenal 2 0 1
3 0 3 Liverpool Arsenal 1 0 2
4 1 1 Burnley Liverpool 2 1 2
5 5 0 Arsenal Burnley 0 2 2
observed_home_goals
(similar to observed_away_goals
) would look like this:
[[1 1 2]
[0 1 5]]
and the corresponding team index would look like this:
[[0 1 2]
[1 2 0]]
meaning that in time step 1, teams [0 1 2]
shot [1 1 2]
goals respectively.
Fitting the model does not throw errors, sampling however yields zero estimates for all parameters. I tried to browse the distributions/timeseries.py source code about how the shape parameter would be used for multiple dimensions in the GaussianRandomWalk
class.
My question is, if the model definition would actually work as intended for the 2-dimensional time series data. I am not sure, if I index the atts
and defs
variables correctly.
Edit 2: I ended up building the time series manually, similar to Javier's solution, which seems to work fine!