Bayesian Linear Regression with PyMC3 and a large dataset - bracket nesting level exceeded maximum and slow performance

Question

I would like to use a Bayesian multivariate linear regression to estimate the strength of players in team sports (e.g. ice hockey, basketball or soccer). For that purpose, I create a matrix, X, containing the players as columns and the matches as rows. For each match the player entry is either 1 (player plays in the home team), -1 (player plays in the away team) or 0 (player does not take part in this game). The dependent variable Y is defined as the scoring differences for both teams in each match (Score_home_team - Score_away_team).
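
For readers who want to reproduce the setup, here is a minimal sketch of how such a design matrix could be built. The roster format, player names, and scores are hypothetical and not taken from the original data; only the 1 / -1 / 0 encoding and the definition of Y follow the description above.

```python
import numpy as np

# Hypothetical rosters: home players, away players, and (home, away) score per match.
matches = [
    {"home": ["anna", "ben"], "away": ["carl", "dave"], "score": (3, 1)},
    {"home": ["carl", "anna"], "away": ["ben", "eve"], "score": (0, 2)},
]

# One column per player, in a fixed (sorted) order.
players = sorted({p for m in matches for p in m["home"] + m["away"]})
col = {name: j for j, name in enumerate(players)}

X = np.zeros((len(matches), len(players)), dtype=int)  # 0 = did not play
Y = np.zeros(len(matches), dtype=int)
for i, m in enumerate(matches):
    for p in m["home"]:
        X[i, col[p]] = 1    # player on the home team
    for p in m["away"]:
        X[i, col[p]] = -1   # player on the away team
    Y[i] = m["score"][0] - m["score"][1]  # Score_home_team - Score_away_team
```

Each row of X then describes one match and each column one player, which is the layout the model code further down expects.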

Thus, the number of parameters will be quite large for one season (e.g. X is defined by 300 rows x 450 columns; i.e. 450 player coefficients + y-intercept). When running the fit I came across a compilation error:

('Compilation failed (return status=1): /Users/me/.theano/compiledir_Darwin-17.7.0-x86_64-i386-64bit-i386-3.6.5-64/tmpdxxc2379/mod.cpp:27598:32: fatal error: bracket nesting level exceeded maximum of 256.

I tried to handle this error by setting:

theano.config.gcc.cxxflags = "-fbracket-depth=1024"

Now, the sampling is running. However, it is so slow that even if I take only 35 of 300 rows the sampling is not completed within 20 minutes.

This is my basic code:

import pymc3 as pm
basic_model = pm.Model()

with basic_model:

    # Priors for beta coefficients - these are the coefficients of the players
    dict_betas = {}
    for col in X.columns:
        dict_betas[col] = pm.Normal(col, mu=0, sd=10)

    # Priors for unknown model parameters
    alpha = pm.Normal('alpha', mu=0, sd=10) # alpha is the y-intercept
    sigma = pm.HalfNormal('sigma', sd=1) # standard deviation of the observations

    # Expected value of outcome
    mu = alpha
    for col in X.columns:
        mu = mu + dict_betas[col] * X[col] # mu = alpha + beta_1 * Player_1 + beta_2 * Player_2 + ...

    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)

The instantiation of the model runs within one minute for the large dataset. I do the sampling using:

with basic_model:

    # draw 500 posterior samples
    trace = pm.sample(500)

The sampling is completed for small sample sizes (e.g. 9 rows, 80 columns) within 7 minutes. However, the time is increasing substantially with increasing sample size.

Any suggestions on how I can get this Bayesian linear regression to run in a feasible amount of time? Is this kind of problem doable using PyMC3 (remember that I came across a bracket nesting error)? I saw in a recent publication that this kind of analysis is doable in R (https://arxiv.org/pdf/1810.08032.pdf). Therefore, I guess it should also somehow work with Python 3.

Any help is appreciated!

Perhaps try getting this into a dot product form instead of using `for` loops. Something like `beta = pm.Normal('beta', mu=0, sd=10, shape=X.shape[1])` and then `mu = alpha + pm.math.dot(X, beta)`. Maybe [this other answer might help](https://stackoverflow.com/a/53218462/570918), which also demonstrates how to augment `X` to include the intercept and avoid having a separate `alpha` variable. — merv, Dec 09 '18 at 04:01
Merv, thank you very much for your helpful comment. Your suggestion to use the dot product instead of the for loops/dictionary solved both the bracket nesting problem and the slow performance problem. The program runs fine with all betas set to mu = 0. However, now I don't know how to include a different mu for each player. Is there a way to give beta different mus and sigmas using the definition of beta you have suggested? In the end I would like to run the Bayesian linear regression with different priors for each player. — P. Rinter, Dec 09 '18 at 15:41

score 4 · Accepted Answer · answered Dec 09 '18 at 22:13

Eliminating the for loops should improve performance and might also take care of the nesting issue you are reporting. Theano TensorVariables and the PyMC3 random variables that derive from them are already multidimensional and support linear algebra operations. Try changing your code to something along the lines of

beta = pm.Normal('beta', mu=0, sd=10, shape=X.shape[1])
...
mu = alpha + pm.math.dot(X, beta)
...
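
As a sanity check that the two formulations describe the same linear predictor, here is a small NumPy-only sketch with made-up numbers (the sizes mirror the 300 x 450 case from the question; the values of `X`, `beta` and `alpha` are random placeholders, not model output). In the loop version every iteration wraps the previous expression in another addition, so the resulting expression is roughly 450 operations deep, which is plausibly what pushed the generated C code past the bracket nesting limit. The dot product is a single operation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_matches, n_players = 300, 450

# Placeholder data: entries in {-1, 0, 1}, arbitrary coefficients.
X = rng.integers(-1, 2, size=(n_matches, n_players))
beta = rng.normal(0, 10, size=n_players)
alpha = 0.5

# Loop form, as in the question: one extra addition per player column.
mu_loop = np.full(n_matches, alpha)
for j in range(n_players):
    mu_loop = mu_loop + beta[j] * X[:, j]

# Vectorized form, as suggested above: a single matrix-vector product.
mu_dot = alpha + X.dot(beta)

print(np.allclose(mu_loop, mu_dot))
```

In the PyMC3 model, `pm.math.dot(X, beta)` plays the role of `X.dot(beta)` here.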

If you need to specify different prior values for mu and/or sd, those arguments accept anything that theano.tensor.as_tensor_variable() accepts, so you can pass a list or NumPy array.
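
A minimal sketch of what per-player priors could look like, assuming the PyMC3 3.x API used elsewhere in this thread. The helper name `build_model`, the array names `prior_mu` and `prior_sd`, and the prior values are hypothetical. The arrays need one entry per column of `X`, in the same order as the columns.

```python
import numpy as np

def build_model(X, Y, prior_mu, prior_sd):
    """Linear regression with a separate Normal prior for each column of X."""
    import pymc3 as pm  # imported here so the array setup below runs without PyMC3

    X = np.asarray(X, dtype=float)
    with pm.Model() as model:
        # Element j of prior_mu / prior_sd is the prior for player j.
        beta = pm.Normal('beta', mu=prior_mu, sd=prior_sd, shape=X.shape[1])
        alpha = pm.Normal('alpha', mu=0, sd=10)   # y-intercept
        sigma = pm.HalfNormal('sigma', sd=1)      # observation noise
        mu = alpha + pm.math.dot(X, beta)
        pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
    return model

# Hypothetical priors for 5 players: default N(0, 10), tighter for player 0.
n_players = 5
prior_mu = np.zeros(n_players)
prior_sd = np.full(n_players, 10.0)
prior_mu[0] = 1.5
prior_sd[0] = 2.0
```

Calling `build_model(X, Y, prior_mu, prior_sd)` and then running `pm.sample(500)` inside `with model:` would proceed as in the question. Using NumPy arrays rather than plain lists for both arguments is the safer choice.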

I highly recommend getting familiar with the theano.tensor and pymc3.math operations since sometimes you must use these to properly manipulate random variables, and in general it should lead to more efficient code.

Perfect, Merv! Thank you very much! mu and sd accept numpy arrays. In my case mu also accepts a list but sd does not. Using numpy arrays everything works fine! — P. Rinter, Dec 10 '18 at 18:56
