I am trying to fit a logistic regression model with PyMC. However, the diagnostic plots show very high autocorrelation, and repeated runs of the sampler sometimes give noticeably different posteriors, so I suspect I'm not using PyMC properly.
The model is the following:
Y_i ~ Bernoulli(p_i)
logit(p_i) = b0 + b1*x1_i + b2*x2_i
Here is my implementation:
import pymc as pm
import numpy as np

# Vague Normal priors on the coefficients (PyMC 2 parameterizes the
# Normal by precision, so tau = 1e-6 corresponds to sd = 1000)
b0 = pm.Normal('b0', 0., 1e-6, value=0.)
b1 = pm.Normal('b1', 0., 1e-6, value=0.)
b2 = pm.Normal('b2', 0., 1e-6, value=0.)

# Predictors and observed outcomes
x1 = np.array([31, 31, 36, 30, 32, 33, 31, 33, 32])
x2 = np.array([31, 16, 35, 42, 19, 37, 29, 23, 15])
value = np.array([1, 1, 1, 1, 0, 1, 1, 1, 0])

# Success probability: inverse logit of the linear predictor
@pm.deterministic
def p(b0=b0, b1=b1, b2=b2, x1=x1, x2=x2):
    return np.exp(b0 + b1*x1 + b2*x2) / (1.0 + np.exp(b0 + b1*x1 + b2*x2))

obs = pm.Bernoulli('obs', p=p, value=value, observed=True)
m = pm.MCMC([obs, b0, b1, b2])
When I sample with m.sample(500000, 200000, 50) (500,000 iterations, a burn-in of 200,000, and thinning by 50) and plot the resulting posterior, the trace and autocorrelation plots confirm the problem.
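For completeness, this is how I run the sampler and generate the diagnostics. pm.Matplot.plot is PyMC 2's built-in plotting helper; it produces a trace, an autocorrelation plot, and a histogram for each stochastic (assuming matplotlib is installed):

m.sample(500000, 200000, 50)  # iter, burn, thin
pm.Matplot.plot(m)            # trace / autocorrelation / histogram per variable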
In a second attempt to get better results, I used @pm.observed:
import pymc as pm
import numpy as np

# Same vague priors as before
b0 = pm.Normal('b0', 0., 1e-6, value=0.)
b1 = pm.Normal('b1', 0., 1e-6, value=0.)
b2 = pm.Normal('b2', 0., 1e-6, value=0.)

x1 = np.array([31, 31, 36, 30, 32, 33, 31, 33, 32])
x2 = np.array([31, 16, 35, 42, 19, 37, 29, 23, 15])
value = np.array([1, 1, 1, 1, 0, 1, 1, 1, 0])

# Linear predictor; operator overloading turns this into a Deterministic
p = b0 + b1*x1 + b2*x2

# Observed node returning the Bernoulli log-likelihood,
# with p mapped through the inverse logit
@pm.observed
def obs(p=p, value=value):
    return pm.bernoulli_like(value, pm.invlogit(p))

m = pm.MCMC([obs, p, b0, b1, b2])
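I sample this model exactly as before and inspect the posterior summaries; m.stats() returns a dict of statistics (mean, standard deviation, quantiles) per traced variable:

m.sample(500000, 200000, 50)
print(m.stats())  # posterior summaries for b0, b1, b2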
But this version also shows high autocorrelation.
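To quantify this rather than just eyeballing the plots, I compute lag autocorrelations directly from the stored trace (lag_autocorr is a small helper of my own, not a PyMC function; values near 1 mean the chain is barely moving):

trace = m.trace('b1')[:]  # posterior draws for b1, after burn-in and thinning

def lag_autocorr(x, lag):
    # Sample autocorrelation of x at the given lag (my own helper, not PyMC API)
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

print(lag_autocorr(trace, 1), lag_autocorr(trace, 10))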
I have also increased the number of iterations, without much success. What am I missing?