Good morning,
I'm in the process of learning PyMC3 and to start, I generated some synthetic data to use in estimating a Poisson regression. One thing I realized quickly is that I needed to make my model covariates zero mean and unit variance prior to fitting the model to avoid numerical issues. I performed the normalization outside of the PyMC3 model specification which leaves me wondering if there's some way to integrate it? If the normalization is not integrated, I'm left having to estimate the means and standard deviations of the covariates from the training data and make sure I apply that same normalization to future test data. I'd love it if that could be part of the model itself so that I don't have to always apply these external operations as well. Is that possible? If so, I'd love any pointers to examples of how to do it.
Thanks!
Chris
Here's an example I've been experimenting with that helps clarify my question. I'm wondering how I might move the normalization of the covariates into the PyMC3 model specification at the end of the code block.
# Generate synthetic data
N = 10000
tree_height = []
wind_speed = []
event_count = []
for i in range(N):
tree_height.append(np.random.rayleigh(scale=10))
wind_speed.append(np.random.rayleigh(scale=5))
event_count.append(np.random.poisson(tree_height[-1] + wind_speed[-1] + 0.1*tree_height[-1]*wind_speed[-1]))
# Normalize the synthetic covariates to be zero mean, unit variance
mn_tree_height = np.mean(tree_height)
std_tree_height = np.std(tree_height)
mn_wind_speed = np.mean(wind_speed)
std_wind_speed = np.std(wind_speed)
tree_height = (tree_height-mn_tree_height) / std_tree_height
wind_speed = (wind_speed-mn_wind_speed) / mn_wind_speed
# Build the data frame
df = pd.DataFrame.from_dict({'tree_height': tree_height, 'wind_speed': wind_speed, 'event_count': event_count})
# Patsy model specification
fml = 'event_count ~ tree_height * wind_speed'
# Design matrices
(outcome,covars) = pt.dmatrices(fml, df, return_type='dataframe', NA_action='raise')
# Theano shared variables for mini-batch training and testing
wind_speed = shared(covars.wind_speed.values)
tree_height = shared(covars.tree_height.values)
ws_th = shared(covars['tree_height:wind_speed'].values)
# PyMC3 model specification
with pm.Model() as m:
b0 = pm.Normal('b0_intercept', mu=0, sigma=10)
b1 = pm.Normal('b1_wind_speed', mu=0, sigma=10)
b2 = pm.Normal('b2_tree_height', mu=0, sigma=10)
b3 = pm.Normal('b3_tree_height:wind_speed', mu=0, sigma=10)
theta = (b0 +
b1 * wind_speed +
b2 * tree_height +
b3 * ws_th)
y = pm.Poisson('outcome', mu=np.exp(theta), observed=outcome['event_count'].values)