I'm trying to get into language processing, starting with simple HMMs but building out to models that may require discrete parameters (so Stan may not work). Ultimately, I want to process a lot of language data, so I need considerable efficiency. PyMC3 seems like it may be my only option (I'm open to suggestions).
I've run into problems trying out a simple 'hidden' Markov model: the first HMM in the Stan manual, in which the 'latent' states are actually observed data. The model, with N=300 artificial data points, runs in PyMC3 with Metropolis, but it takes 11 minutes compared to Stan's 2.5 seconds (using NUTS). A slowdown of several hundred times suggests the Metropolis sampler isn't adequately scalable. The NUTS sampler in PyMC3 gives a "Scaling is not positive definite" error, despite being fed a quite accurate MAP estimate.
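In case it's relevant, the sampler setup follows the standard PyMC3 pattern; schematically (the exact script is in the linked folder):

from pymc3 import find_MAP, NUTS, Metropolis, sample

with basic_model:
    start = find_MAP()          # a quite accurate MAP estimate in this case
    step = NUTS(scaling=start)  # fails: "Scaling is not positive definite"
    # step = Metropolis()       # works, but takes ~11 minutes at N=300
    trace = sample(1000, step=step, start=start)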
I also tried running the model in PyMC3 with N=1000, using both the fmin_powell and L-BFGS-B optimizers for the MAP estimate (calls sketched below). Both ran for two hours and then crashed after using up all available memory on my system (16GB RAM, 16GB swap); I have no idea why. Stan does not require explicit MAP estimates, and it completed the Bayesian analysis with N=1000 in about 20 seconds. If I read the Stan manual correctly, it simply starts with random values, not a MAP estimate.
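The MAP calls were along these lines, passing the SciPy optimizers through find_MAP's fmin argument:

from scipy import optimize
from pymc3 import find_MAP

with basic_model:
    # each ran ~2 hours at N=1000, then exhausted 16GB RAM + 16GB swap
    start = find_MAP(fmin=optimize.fmin_powell)
    # start = find_MAP(fmin=optimize.fmin_l_bfgs_b)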
I'm not clear whether my issue is due to some rookie flaw in the model I fed PyMC3, to some problem or limitation in PyMC3, or whether this just isn't the kind of model suited to PyMC3 (or perhaps to Bayesian estimation at all). I have tried to further vectorize the model, which may help, but I can't figure out a version that works, nor have I found much online guidance; my best attempt is sketched after the model below.
All the code I used to run the model, plus the artificial data, can be found here: https://drive.google.com/folderview?id=0B8242b1Xh6pzSWFDM2lQbDZwajg&usp=sharing
The model is:
import numpy as np
from pymc3 import Model, Dirichlet, Categorical

# state, emission: integer arrays of states and emissions from the linked data

basic_model = Model()
with basic_model:
    # Model constants:
    K = 3; V = 9; T = 300  # num unique hidden states, num unique emissions, num instances
    alpha = np.ones(K); beta = np.ones(V)
    # Priors for unknown model parameters
    theta = np.empty(K, dtype=object)  # theta = transition distributions
    phi = np.empty(K, dtype=object)    # phi = emission distributions
    w = np.empty(T, dtype=object); z = np.empty(T - 1, dtype=object)  # observed emission, state
    for k in range(K):
        theta[k] = Dirichlet('theta' + str(k), alpha)
        phi[k] = Dirichlet('phi' + str(k), beta)
    # Likelihood (sampling distribution) of observations
    for t in range(1, T):
        z[t - 1] = Categorical('z' + str(t), theta[state[t - 1]], shape=1, observed=state[t])
    for t in range(T):
        w[t] = Categorical('w' + str(t), phi[state[t]], shape=1, observed=emission[t])
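For reference, here is the vectorized version I've been aiming at. It assumes Dirichlet accepts a two-dimensional concentration parameter (one simplex per row) and that Categorical accepts a row-per-observation probability matrix; I haven't gotten it to run, so treat it as a sketch rather than working code:

import numpy as np
from pymc3 import Model, Dirichlet, Categorical

K, V, T = 3, 9, 300
state = np.random.randint(0, K, size=T)     # placeholder for the linked state data
emission = np.random.randint(0, V, size=T)  # placeholder for the linked emission data

with Model() as vec_model:
    # one Dirichlet simplex per row: K transition rows, K emission rows
    theta = Dirichlet('theta', a=np.ones((K, K)), shape=(K, K))
    phi = Dirichlet('phi', a=np.ones((K, V)), shape=(K, V))
    # row t of theta[state[:-1]] is the transition distribution out of state[t]
    z = Categorical('z', p=theta[state[:-1]], observed=state[1:])
    # row t of phi[state] is the emission distribution for state[t]
    w = Categorical('w', p=phi[state], observed=emission)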