
I'm trying to get into language processing, starting with simple HMMs but building out to models that may require discrete parameters (so Stan may not work). Ultimately, I want to process a lot of language data, so I need considerable efficiency. PyMC3 seems like possibly my only option (I am open to suggestions).

I've run into problems trying out a simple 'hidden' Markov model: the first HMM in the Stan manual, in which the 'latent' states are actually observed data. The model, with N=300 artificial data points, runs in PyMC3 with Metropolis, but this takes 11 minutes compared to Stan's 2.5 seconds (using NUTS). Taking hundreds of times longer suggests the Metropolis sampler isn't adequately scalable. The NUTS sampler in PyMC3 gives the "Scaling is not positive definite" error, despite being fed what is a quite accurate MAP estimate.
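For reference, the sampling calls I'm describing follow the usual PyMC3 pattern, roughly like this (a simplified sketch; the exact script is in the linked folder, and the API may differ across PyMC3 versions):

    import pymc3 as pm

    with basic_model:
        start = pm.find_MAP()              # MAP estimate used as the starting point
        step = pm.Metropolis()             # runs, but takes ~11 minutes for N=300
        trace = pm.sample(2000, step=step, start=start)

    with basic_model:
        start = pm.find_MAP()
        step = pm.NUTS(scaling=start)      # raises "Scaling is not positive definite"
        trace = pm.sample(2000, step=step, start=start)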

I also tried running the model in PyMC3 with N=1000, computing the MAP estimate with both the fmin_powell and L-BFGS-B optimizers. Both ran for two hours and crashed after using up all available memory on my system (16GB RAM, 16GB swap); I have no idea why. Stan does not require an explicit MAP estimate and completed the Bayesian analysis with N=1000 in about 20 seconds. If I read the Stan manual correctly, it simply starts from random values, not a MAP estimate.
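The MAP attempts for N=1000 were along these lines (again a sketch; I'm assuming find_MAP accepts the scipy optimizer through its fmin argument, as in the PyMC3 version I'm using):

    import scipy.optimize as opt
    import pymc3 as pm

    with basic_model:
        # derivative-free Powell optimizer
        start = pm.find_MAP(fmin=opt.fmin_powell)

    with basic_model:
        # gradient-based L-BFGS-B optimizer
        start = pm.find_MAP(fmin=opt.fmin_l_bfgs_b)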

I'm not clear whether my issue is due to some rookie flaw in the model I fed PyMC3, to some problem or limitation in PyMC3, or whether this is just not the kind of model meant to be addressed with PyMC3 (or perhaps with Bayesian estimation at all). I have tried to further vectorize the model, which may help, but I can't seem to figure out something that works, nor have I seen much online to provide guidance; a sketch of what I have in mind is below.
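For concreteness, the kind of vectorization I have in mind replaces the per-time-step Categorical nodes with a matrix of Dirichlet transition rows, a matrix of Dirichlet emission rows, and one vectorized likelihood per observation sequence. Something like the following untested sketch (I have not managed to get a variant of this to run):

    import numpy as np
    import pymc3 as pm

    K, V = 3, 9  # num unique hidden states, num unique emissions
    with pm.Model() as vectorized_model:
        # K rows of transition probabilities and K rows of emission probabilities
        theta = pm.Dirichlet('theta', a=np.ones((K, K)), shape=(K, K))
        phi = pm.Dirichlet('phi', a=np.ones((K, V)), shape=(K, V))
        # index the probability rows by the observed previous / current state
        z = pm.Categorical('z', p=theta[state[:-1]], observed=state[1:])
        w = pm.Categorical('w', p=phi[state], observed=emission)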

All code I used to run the model, plus the artificial data used can be found here: https://drive.google.com/folderview?id=0B8242b1Xh6pzSWFDM2lQbDZwajg&usp=sharing

The model is:

import numpy as np
from pymc3 import Model, Dirichlet, Categorical

# `state` and `emission` are the observed integer sequences from the
# artificial data in the linked folder.
basic_model = Model()
with basic_model:
    # Model constants:
    K = 3; V = 9; T = 300  # num unique hidden states, num unique emissions, num instances
    alpha = np.ones(K); beta = np.ones(V)
    # Priors for unknown model parameters
    theta = np.empty(K, dtype=object)  # theta = transition probabilities
    phi = np.empty(K, dtype=object)    # phi = emission probabilities
    w = np.empty(T, dtype=object); z = np.empty(T - 1, dtype=object)  # observed emission, state
    for k in range(K):
        theta[k] = Dirichlet('theta' + str(k), alpha)
        phi[k] = Dirichlet('phi' + str(k), beta)
    # Likelihood (sampling distribution) of observations
    for t in range(1, T):
        z[t - 1] = Categorical('z' + str(t), theta[state[t - 1]], shape=1, observed=state[t])
    for t in range(T):
        w[t] = Categorical('w' + str(t), phi[state[t]], shape=1, observed=emission[t])
JasonK
  • If you really need to scale an HMM, you want to use a more dedicated package and compute penalized MLE or MAP estimates. By default, Stan uses a random uniform(-2, 2) initialization on the unconstrained scale. The MAP estimate isn't always a good starting point, but it'd probably work for HMMs. Stan can compute the MAP estimate directly. – Bob Carpenter Sep 02 '16 at 11:20
  • Thanks Bob for your reply. I didn't notice anything in the Stan manual about actually computing MAP estimates w/ Stan, but your comment made me look harder and I found the 'optimizing' function in Rstan, which has proved very helpful and very fast. I wonder what else you may have in mind for a dedicated package for MLE / MAP estimates or HMM modeling? – JasonK Sep 07 '16 at 00:18
  • This is off the topic for StackOverflow, but we don't have any immediate plans for Stan for improving HMMs. What would be potentially useful would be to have custom HMM functions (e.g., forward-backward or custom derivatives for likelihood). The manual is for the language and algorithms, but doesn't cover how to call the algorithms from the interfaces---there are separate manuals for each interface. – Bob Carpenter Sep 08 '16 at 18:29
