
I am using the pystan module on Windows, where the module does not support multithreading. pystan is partially written in C++, and since I am trying to decrease the run time, I am wondering if there is a way to hand-write multithreading code in the C++ part of the module so that I can afford more iterations. Below is the code:

from __future__ import division
import pystan
import numpy as np
import os 

x=np.array([453.05,453.05,453.24,453.35,453.44,453.44,453.83,454.02,454.89])
y=np.array([3232.12,3231.45,3231.90,3231.67,3231.84,3231.95,3231.89,3231.67,3231.45])
x=np.array(list(zip(x,y))) # list() needed on Python 3, where zip returns a lazy iterator
c=np.array([0.01,0.07,0.001,0.1,0.05,0.001,0.001,0.05,0.001])
s = np.array([454.4062631951059,3230.808656891571])
st=np.array([12,12,12,12,12,12,12,12,12])

model='''
    data {
     int D; //number of dimensions
     int K; //number of gaussians
     int N; //number of data 

     vector[D] y[N]; // observation data
     real con[N]; //concentration
     vector[D] s;//oil spill location
     real st[N]; // sample time
    }

    parameters {
     simplex[K] theta; //mixing proportions
     vector[D] v[K];
     vector<lower=0>[D] Dif[K];
     cholesky_factor_corr[D] L[K]; //cholesky factor of correlation matrix
    }

    transformed parameters {
      cholesky_factor_cov[D,D] cov[K,N];
      vector<lower=0>[D] sigma[K,N]; // standard deviations  
      vector[D] mu[K,N];
      real ro[K];
      matrix[D,D] Omega[K];
      matrix[D,D] Sigma[K,N];  
      vector[N] lamba;  

      for (k in 1:K) {  
      Omega[k] = multiply_lower_tri_self_transpose(L[k]);
         for (n in 1:N){
          sigma[k,n] = 0.05 + sqrt(2*st[n]*Dif[k]);
          mu[k,n] = s+v[k]*st[n];
          cov[k,n] = diag_pre_multiply(sigma[k,n],L[k]);
      Sigma[k,n] = quad_form_diag(Omega[k], sigma[k,n]); 
         }
      ro[k]=Omega[k,2,1]; 
      }

      // lamba[i] = 1 / (mixture density at y[i]); the bivariate normal
      // density of each of the K components is written out by hand.
      for (i in 1:N) {
        real acc = 0;
        for (k in 1:K) {
          acc += theta[k]
                 * (1./2./3.1415926
                    / sqrt(Sigma[k,i,1,1]) / sqrt(Sigma[k,i,2,2])
                    / sqrt(1 - ro[k]*ro[k]))
                 * exp(-1./2./(1 - ro[k]*ro[k])
                       * (-(y[i,1] - mu[k,i,1])*(y[i,1] - mu[k,i,1])/Sigma[k,i,1,1]
                          - (y[i,2] - mu[k,i,2])*(y[i,2] - mu[k,i,2])/Sigma[k,i,2,2]
                          + 2.*ro[k]*(y[i,1] - mu[k,i,1])*(y[i,2] - mu[k,i,2])
                            / sqrt(Sigma[k,i,1,1]) / sqrt(Sigma[k,i,2,2])));
        }
        lamba[i] = 1/acc;
      }
    }

    model {
     real ps[K];
     theta ~ dirichlet(rep_vector(2.0, 4));
     for(k in 1:K){
     v[k,1] ~ normal(0.0,4.1);// uniform(340/100,380/100);//
     v[k,2] ~  normal(0.0,4.1);//uniform(3160/100,3190/100);//
     Dif[k] ~ normal(0.5,0.2);//exponential(0.05);//beta(2,5);
     L[k] ~ lkj_corr_cholesky(2);// contain rho 
     }

     for (n in 1:N){
     for (k in 1:K){
      ps[k] = log(theta[k]) + multi_normal_cholesky_lpdf(y[n] | mu[k,n], cov[k,n]); // increment log probability of the gaussian
     }
     target += log_sum_exp(ps);
     }
     for (i in 1:N) {
       target += -lamba[i]*con[i] + log(lamba[i]); // exponential(lamba) log-likelihood of con, added once
     }
    }
    '''

dat = {'D': 2, 'K': 4, 'N': 9, 'y': x, 'con': c, 's': s, 'st': st}
fit = pystan.stan(model_code=model, data=dat, iter=1000, warmup=500, chains=1, init_r=0.5)
print(fit)

I'm not very proficient in C++, since I have mostly been using Python, and pystan compiles its models down to C++. I hope there is a way to spread the iterations across the different cores of my Windows machine.

  • Isolate code/data which can be run in parallel, move that code to a thread, and profile it to see whether you gain something. – user743414 Dec 04 '18 at 11:02
  • @user743414 how do I move that code to a thread? – nik jacks Dec 04 '18 at 13:40
  • First, please limit the code you post to relevant portions - many of the imports are unused and distract from the question. Second, while the Stan backend and CmdStan are *written* in C++, the Stan probabilistic programming language is not C++ but a proper language of its own that happens to have a C-like syntax. Finally, it is unclear whether you're asking about running multiple chains or trying to parallelize a single chain. – merv Dec 04 '18 at 17:33
  • @merv My apologies. I am asking about parallelizing a single chain. I know parallelizing a single chain is not supported by the Stan module on Windows, so I am wondering whether there is a way to write code in the C-like language or in Python that can parallelize the chain. – nik jacks Dec 04 '18 at 18:03

1 Answer


Since Stan is its own language, you can only achieve what the compiler is designed to parse and emit code for, which does not include support for arbitrary C++ code.

The Stan backend does provide support for parallelizing single chains via MPI, but as you correctly note, this unfortunately doesn't extend to Windows at this time. Other than trying to compile your own version of the backend in a way that leverages the MPI libraries, there's not really anything you can do in the modeling language per se to remedy this.

If you are on a non-Windows platform, you can start taking advantage of the new parallelized math library through the map-reduce operation `map_rect`. The User Guide (see 22. Map-Reduce, pp 237-244) has details on using the method, with examples.
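To give a concrete flavor of what that involves, here is a minimal sketch of sharding a likelihood with `map_rect`, using a bare exponential observation model similar to the `con ~ exponential(...)` term in your model rather than the full mixture. The function name `shard_lp`, the one-observation-per-shard layout, and the `normal(1, 1)` prior are illustrative assumptions, not something prescribed by Stan:

    functions {
      // One shard = one observation. phi carries parameters shared across
      // shards, theta carries shard-specific parameters (none here), and
      // x_r/x_i carry this shard's real and integer data.
      vector shard_lp(vector phi, vector theta, real[] x_r, int[] x_i) {
        vector[1] lp;
        lp[1] = exponential_lpdf(x_r | phi[1]);
        return lp;
      }
    }
    data {
      int<lower=1> N;
      real<lower=0> con[N];
    }
    transformed data {
      // map_rect needs its data arguments as data-only 2D arrays,
      // one row per shard.
      real x_r[N, 1];
      int x_i[N, 0]; // no integer data, but the argument is still required
      for (n in 1:N)
        x_r[n, 1] = con[n];
    }
    parameters {
      real<lower=0> lambda;
    }
    model {
      vector[0] theta[N]; // no shard-specific parameters in this sketch
      lambda ~ normal(1, 1); // placeholder prior
      // Each shard's log-likelihood is evaluated (possibly in parallel)
      // and the pieces are summed into the target.
      target += sum(map_rect(shard_lp, rep_vector(lambda, 1), theta, x_r, x_i));
    }

In practice you would slice the data into roughly as many shards as you have cores; with one observation per shard, the per-shard overhead tends to eat the gains.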

Note that you can still use this method on any platform with a v2.18+ Stan backend installed, but MPI will only get used on Linux and Mac OS X. I'm not sure what the plans are for rolling out MPI support on Windows, but it might be worth keeping an eye on this MPI Parallelism wiki page.


Update: Possible Experimental Threading Support

It seems some have been playing around with compiling the Stan backend against the prerelease version of RTools, which includes g++ 8.x and can thereby enable the threaded math libraries and MPI. It sounds like a rabbit hole, but if you want to take the red pill, here it is.
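If you do end up with a toolchain whose threaded math library works, the PyStan 2 side would be wired up roughly as follows. This is a sketch I have not verified on Windows: `-DSTAN_THREADS` and `STAN_NUM_THREADS` are the Stan math threading switches, `extra_compile_args` is how PyStan passes extra compiler flags, and the thread count of 4 is an arbitrary example:

    import os
    import pystan

    # Assumes a g++ 8.x toolchain and a Stan math library built with
    # threading support; unverified on Windows, as discussed above.
    os.environ['STAN_NUM_THREADS'] = '4'  # threads available to map_rect

    sm = pystan.StanModel(model_code=model,  # a model that uses map_rect
                          extra_compile_args=['-DSTAN_THREADS'])
    fit = sm.sampling(data=dat, iter=1000, warmup=500, chains=1)

Threading only pays off for models that actually route work through `map_rect`; it does nothing for a model written like the one in the question.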
