
I am using the functions defined here: Extreme value analysis and user defined probability functions in Stan, to model the data with a generalized Pareto distribution. My problem is that my model block uses a for-loop and passes three real-valued arguments, whereas the GPD functions expect a (vector, real, real) signature.

I'm not sure that my model block is amenable to being vectorized, so I was thinking I would need to make the GPD functions take real-valued arguments (but maybe I'm wrong).

I'd appreciate any help with restructuring the code to achieve this. Here is my Stan code:

functions {
  real gpareto_lpdf(vector y, real k, real sigma) {
    // generalised Pareto log pdf 
    int N = rows(y);
    real inv_k = inv(k);
    if (k<0 && max(y)/sigma > -inv_k)
      reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma)
    if (sigma<=0)
      reject("sigma<=0; found sigma =", sigma)
    if (fabs(k) > 1e-15)
      return -(1+inv_k)*sum(log1p((y) * (k/sigma))) -N*log(sigma);
    else
      return -sum(y)/sigma -N*log(sigma); // limit k->0
  }

  real gpareto_lcdf(vector y, real k, real sigma) {
    // generalised Pareto log cdf
    real inv_k = inv(k);
    if (k<0 && max(y)/sigma > -inv_k)
      reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma)
    if (sigma<=0)
      reject("sigma<=0; found sigma =", sigma)
    if (fabs(k) > 1e-15)
      return sum(log1m_exp((-inv_k)*(log1p((y) * (k/sigma)))));
    else
      return sum(log1m_exp(-(y)/sigma)); // limit k->0
  }
}

data {
  // the input data
  int<lower = 1> n;                       // number of observations
  real<lower = 0> value[n];              // value measurements
  int<lower = 0, upper = 1> censored[n];  // vector of 0s and 1s

  // parameters for the prior
  real<lower = 0> a;
  real<lower = 0> b;
}

parameters {
  real k;
  real sigma;
}

model {
 // prior
 k ~ gamma(a, b);
 sigma ~ gamma(a,b);

 // likelihood
  for (i in 1:n) {
    if (censored[i]) {
      target += gpareto_lcdf(value[i] | k, sigma);  
    } else {
      target += gpareto_lpdf(value[i] | k, sigma);
    }
  }
}
John Smith
  • I think it would be easier and faster to change the likelihood to accept a vector. A typical strategy is to add to the data block a vector giving the indices of the censored/uncensored obs e.g. `int cens_id[n]` and set this to (in R code) `which(censored)`. Then in the likelihood skip the loop and just write `target += gpareto_lcdf(value[cens_id] | k, sigma)` and then `target += gpareto_lpdf(value[notcens_id] | k, sigma)`. Saves on computation and re-writing the distribution functions. – gfgm Feb 22 '22 at 10:22
  • @gfgm would you mind writing this up as an answer? I'm running into errors implementing it. – John Smith Feb 22 '22 at 20:24
  • My comment wasn't very clear and has a bug (cens_id[n] should be cens_id[n_cens]), so apologies. I put a working version in an answer below. – gfgm Feb 22 '22 at 21:26

2 Answers


Here is how the log PDF could be adapted so that index arrays for subsetting y into censored and uncensored observations can be passed:

real cens_gpareto_lpdf(vector y, int[] cens, int[] no_cens, real k, real sigma) {
  // generalised Pareto log pdf; censored obs enter through the log cdf
  int N = size(no_cens);  // number of uncensored obs, each contributing -log(sigma)
  real inv_k = inv(k);
  if (k<0 && max(y)/sigma > -inv_k)
    reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma);
  if (sigma<=0)
    reject("sigma<=0; found sigma =", sigma);
  if (fabs(k) > 1e-15)
    return -(1+inv_k)*sum(log1p((y[no_cens]) * (k/sigma))) -N*log(sigma) +
      sum(log1m_exp((-inv_k)*(log1p((y[cens]) * (k/sigma)))));
  else
    return -sum(y[no_cens])/sigma -N*log(sigma) +
      sum(log1m_exp(-(y[cens])/sigma)); // limit k->0
}

Extend the data block: n_cens, n_not_cens, cens, and no_cens are values that need to be supplied.

  int<lower = 1> n;               // total number of obs
  int<lower = 1> n_cens;          // number of censored obs
  int<lower = 1> n_not_cens;      // number of regular obs
  
  int cens[n_cens];               // index set censored
  int no_cens[n_not_cens];        // index set regular
  
  vector<lower = 0>[n] value;     // value measurements
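
In R, these index sets could be built from a 0/1 censoring indicator like the one in the original data block (a minimal sketch; the names mirror the declarations above, and the prior parameters are illustrative):

# sketch: construct the index sets for the extended data block from the
# measurements `value` and a 0/1 indicator `censored` of length n
cens    <- which(censored == 1)   # positions of censored observations
no_cens <- which(censored == 0)   # positions of uncensored observations

stan_data <- list(
  n          = length(value),
  n_cens     = length(cens),
  n_not_cens = length(no_cens),
  cens       = cens,
  no_cens    = no_cens,
  value      = value,
  a          = 2,                 # illustrative prior parameters
  b          = 2
)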

Constrain the parameters to be positive, as suggested by gfgm:

parameters {
  real<lower=0> k;
  real<lower=0> sigma;
}

Rewrite the model block (the sampling statement `value ~ cens_gpareto(...)` works because Stan resolves it to the user-defined cens_gpareto_lpdf by appending the _lpdf suffix):

model {
  // prior
  k ~ gamma(a, b);
  sigma ~ gamma(a,b);
  // likelihood
  value ~ cens_gpareto(cens, no_cens, k, sigma);
}

Disclaimer: I neither checked the formulas for sanity nor ran the model using test data. It just compiled via rstan::stan_model(), which worked fine. gfgm's suggestion may be more convenient for post-processing / computing stuff in generated quantities etc. I'm not a Stan expert :-).
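
For anyone who does want to sanity-check the formulas without fitting the model, the function can be exposed to R via rstan::expose_stan_functions (a sketch; it assumes the full Stan program above is saved as gpareto_censored.stan, and the file name and test values are made up):

library(rstan)

# compile and expose the user-defined Stan functions to the R session
expose_stan_functions("gpareto_censored.stan")

y <- c(0.5, 1.2, 2.0)
k <- 0.3
sigma <- 1.0

# with no censored observations, the lpdf should equal the sum of
# generalised Pareto log densities computed by hand
by_hand  <- sum(-(1 + 1/k) * log1p(y * k / sigma) - log(sigma))
via_stan <- cens_gpareto_lpdf(y, integer(0), 1:3, k, sigma)
all.equal(by_hand, via_stan)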

Edit:

Fixed the divergence issue gfgm found through simulation: the likelihood was ill-defined because N was computed from rows(y) (all observations) rather than from the number of uncensored observations that multiply -log(sigma). Runs fine now with gfgm's data (using set.seed(123) and rstan):

         mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
k        0.16    0.00 0.10    0.02    0.08    0.14    0.21    0.42  1687    1
sigma    0.90    0.00 0.12    0.67    0.82    0.90    0.99    1.16  1638    1
lp__  -106.15    0.03 1.08 -109.09 -106.56 -105.83 -105.38 -105.09  1343    1
Martin C. Arnold

You can get the model to accept a vector and avoid a for-loop. You need to change the signature when you declare value, and also feed in as data the indices of the censored and uncensored observations.

I'm posting below code that runs on made-up data, but the data I made up are just some random log-normal variates and not actual draws from a Pareto, so I have no idea if the model is recovering the parameters or what its coverage looks like. You'll probably want to do a bit of simulation-based calibration to check the model.

The Stan code:

functions {
  real gpareto_lpdf(vector y, real k, real sigma) {
    // generalised Pareto log pdf 
    int N = rows(y);
    real inv_k = inv(k);
    if (k<0 && max(y)/sigma > -inv_k)
      reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma);
    if (sigma<=0)
      reject("sigma<=0; found sigma =", sigma);
    if (fabs(k) > 1e-15)
      return -(1+inv_k)*sum(log1p((y) * (k/sigma))) -N*log(sigma);
    else
      return -sum(y)/sigma -N*log(sigma); // limit k->0
  }

  real gpareto_lcdf(vector y, real k, real sigma) {
    // generalised Pareto log cdf
    real inv_k = inv(k);
    if (k<0 && max(y)/sigma > -inv_k)
      reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma);
    if (sigma<=0)
      reject("sigma<=0; found sigma =", sigma);
    if (fabs(k) > 1e-15)
      return sum(log1m_exp((-inv_k)*(log1p((y) * (k/sigma)))));
    else
      return sum(log1m_exp(-(y)/sigma)); // limit k->0
  }
}

data {
  // the input data
  int<lower = 1> n;                       // number of observations
  int<lower = 0> n_cens;
  vector<lower = 0>[n] value;              // value measurements
  int<lower=1> cens_id[n_cens]; // the indices of censored observations
  int<lower=1> nocens_id[n - n_cens]; // the indices of uncensored obs

  // parameters for the prior
  real<lower = 0> a;
  real<lower = 0> b;
}

parameters {
  real<lower=0> k;
  real<lower=0> sigma;
}

model {
 // prior
 k ~ gamma(a, b);
 sigma ~ gamma(a,b);

 // likelihood
 target += gpareto_lcdf(value[cens_id] | k, sigma);  
 target += gpareto_lpdf(value[nocens_id] | k, sigma);
}

This runs basically instantaneously on my mid-tier laptop using cmdstanr. You can run it with this R code:

library(cmdstanr)
library(bayesplot)

gparet_mod <- cmdstan_model("gpareto_example.stan")

# make some fake data. This won't be correct and it's
# no proper test of the capacity of the model to 
# recover the parameters, just seeing if it runs

N <- 100
value <- rlnorm(N)
censored <- rbinom(N, 1, .5)
N_cens <- sum(censored)
cens_id <- which(censored == 1)
nocens_id <- which(censored == 0)

a <- b <- 2

dat <- list(n = N, n_cens = N_cens, value = value,
            cens_id = cens_id, nocens_id = nocens_id,
            a = a, b = b)

ests <- gparet_mod$sample(data = dat, parallel_chains = 2, chains = 2)

This produces what appear to be sane samples:

ests$summary()
# A tibble: 3 × 10
  variable     mean   median    sd   mad       q5      q95  rhat ess_bulk ess_tail
  <chr>       <dbl>    <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl>    <dbl>    <dbl>
1 lp__     -103.    -103.    1.10  0.767 -105.    -102.     1.00     750.     807.
2 k           0.316    0.296 0.135 0.131    0.128    0.557  1.00     665.     570.
3 sigma       0.691    0.685 0.113 0.110    0.511    0.887  1.00     772.     887.

And posterior distributions:

mcmc_hist(ests$draws(variables = c("k", "sigma")))

[Figure: posterior distributions of k and sigma]

I think Martin Arnold's approach is a bit more elegant, but when I tried to run it on the same data the sampler broke down with divergences, so maybe I ported the model wrong or something needs tweaking.

Edit

Given the likelihood functions, it makes more sense to constrain the parameters k and sigma to be positive. The model runs better and you'll get fewer error messages. I modified the Stan code above to reflect this.
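
Since the fake data are log-normal rather than generalised Pareto, a quick recovery check could instead simulate true GPD draws by inverse-CDF sampling (a sketch; rgpd_inv is an illustrative helper, not a package function, and the parameterisation matches the Stan functions above):

# sketch: inverse-CDF sampler for the generalised Pareto distribution
rgpd_inv <- function(n, k, sigma) {
  u <- runif(n)
  if (abs(k) > 1e-15) {
    sigma / k * ((1 - u)^(-k) - 1)
  } else {
    -sigma * log(1 - u)  # limit k -> 0 (exponential)
  }
}

# swap these into `value` in the data list and check that the posterior
# concentrates near the true k = 0.2, sigma = 1
value <- rgpd_inv(N, k = 0.2, sigma = 1)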

gfgm
  • Thanks for pointing out the divergence issue! There was an error in the likelihood function (see my edit). Also overlooked that k and sigma should be positive! – Martin C. Arnold Feb 22 '22 at 22:21