Estimating parameters using stan when the distribution for response variable in a regression is non-normal - Part 2

Question

This is an extension of my previous post here Estimating parameters using stan when the distribution for response variable in a regression is non-normal.

Let say I have below data

dat = list(y = c(0.00792354094929414, 0.00865300734292492, 0.0297400780486734, 
0.0196358416326437, 0.00239020640762042, 0.0258055591736283, 
0.17394835142698, 0.156463554455613, 0.329388185725557, 0.00764435088817635, 
0.0162081480398152, 0, 0.00157591399416963, 0.420025972703085, 
0.000122623651944455, 0.133061480234834, 0.565454216154227, 0.000281973481299731, 
0.000559715156383041, 0.0270686389659072, 0.918300537689865, 
0.00000782624683025728, 0.00732414341919458, 0, 0, 0, 0, 0, 0, 
0, 0.174071274611405, 0.0432109713717948, 0.0544400838264943, 
0, 0.0907049925221286, 0.616680102647887, 0, 0), x = c(23.8187587698947, 
15.9991138359515, 33.6495930512881, 28.555818797764, -52.2967967248258, 
-91.3835208788233, -73.9830692708321, -5.16901145289629, 29.8363012310241, 
10.6820057903939, 19.4868517164395, 15.4499668436458, -17.0441644773509, 
10.7025053739577, -8.6382953428539, -32.8892974839165, -15.8671863161348, 
-11.237248036145, -7.37978020066205, -3.33500586334862, -4.02629933182873, 
-20.2413384726948, -54.9094885578775, -48.041459120976, -52.3125732905322, 
-35.6269065970458, -62.0296155423529, -49.0825017152659, -73.0574478287598, 
-50.9409090127938, -63.4650928035253, -55.1263264283842, -52.2841103768755, 
-61.2275334149805, -74.2175990067417, -68.2961107804698, -76.6834643609286, 
-70.16769103228), N = 38)

I want to fit a logit model on above data based on fractional response variable. Therefore, below is my stan model code

model = "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}

transformed data {
  vector[N] z = bernoulli_rng(y);
}

parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}

transformed parameters {
    vector[N] mu;
    mu = alpha + beta * x;
}

model {
  sigma ~ normal(0, 1);
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  z ~ bernoulli(mu);
}

"
sampling(stan_model(model_code = model), data = dat, chains = 4, iter = 50000, refresh = 0)

With this I am getting below error

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Variable definition base type mismatch, variable declared as base type vector variable definition has base type int[ ] error in 'model93e37bdec88_3b62e3bb17b9f3ed9c717c98aa6ca9ac' at line 9, column 32
  -------------------------------------------------
     7: 
     8: transformed data {
     9:   vector[N] z = bernoulli_rng(y);
                                       ^
    10: }
  -------------------------------------------------

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'object' in selecting a method for function 'sampling': failed to parse Stan model '3b62e3bb17b9f3ed9c717c98aa6ca9ac' due to the above error.

Could you please help me to find the correct specification of the stan model?

What do you mean with fractional response variable? Let's say your y is .5, would that mean that half of the outcomes were one? If so do you have the original counts for that? — LukasNeugebauer, Mar 14 '22 at 09:03
Yes `y` can be considered as ratio of something which is reported by some system. Unfortunately I don't have total number of events as our data recording system does not report this. So basically, we are trying to predict the ratio (i.e. success) based on exogenous variable `x`. Another example, `Loss ratio` in an event of default for some Debt instrument — Bogaso, Mar 14 '22 at 14:18
Then you can't do a logistic regression. If you had the counts you could do something like a logistic regression, but using a binomial likelihood instead of bernoulli. Like this, you can try to predict some unbounded transformation of `y`, like the logit, i.e. you could assume that `logit(y)` is normally distributed around the predicted values of `logit(y)` and do linear regression. — LukasNeugebauer, Mar 14 '22 at 14:38
I am not using `Logistic regression`. I am trying to estimate something like https://en.wikipedia.org/wiki/Fractional_model. It may be the case that my set up of `stan model`in my original post is not correct. That is why I seek guidance on how can I setup `stan` model for this estimation — Bogaso, Mar 14 '22 at 14:50
So what I suggested is the first approach, which won't work because you do have `0` in `y`. Since I'm not sure what the likelihood is supposed to be in the second case, I can't help here unfortunately. But I can tell you why you get the error message above. `bernoulli_rng` returns integer values, so you need `int z[N]` instead of `vector[N] z`. — LukasNeugebauer, Mar 14 '22 at 15:00
In python, there is a package called `pymc3` to estimate this kind of model. But I am sure what is `R` equivalent for that package — Brian Smith, Mar 14 '22 at 15:07

score 1 · Answer 1 · answered Mar 13 '22 at 10:12

1

There might be a deeper issue than how to model saturated probabilities (probabilities that either exactly 0 or exactly 1).

Here is a plot of your data. Visually there isn't much of a relationship between x and y.

library("tidyverse")

as_tibble(dat) %>%
  ggplot(
    aes(x, y)
  ) +
  geom_point() +
  scale_y_continuous(
    limits = c(0, 1)
  )

^{Created on 2022-03-13 by the reprex package (v2.0.1)}

And things don't get better on the logit scale, ie, with the transformation z = logit(y).

library("tidyverse")

as_tibble(dat) %>%
  # The transformation maps the saturated probabilities to NA.
  mutate(
    z = qlogis(y)
  ) %>%
  # And ggplot drops the NAs.
  ggplot(
    aes(x, z)
  ) +
  geom_point()

^{Created on 2022-03-13 by the reprex package (v2.0.1)}

answered Mar 13 '22 at 10:12

dipetkov

3,380
1
11
19

Thanks for your comment. I have provided dummy data. My main question is how can the `stan` model be specified to fit `logit` model with this data where response variable is `fractional`. Note, typically in `logistic regression` we have sequence of `0, 1` as response variable. But in this case, I have fractions containing `0 and 1`. – Bogaso Mar 13 '22 at 12:02
Is your dummy data representative of your actual data? If not, how is the dummy data relevant to your question? – dipetkov Mar 13 '22 at 12:17
You could ask your question on the Stan forum: https://discourse.mc-stan.org/. It's more likely to get an answer from a Stan expert. – dipetkov Mar 13 '22 at 12:27
dummy data representative of my actual data, but qualitatively. Main characteristics of dummy data to be noted are 1) range of data for response variable is `0, 1` inclusive 2) range of data for explanatory variable is negative infinity to positive infinity – Bogaso Mar 13 '22 at 17:37

Estimating parameters using stan when the distribution for response variable in a regression is non-normal - Part 2

1 Answers1