1

This is an extension of my previous post here Estimating parameters using stan when the distribution for response variable in a regression is non-normal.

Let say I have below data

dat = list(y = c(0.00792354094929414, 0.00865300734292492, 0.0297400780486734, 
0.0196358416326437, 0.00239020640762042, 0.0258055591736283, 
0.17394835142698, 0.156463554455613, 0.329388185725557, 0.00764435088817635, 
0.0162081480398152, 0, 0.00157591399416963, 0.420025972703085, 
0.000122623651944455, 0.133061480234834, 0.565454216154227, 0.000281973481299731, 
0.000559715156383041, 0.0270686389659072, 0.918300537689865, 
0.00000782624683025728, 0.00732414341919458, 0, 0, 0, 0, 0, 0, 
0, 0.174071274611405, 0.0432109713717948, 0.0544400838264943, 
0, 0.0907049925221286, 0.616680102647887, 0, 0), x = c(23.8187587698947, 
15.9991138359515, 33.6495930512881, 28.555818797764, -52.2967967248258, 
-91.3835208788233, -73.9830692708321, -5.16901145289629, 29.8363012310241, 
10.6820057903939, 19.4868517164395, 15.4499668436458, -17.0441644773509, 
10.7025053739577, -8.6382953428539, -32.8892974839165, -15.8671863161348, 
-11.237248036145, -7.37978020066205, -3.33500586334862, -4.02629933182873, 
-20.2413384726948, -54.9094885578775, -48.041459120976, -52.3125732905322, 
-35.6269065970458, -62.0296155423529, -49.0825017152659, -73.0574478287598, 
-50.9409090127938, -63.4650928035253, -55.1263264283842, -52.2841103768755, 
-61.2275334149805, -74.2175990067417, -68.2961107804698, -76.6834643609286, 
-70.16769103228), N = 38)

I want to fit a logit model on above data based on fractional response variable. Therefore, below is my stan model code

model = "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}

transformed data {
  vector[N] z = bernoulli_rng(y);
}

parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}

transformed parameters {
    vector[N] mu;
    mu = alpha + beta * x;
}

model {
  sigma ~ normal(0, 1);
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  z ~ bernoulli(mu);
}

"
sampling(stan_model(model_code = model), data = dat, chains = 4, iter = 50000, refresh = 0)

With this I am getting below error

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Variable definition base type mismatch, variable declared as base type vector variable definition has base type int[ ] error in 'model93e37bdec88_3b62e3bb17b9f3ed9c717c98aa6ca9ac' at line 9, column 32
  -------------------------------------------------
     7: 
     8: transformed data {
     9:   vector[N] z = bernoulli_rng(y);
                                       ^
    10: }
  -------------------------------------------------

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'object' in selecting a method for function 'sampling': failed to parse Stan model '3b62e3bb17b9f3ed9c717c98aa6ca9ac' due to the above error.

Could you please help me to find the correct specification of the stan model?

Bogaso
  • 2,838
  • 3
  • 24
  • 54
  • What do you mean with fractional response variable? Let's say your y is .5, would that mean that half of the outcomes were one? If so do you have the original counts for that? – LukasNeugebauer Mar 14 '22 at 09:03
  • Yes `y` can be considered as ratio of something which is reported by some system. Unfortunately I don't have total number of events as our data recording system does not report this. So basically, we are trying to predict the ratio (i.e. success) based on exogenous variable `x`. Another example, `Loss ratio` in an event of default for some Debt instrument – Bogaso Mar 14 '22 at 14:18
  • Then you can't do a logistic regression. If you had the counts you could do something like a logistic regression, but using a binomial likelihood instead of bernoulli. Like this, you can try to predict some unbounded transformation of `y`, like the logit, i.e. you could assume that `logit(y)` is normally distributed around the predicted values of `logit(y)` and do linear regression. – LukasNeugebauer Mar 14 '22 at 14:38
  • I am not using `Logistic regression`. I am trying to estimate something like https://en.wikipedia.org/wiki/Fractional_model. It may be the case that my set up of `stan model`in my original post is not correct. That is why I seek guidance on how can I setup `stan` model for this estimation – Bogaso Mar 14 '22 at 14:50
  • So what I suggested is the first approach, which won't work because you do have `0` in `y`. Since I'm not sure what the likelihood is supposed to be in the second case, I can't help here unfortunately. But I can tell you why you get the error message above. `bernoulli_rng` returns integer values, so you need `int z[N]` instead of `vector[N] z`. – LukasNeugebauer Mar 14 '22 at 15:00
  • In python, there is a package called `pymc3` to estimate this kind of model. But I am sure what is `R` equivalent for that package – Brian Smith Mar 14 '22 at 15:07

1 Answers1

1

There might be a deeper issue than how to model saturated probabilities (probabilities that either exactly 0 or exactly 1).

Here is a plot of your data. Visually there isn't much of a relationship between x and y.

library("tidyverse")

as_tibble(dat) %>%
  ggplot(
    aes(x, y)
  ) +
  geom_point() +
  scale_y_continuous(
    limits = c(0, 1)
  )

Created on 2022-03-13 by the reprex package (v2.0.1)

And things don't get better on the logit scale, ie, with the transformation z = logit(y).

library("tidyverse")

as_tibble(dat) %>%
  # The transformation maps the saturated probabilities to NA.
  mutate(
    z = qlogis(y)
  ) %>%
  # And ggplot drops the NAs.
  ggplot(
    aes(x, z)
  ) +
  geom_point()

Created on 2022-03-13 by the reprex package (v2.0.1)

dipetkov
  • 3,380
  • 1
  • 11
  • 19
  • Thanks for your comment. I have provided dummy data. My main question is how can the `stan` model be specified to fit `logit` model with this data where response variable is `fractional`. Note, typically in `logistic regression` we have sequence of `0, 1` as response variable. But in this case, I have fractions containing `0 and 1`. – Bogaso Mar 13 '22 at 12:02
  • Is your dummy data representative of your actual data? If not, how is the dummy data relevant to your question? – dipetkov Mar 13 '22 at 12:17
  • You could ask your question on the Stan forum: https://discourse.mc-stan.org/. It's more likely to get an answer from a Stan expert. – dipetkov Mar 13 '22 at 12:27
  • dummy data representative of my actual data, but qualitatively. Main characteristics of dummy data to be noted are 1) range of data for response variable is `0, 1` inclusive 2) range of data for explanatory variable is negative infinity to positive infinity – Bogaso Mar 13 '22 at 17:37