
I have a data set with a binary variable (Yes/No) and a continuous variable (X). I'm trying to build a model that classifies Yes/No from X.

From my data set, when X = 0.5, 48% of the observations are Yes. However, I know the true probability of Yes should be 50% when X = 0.5. When I fit a logistic regression, the predicted probability at X = 0.5 is not 0.5.

How can I correct this? I guess all probabilities would be slightly underestimated if the curve does not pass through the correct point.

Is it correct to just add a bunch of observations to my sample to adjust the proportion?

It does not have to be logistic regression; LDA, QDA, etc. are also of interest.

I have searched Stack Overflow, but only found topics regarding linear regression.


2 Answers


I believe that in R (assuming you're using glm from base R) you just need

glm(y ~ I(x - 0.5) - 1, data = your_data, family = binomial)

The I(x - 0.5) term recenters the covariate at 0.5, and the -1 suppresses the intercept, so the linear predictor is forced to be 0 at x = 0.5, which corresponds to a probability of 0.5.

For example:

set.seed(101)
## simulate a covariate on [0.5, 1] and a binary response
dd <- data.frame(x = runif(100, 0.5, 1), y = rbinom(100, size = 1, prob = 0.7))
m1 <- glm(y ~ I(x - 0.5) - 1, data = dd, family = binomial)
predict(m1, type = "response", newdata = data.frame(x = 0.5))  ## 0.5
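
The constraint is structural rather than a feature of this particular sample: refitting on a second simulated data set (a hypothetical dd2, sketched below) still returns exactly 0.5 at x = 0.5, because the linear predictor is zero there no matter what coefficient is estimated.

## hypothetical second data set with a different success probability
dd2 <- data.frame(x = runif(100, 0.5, 1), y = rbinom(100, size = 1, prob = 0.6))
m2 <- glm(y ~ I(x - 0.5) - 1, data = dd2, family = binomial)
predict(m2, type = "response", newdata = data.frame(x = 0.5))  ## exactly 0.5, by construction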
– Ben Bolker
  • I tried this, but it gave P[Yes] = 0.5 when X was about 0.55. Also, X was higher than P[Yes] at the beginning and "switched" later on. I know that P[Yes] should always be higher than X, except when X = 0.5. – MLEN Dec 29 '16 at 18:27
  • Could this be because X only takes values between 0.5 and 1? I will try to get part of the data and my code tomorrow. – MLEN Dec 29 '16 at 18:30
  • Seems weird. I don't see what the range of `X` would have to do with it. A [mcve] would definitely be useful. – Ben Bolker Dec 29 '16 at 18:45
  • Forgot the family argument. Thanks for your solution – MLEN Jan 06 '17 at 16:47
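
As the last comment notes, the family argument matters: without family = binomial, glm() defaults to a Gaussian family, so the call fits ordinary least squares on the 0/1 responses and the suppressed intercept pins the prediction at 0 rather than at a probability of 0.5. A minimal sketch, reusing the dd data from above:

## without family = binomial, glm() defaults to gaussian (linear regression)
m_lin <- glm(y ~ I(x - 0.5) - 1, data = dd)
predict(m_lin, newdata = data.frame(x = 0.5))  ## 0: a linear prediction, not a probability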

The OP wrote:

How can I correct this? I guess all probabilities would be slightly underestimated if the curve does not pass through the correct point.

This is not true. It is perfectly possible to underestimate some values (like the intercept) and overestimate others.

An example following your situation:

The true probabilities:

set.seed(444)

true_prob <- function(x) {
  # logit-scale linear predictor
  lp <- x - 0.5
  # true probabilities via the inverse logit
  p <- 1 / (1 + exp(-lp))
  p
}

true_prob(x = 0.5)
[1] 0.5

But if you simulate data and fit a model, the intercept could be underestimated and other values overestimated:

n <- 100
# simulated predictor
x <- runif(n, 0, 1)
probs <- true_prob(x)

# simulated binary response
y <- as.numeric(runif(n) < probs)

Now fit a model and compare true probabilities vs fitted ones:

> m <- glm(y ~ x, family = binomial)
> true_prob(0.5)
[1] 0.5
> predict(m, newdata = data.frame(x = 0.5), type = "response")
       1 
0.479328 
> true_prob(2)
[1] 0.8175745
> predict(m, newdata = data.frame(x = 2), type = "response")
        1 
0.8665702 

So in this example, the model underestimates the probability at x = 0.5 and overestimates it at x = 2.
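
If the goal is to force the fitted curve through P(Yes) = 0.5 at x = 0.5, the recentering trick from the other answer applies directly to this simulated data; a sketch, reusing the x and y vectors above:

## recenter at 0.5 and drop the intercept, as in the other answer
m_constrained <- glm(y ~ I(x - 0.5) - 1, family = binomial)
predict(m_constrained, newdata = data.frame(x = 0.5), type = "response")  ## exactly 0.5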

– davechilders