R: Determine the threshold that maximally separates two groups based on a continuous variable?

Question

Say I have 200 subjects, 100 in group A and 100 in group B, and for each I measure some continuous parameter.

require(ggplot2)
set.seed(100)

value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep('A', 100), rep('B', 100))

data <- data.frame(value, group)

ggplot(data = data, aes(x = value)) +
  geom_bar(aes(color = group))

I would like to determine the value (Threshold? Breakpoint?) that maximizes separation and minimizes misclassification between the groups. Does such a function exist in R?

I've tried searching along the lines of "r breakpoint maximal separation between groups," and "r threshold minimize misclassification," but my google-foo seems to be off today.

EDIT:

Responding to @Thomas's comment, I have tried to fit the data using logistic regression and then solve for the threshold, but I haven't gotten very far.

lr <- glm(group~value)
coef(lr)
# (Intercept)       value 
# 1.1857435  -0.0911762

So Bo = 1.1857435 and B1 = -0.0911762

From Wikipedia, I see that F(x) = 1/(1+e^-(Bo + B1x)), and solving for x:

x = (ln(F(x) / (1 - F(x))) - Bo)/B1

But trying this in R, I get an obviously incorrect answer:

(log(0.5/(1 - 0.5)) - 1.1857435)/-0.0911762 # 13.00497

I know of no such function, still if you know how to calculate a "Breakpoint" or "Threshold" you should be able to write a function yourself. — CCurtis, Apr 10 '14 at 02:47
I'm not sure I totally understand your question (specifically, do you re-assign groups based on the calculated threshold?), but I wonder if you've looked at clustering methods, like the kmeans function in the stats library? — Jesse Anderson, Apr 10 '14 at 04:54
I think you're looking for a binary classifier. Logistic regression is one approach. Estimate the model and solve for the threshold value. Your data puts that somewhere in the range of 7.7. — Thomas, Apr 10 '14 at 09:33
@blindJesse: No, I haven't--that looks like it could be very helpful when I will need to do this sort of thing for more than a single variable. Thanks! — sudo make install, Apr 10 '14 at 10:39
@Thomas: Thanks, I think this is the closest to what I need for this example. I have edited my question trying this approach, but didn't get anything close to your answer. Do you see where I am going wrong? — sudo make install, Apr 10 '14 at 11:05

score 3 · Answer 1 · answered Apr 10 '14 at 11:14

A simple approach is to write a function that calculates the accuracy given a threshold:

accuracy = Vectorize(function(th) mean(c("A", "B")[(value > th) + 1] == group))

Then find the maximum using optimize:

optimize(accuracy, c(min(value), max(value)), maximum=TRUE)
# $maximum
# [1] 8.050888
# 
# $objective
# [1] 0.86

score 1 · Accepted Answer · answered Apr 10 '14 at 14:22

I've gotten the answer I need thanks to help from @Thomas and @BenBolker.

Summary

The problem with my attempt at solving it through logistic regression was that I hadn't specified family = binomial
The dose.p() function in MASS will do the work for me given a glm fit

Code

# Include libraries
require(ggplot2)
require(MASS)

# Set seed
set.seed(100)

# Put together some dummy data
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep(0, 100), rep(1, 100))
data <- data.frame(value, group)

# Plot the distribution -- visually
# The answer appears to be b/t 7 and 8
ggplot(data = data, aes(x = value)) +
  geom_bar(aes(color = group))

# Fit a glm model, specifying the binomial distribution
my.glm <- glm(group~value, data = data, family = binomial)
b0 <- coef(my.glm)[[1]]
b1 <- coef(my.glm)[[2]]

# See what the probability function looks like
lr <- function(x, b0, b1) {
  prob <- 1 / (1 + exp(-1*(b0 + b1*x)))
  return(prob)                  
}

# The line appears to cross 0.5 just above 7.5
x <- -0:12
y <- lr(x, b0, b1)
lr.val <- data.frame(x, y)
ggplot(lr.val, aes(x = x, y = y)) +
  geom_line()

# The inverse of this function computes the threshold for a given probability
inv.lr <- function(p, b0, b1) {
  x <- (log(p / (1 - p)) - b0)/b1
  return(x)
}

# With the betas from this function, we get 7.686814
inv.lr(0.5, b0, b1)

# Or, feeding the glm model into dose.p from MASS, we get the same answer
dose.p(my.glm, p = 0.5)

Thanks, everyone, for your help!

R: Determine the threshold that maximally separates two groups based on a continuous variable?

2 Answers2

Linked