I have a dataset of body measurements for birds, and I'm using the lda function from the MASS package to quantify sexual dimorphism. Ultimately, I want an equation and a critical score that can be used in the field (without access to a computer or R) to determine whether the bird in hand is male or female. Our dataset contains more males than females. I don't know exactly why, but for now I'm assuming there is a real reason males are captured more often than females, although with only 34 birds the imbalance might not be significant.
I know how to extract the equation (following the instructions halfway down this page: https://stats.stackexchange.com/questions/157772/how-to-find-the-line), but there is some overlap in the D-scores, where predict.lda seems to go either way. I expected the critical D-score to be 0, but it's not.
I would like to know how I can find: 1) the D-score beyond which the model will always classify the bird as female (or male), and 2) the extent of the overlap.
Mock code (with the real data there is more overlap):
library(MASS)  # provides lda()
set.seed(42)
train <- data.frame(sex = c(rep("F", 35), rep("M", 65)),
                    A = c(rnorm(35, 20, 2.5), rnorm(65, 15, 2.5)),
                    B = c(rnorm(35, 6, 0.2), rnorm(65, 5.5, 0.2)),
                    C = c(rnorm(35, 250, 5), rnorm(65, 240, 5)),
                    D = c(rnorm(35, 450, 25), rnorm(65, 350, 25)))
mod <- lda(sex ~ ., data = train)
mod
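gm <- mod$prior %*% mod$means  # prior-weighted grand mean of each measurement; used to get the equation
const <- drop(gm %*% mod$scaling)
# the equation is then:
# D-score = mod$scaling[1] * A + mod$scaling[2] * B + mod$scaling[3] * C + mod$scaling[4] * D - const
# (on the right-hand side, D is the measurement column D, not the D-score)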
# test measurements are drawn around the grand mean, regardless of the sex label
test <- data.frame(sex = c(rep("F", 350), rep("M", 650)),
                   A = rnorm(1000, gm[1], 2.5),
                   B = rnorm(1000, gm[2], 0.2),
                   C = rnorm(1000, gm[3], 5),
                   D = rnorm(1000, gm[4], 25))
p <- predict(mod, test)  # predict once and reuse the result
pred <- data.frame(p$x, class = p$class)
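To make the question concrete, here is a minimal sketch of the kind of check I have in mind. It assumes the same mock training data as above (repeated so the snippet runs on its own), recomputes the D-score by hand from the equation, compares it with the score returned by predict, and then tabulates the range of scores for each predicted class. The names vars, by_hand, from_predict, hand_matches and score_ranges are only illustrative.

```r
library(MASS)  # provides lda()

# Same mock training data as above (repeated so this sketch is self-contained)
set.seed(42)
train <- data.frame(sex = c(rep("F", 35), rep("M", 65)),
                    A = c(rnorm(35, 20, 2.5), rnorm(65, 15, 2.5)),
                    B = c(rnorm(35, 6, 0.2), rnorm(65, 5.5, 0.2)),
                    C = c(rnorm(35, 250, 5), rnorm(65, 240, 5)),
                    D = c(rnorm(35, 450, 25), rnorm(65, 350, 25)))
mod <- lda(sex ~ ., data = train)

# Pieces of the hand equation: coefficients in mod$scaling, constant from the
# prior-weighted grand mean
gm <- mod$prior %*% mod$means
const <- drop(gm %*% mod$scaling)

# D-score computed by hand, taking the columns in the order lda() used them
vars <- rownames(mod$scaling)
by_hand <- as.matrix(train[, vars]) %*% mod$scaling - const

# D-score and predicted class as returned by predict()
p <- predict(mod, train)
from_predict <- p$x

# Does the hand equation reproduce the scores from predict()?
hand_matches <- isTRUE(all.equal(as.numeric(by_hand), as.numeric(from_predict)))
print(hand_matches)

# Smallest and largest D-score within each predicted class
score_ranges <- tapply(as.numeric(from_predict), p$class, range)
print(score_ranges)
```

Comparing the two ranges in score_ranges is how I have been trying to judge where one predicted class ends and the other begins.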
I've Googled a lot and looked at several Stack Exchange and Stack Overflow questions, but I can't figure it out.