I'm using the predict.hurdle function from the pscl package to estimate the probabilities of observing 0, 1, 2, ..., N events in a data set. Using the example in ?predict.hurdle:
data("bioChemists", package = "pscl")
fm_hp1 <- hurdle(art ~ ., data = bioChemists)
summary(fm_hp1)
head(predict(fm_hp1, newdata = bioChemists, type = "prob"))
# returns a matrix of probabilities too large to show here
Each row of this matrix is an observation and each column is the probability of that count, 0-19 in this case.
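For reference, the dimensions and column names confirm this (a quick sketch; the maximum count of 19 reflects my run of the example and may differ):
p <- predict(fm_hp1, newdata = bioChemists, type = "prob")
dim(p)       # one row per observation, one column per count
colnames(p)  # counts 0 through max(bioChemists$art), i.e. 0-19 here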
summary(rowSums(predict(fm_hp1, newdata = bioChemists, type = "prob")))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9998 1.0000 1.0000 1.0000 1.0000 1.0000
Some rows don't sum to 1 as they should, but they're close, so maybe it isn't an issue...
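For what it's worth, the rows that fall short can be picked out directly (a sketch; the 1e-4 tolerance is an arbitrary choice):
p <- predict(fm_hp1, newdata = bioChemists, type = "prob")
head(which(rowSums(p) < 1 - 1e-4))  # observations whose probabilities sum noticeably below 1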
However, I need to calibrate the intercept terms. "Calibration" in my industry is an accepted way of saying "change the estimated parameters". Yes, I know there are many reasons why this is not a good idea statistically (it intentionally biases the estimates). Even so, I would still expect the code to work and the predictions to obey the rules of probability.
# Change the count model intercept
fm_hp1$coefficients$count["(Intercept)"] <- 3
summary(rowSums(predict(fm_hp1, newdata = bioChemists, type = "prob")))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.001521 0.434300 0.647400 0.602000 0.818400 0.983900
Now we see some major problems with the resultant probabilities.
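One diagnostic I can run (a sketch using the at argument documented in ?predict.hurdle, which sets the counts at which probabilities are evaluated and defaults to 0 through the maximum observed count; the upper limit of 200 below is arbitrary): if the larger intercept just pushes probability mass beyond count 19, widening the grid should recover most of it.
wide.p <- predict(fm_hp1, newdata = bioChemists, type = "prob", at = 0:200)
summary(rowSums(wide.p))  # much closer to 1 if the shortfall is only truncation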
I'm tempted to simply re-normalize each row of these probabilities so it sums to 1, via:
old.p <- predict(fm_hp1, newdata = bioChemists, type = "prob")
new.p <- t(apply(X = old.p, MARGIN = 1, FUN = function(x) x/sum(x)))
summary(rowSums(new.p))
But I worry that whatever causes the probabilities not to sum to 1 in the first place also makes this re-normalization inappropriate.
Is my worry founded? Is there another element of fm_hp1 that I need to modify in order to change the intercept term but still get correct probability predictions?