I have the following dataset, and I am trying to find, for example, the conditional probability that a person x is Sex=f, Weight=l, Height=t and Long Hair=y.

(image of the dataset)

When I calculate this by hand, the probability is 0.0333. But when I try to predict it from R, I get a different number.

library(naivebayes)
train <- read.csv2("c:/....csv")

classifier <- naive_bayes(Sex ~ ., train)
classifier

test <- data.frame(Height = c("t"), Weight = c("l"), Long.Hair = c("y"))
test$Height <- factor(test$Height, levels = c("m", "s", "t"))
test$Weight <- factor(test$Weight, levels = c("n", "l", "h"))
test$Long.Hair <- factor(test$Long.Hair, levels = c("y", "n"))
test
#>   Height Weight Long.Hair
#> 1      t      l         y
prediction <- predict(classifier, test, type = "prob")
prediction
#>              f          m
#> [1,] 0.9881423 0.01185771

Is there a way to get R to return the value that I calculated by hand?

Ma Ov
  • What does your calculation by hand look like? We also know nothing about your model or how you calculated it, which makes it impossible to help you. – deschen May 16 '22 at 20:21
  • Yes, I'm really sorry. I thought I had pasted it; I have just edited the post. The calculation goes like this: P(Height=t|Sex=f)*P(Weight=l|Sex=f)*P(Long.Hair=y|Sex=f)*P(Sex=f) = 1/6 * 3/6 * 4/6 * 6/10 = 1/30 – Ma Ov May 16 '22 at 20:28

1 Answer

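Your calculation by hand isn't the quantity that `predict` reports. The 1/30 you computed is P(Height=t|Sex=f) * P(Weight=l|Sex=f) * P(Long.Hair=y|Sex=f) * P(Sex=f), which is only the unnormalised numerator of Bayes' rule for Sex=f; the prediction is that numerator divided by the sum of the numerators for both sexes. In the sample data, the only people with long hair are women, so the male numerator is 0 and the conditional probability of being female given long hair is 1 if you work it out by hand.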
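As a quick check of the long-hair claim, here is a minimal base-R sketch that simply counts the rows. It uses the `Long.Hair` and `Sex` columns from the reproducible data at the end of this answer; the names `hair`, `sex`, `counts` and `p_f_given_long` are only illustrative.

```r
# Long.Hair and Sex columns copied from the question's data
hair <- c("n", "y", "n", "y", "y", "n", "n", "n", "y", "n")
sex  <- c("m", "f", "m", "f", "f", "f", "m", "f", "f", "m")

# Cross-tabulate hair length against sex
counts <- table(Long.Hair = hair, Sex = sex)
counts

# P(Sex = f | Long.Hair = y): share of females among the long-haired rows
p_f_given_long <- counts["y", "f"] / sum(counts["y", ])
p_f_given_long
```

The `y` row of the table has no males in it, which is why the ratio cannot be anything other than 1.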

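The only reason the prediction gives you a probability of (slightly) less than 1 is that the `predict` method does not leave zero probabilities at zero: by default it replaces them with a small value, which has much the same effect as a small amount of Laplace smoothing, as you will see in the source code. It always does this by default, but you can effectively turn it off by setting `laplace` to a tiny non-zero number when fitting the model, so that no estimated probability is exactly zero: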

classifier <- naive_bayes(Sex ~ .,train, laplace = .Machine$double.eps)
prediction <- predict(classifier, test ,type="prob")
prediction
#>      f            m
#> [1,] 1 6.661338e-16

I suppose we could call this a very naive Bayes model.


Data from question in reproducible format

train <- data.frame(
  Height    = c("m", "s", "t", "s", "t", "s", "s", "m", "m", "t"),
  Weight    = c("n", "l", "h", "n", "n", "l", "h", "n", "l", "n"),
  Long.Hair = c("n", "y", "n", "y", "y", "n", "n", "n", "y", "n"),
  Sex       = c("m", "f", "m", "f", "f", "f", "m", "f", "f", "m"))
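
To relate the 1/30 from the comments to the probability that `predict` returns, here is a rough base-R sketch of the same arithmetic. It applies no smoothing or thresholding at all, and it is not the `naivebayes` package's own code; `score` is a hypothetical helper written only for this illustration.

```r
# Same data as above, repeated so the snippet runs on its own
train <- data.frame(
  Height    = c("m", "s", "t", "s", "t", "s", "s", "m", "m", "t"),
  Weight    = c("n", "l", "h", "n", "n", "l", "h", "n", "l", "n"),
  Long.Hair = c("n", "y", "n", "y", "y", "n", "n", "n", "y", "n"),
  Sex       = c("m", "f", "m", "f", "f", "f", "m", "f", "f", "m"))

# Unnormalised naive Bayes score for one class s:
# P(Sex = s) * P(Height = t | s) * P(Weight = l | s) * P(Long.Hair = y | s)
score <- function(s) {
  grp <- train[train$Sex == s, ]
  mean(train$Sex == s) *
    mean(grp$Height == "t") *
    mean(grp$Weight == "l") *
    mean(grp$Long.Hair == "y")
}

scores <- c(f = score("f"), m = score("m"))
scores   # the f entry is the 1/30 from the hand calculation; the m entry is 0

# Dividing by the total turns the scores into P(Sex | Height, Weight, Long.Hair)
posterior <- scores / sum(scores)
posterior
```

The hand calculation stops at `scores["f"]`; the classifier goes one step further and normalises, which is where the value of 1 for `f` comes from when nothing is smoothed.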
Allan Cameron
  • Thank you for your answer, but I don't think that the calculation is incorrect. The probability I am computing here is the likelihood of having long hair given that the sex is female (not the other way round). – Ma Ov May 17 '22 at 15:47
  • @MaOv Your classifier model has sex on the left hand side, so it predicts sex based on weight, hair and height. Your `test` data frame specifies the hair, weight and height, and you are asking `predict` to tell you what the probability of each Sex is given these three independent variables. If this is not what you are trying to do, then you need to respecify your model. – Allan Cameron May 17 '22 at 15:54
  • One can specify the threshold in the predict function: `classifier2 <- naive_bayes(Sex ~ .,train, laplace = 0); prediction2 <- predict(classifier2, test , type = "prob", threshold = .Machine$double.xmin, eps = 0); prediction2` The default value is, indeed, set to `0.001`. It is the same default as in `e1071:::predict.naiveBayes()`, so that there are no discrepancies between those two functions and packages in general. The threshold in `predict.bernoulli_naive_bayes()` is hardcoded to 0.001, but you can control it using `laplace` exactly as you suggested. – Michal Majka Jun 01 '22 at 14:20