
Does anyone know how to set the alpha parameter when doing Naive Bayes classification?

For example, I first used bag of words to build the feature matrix, where each cell is a word count, and then used tf (term frequency) to normalize the matrix.

When I used Naive Bayes to build the classifier model, I chose multinomial NB (which I think is correct here, rather than Bernoulli or Gaussian). The default alpha is 1.0 (the documentation says it is Laplace smoothing, which I know nothing about).

The result is really bad: only 21% recall on the positive (target) class. But when I set alpha = 0.0001 (picked at random), recall jumps to 95%.

Also, I checked the multinomial NB formula and I think the problem is alpha: if I use raw word counts as features, alpha = 1 barely affects the results, but since tf values lie between 0 and 1, alpha = 1 strongly distorts the results of the formula.

I also tested without tf, using only bag-of-words counts, and the recall is 95% as well. So does anyone know how to set the alpha value? I have to use tf as the feature matrix.
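
For reference, here is roughly what I am doing (just a sketch; `docs` and `labels` stand in for my data):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB

    # docs: list of raw documents, labels: class labels (placeholder names)
    counts = CountVectorizer().fit_transform(docs)                # bag-of-words counts
    tf = TfidfTransformer(use_idf=False).fit_transform(counts)    # tf normalization, no idf
    clf = MultinomialNB(alpha=1.0).fit(tf, labels)                # default alpha; 0.0001 behaves very differently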

Thanks.

HAO CHEN
  • Can you share the precision obtained when the recall becomes 95%? – shanmuga Nov 20 '15 at 16:22
  • Did you check out http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html ? – James Tobin Nov 20 '15 at 16:59
  • @shanmuga, I ran an experiment on another dataset. Using tf and alpha = 1.0, the recall of 'positive' is 0.11 and the precision of 'positive' is 1.00 (weird?). After I set alpha = 0.0001 (with tf), the recall and precision of 'positive' are both 1.00. After I remove tf and only use word counts as features with alpha = 1.0, the recall of positive is 0.98 and the precision is 0.97. The dataset has 4243 negative instances and 900 positive instances. – HAO CHEN Nov 20 '15 at 17:39
  • @JamesTobin, yes, I checked that page; it says that in practice fractional counts such as tf-idf may also work, but there are no other references on how to set alpha. – HAO CHEN Nov 20 '15 at 17:41

3 Answers


In Multinomial Naive Bayes, the alpha parameter is what is known as a hyperparameter; i.e. a parameter that controls the form of the model itself. In most cases, the best way to determine optimal values for hyperparameters is through a grid search over possible parameter values, using cross validation to evaluate the performance of the model on your data at each value. See the scikit-learn documentation on grid search and cross-validation for details on how to do this.
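
For example, a minimal sketch of such a search might look like the following (the alpha grid and the `X`, `y` names are placeholders, and the recall scorer is chosen just to mirror the question):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.naive_bayes import MultinomialNB

    # X: tf feature matrix, y: class labels (placeholder names)
    param_grid = {'alpha': np.logspace(-4, 1, 6)}   # 1e-4 ... 10, an assumed grid
    search = GridSearchCV(MultinomialNB(), param_grid, scoring='recall', cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)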

jakevdp
  • Thanks, that's a good way to tune alpha @jakevdp. Could you please say a little more about the difference between a parameter and a hyperparameter? Cheers – HAO CHEN Nov 21 '15 at 13:25
  • A hyperparameter is a parameter that defines the model and must be chosen before the model sees any data (e.g. ``alpha`` here is set at initialization time). A normal model parameter, on the other hand, is free floating and set by fitting the model to data. One useful way to think about it is that hyperparameters *define the model*: in some sense ``MultinomialNB`` with ``alpha=1`` and ``MultinomialNB`` with ``alpha=2`` should be considered fundamentally different models. – jakevdp Nov 21 '15 at 14:30
  • To test results for different values of the hyperparameter alpha, what values should we be considering? For example, for k in KNN we can take values like [3, 15, 25, 51, 101]. – Dipen Gajjar Dec 21 '19 at 05:04

Why is alpha used?

To classify a query point in NB (considering binary classification), we compare P(Y=1|W) and P(Y=0|W), where W is the vector of words, W = [w1, w2, w3, ..., wd], and d is the number of features.

So at training time we estimate the likelihoods and compute

P(w1|Y=1) * P(w2|Y=1) * ... * P(wd|Y=1) * P(Y=1)

The same is done for Y=0.

For the Naive Bayes formula, refer to https://en.wikipedia.org/wiki/Naive_Bayes_classifier.

Now at test time, suppose you encounter a word that is not present in the training set. Its estimated probability of occurring in a class is then zero, which makes the whole product zero, and that is not good.

Consider a word W* that is not present in the training set:

P(W*|Y=1) = P(W*, Y=1) / P(Y=1)
          = (number of training points where W* is present and Y=1) / (number of training points where Y=1)
          = 0 / (number of training points where Y=1)
          = 0

To get rid of this problem we apply Laplace smoothing: we add alpha to the numerator and to the denominator.

     P(W*|Y=1) = (0 + alpha) / (number of training points where Y=1 + alpha * k)

where k is the number of possible values of the feature (scikit-learn's MultinomialNB uses the vocabulary size here).
Smoothing also helps in another way: in the real world some words occur very few times and others occur much more often. Put differently, in the formula above (P(W|Y=1) = P(W,Y=1)/P(Y=1)), if the numerator and denominator counts are small, the estimate is easily influenced by outliers or noise. Here too alpha helps, since increasing alpha moves the likelihood probabilities toward a uniform distribution.
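
As a toy illustration (the counts below are made up), the smoothed estimate is never exactly zero and moves toward the uniform value 1/k as alpha grows:

    def smoothed(count_word_and_class, count_class, alpha, k):
        # Laplace-smoothed estimate; k = number of possible values of the feature
        return (count_word_and_class + alpha) / (count_class + alpha * k)

    # W* never seen together with Y=1 in training (toy numbers)
    for alpha in (0.0001, 1.0, 100.0):
        print(alpha, smoothed(0, 100, alpha, k=2))
    # tiny alpha: almost 0; as alpha grows the estimate moves toward 1/k = 0.5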

So alpha is a hyperparameter and you have to tune it using techniques like grid search (as mentioned by jakevdp) or random search (https://towardsdatascience.com/hyperparameter-tuning-c5619e7e6624).

Gopu_Tunas

It is better to use GridSearchCV, or RandomizedSearchCV if you are on a low-spec machine, to automate the tuning of your hyperparameter, which is alpha in the case of MultinomialNB.

Do it like this:

 from sklearn.naive_bayes import MultinomialNB
 from sklearn.model_selection import GridSearchCV

 model = MultinomialNB()
 param = {'alpha': [0.00001, 0.0001, 0.001, 0.1, 1, 10, 100, 1000]}

 clf = GridSearchCV(model, param, scoring='roc_auc', cv=10, return_train_score=True)
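
After fitting on your training data (here X_train and y_train are assumed names for your data), the best alpha can be read off the search object:

 clf.fit(X_train, y_train)   # X_train, y_train are placeholder names
 print(clf.best_params_)     # best alpha found by the search
 print(clf.best_estimator_)  # MultinomialNB refit with that alpha (refit=True by default)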