
I have used sklearn to create a basic multiclass naive Bayes text classifier. I have 3 classes and around 800 rows of data: Class A has 564 rows, Class B has 159, and Class C has 82. As you can see, the data is unbalanced across the classes, and I understand this can affect accuracy, because Bayes' theorem uses the probability of a word occurring in a text given that the text belongs to a specific class in order to compute the probability of the text belonging to that class given that the word occurs in it. This was my first attempt, and I plan to get more data; as you might imagine, Class A was the easiest to collect while Class C was the hardest.
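My setup is roughly along these lines (simplified, with placeholder data):

```python
# Simplified sketch of the classifier described above (placeholder texts).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["example text one", "example text two", "example text three"]
labels = ["A", "B", "C"]  # real data: 564 x A, 159 x B, 82 x C

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["another example text"]))
```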

I am, however, confused about how I should approach creating and improving this model, and about how balanced the class datasets should be. If I were to get perfectly proportionate data for each class (say 1000 rows per class), or undersample the data I already have, wouldn't this affect accuracy as well? In reality, the occurrence of Class C is definitely less likely than A and B, and the real-world proportions of the classes are roughly similar (although varying from person to person) to the probability of a text belonging to each class. Since Bayes' theorem also uses the prior probability of a class when calculating the probability of a text belonging to that class given that it contains a word, wouldn't a balanced dataset with an equal number of rows per class decrease accuracy? The prior would then effectively be constant and the same for all classes, so the class frequencies seen in production would no longer be taken into account. Although making all classes equal does remove the per-word bias caused by the unbalanced dataset.
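To make this concrete, here is the prior arithmetic I mean, as a toy calculation. The class counts are my real ones, but the likelihood values P(text | class) are invented:

```python
# Toy Bayes calculation: the same (invented) likelihoods scored under my
# real imbalanced priors vs. artificially balanced priors.
counts = {"A": 564, "B": 159, "C": 82}
total = sum(counts.values())
likelihood = {"A": 0.010, "B": 0.012, "C": 0.030}  # P(text | class), made up

for name, prior in [
    ("imbalanced", {c: n / total for c, n in counts.items()}),
    ("balanced", {c: 1 / 3 for c in counts}),
]:
    unnorm = {c: likelihood[c] * prior[c] for c in counts}
    z = sum(unnorm.values())
    print(name, {c: round(v / z, 3) for c, v in unnorm.items()})
# imbalanced priors -> A wins (~0.56); balanced priors -> C wins (~0.58)
```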

So I am unsure how to approach building this model efficiently. I feel that with unbalanced data, words that are common in Class C are perceived by the model as more likely to occur in an email of Class A, when in reality they are probably more common in C; the skewed data creates this bias. On the other hand, balancing the classes ignores the actual probability of a piece of text belonging to each class, although I have no way of calculating a universal class probability that is accurate for all individuals (does that mean balancing the classes has less of a negative effect on accuracy?). Any guidance is greatly appreciated; I am quite new to this.

  • Please notice that SO is about *specific coding* questions; non-coding questions about ML & data science methodology are off-topic here, and should be posted at [Data Science SE](https://datascience.stackexchange.com/help/on-topic) instead. – desertnaut Aug 27 '20 at 09:37

2 Answers


TL;DR: Don't undersample/oversample; use text augmentation instead.

Undersampling/oversampling can be helpful in certain situations, but certainly not in your case with only 800 rows of data. Undersampling would make you lose too much valuable data, and oversampling would produce unreliable results. A much better solution would be to augment your data.

There are libraries like Snorkel that allow you to augment textual data by swapping or replacing adjectives, verbs, nouns, etc. with synonyms in a probabilistic way, which can greatly increase your data size. I highly recommend taking a look at it, as it's often used both in academia and in industry.
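For illustration, here is a rough sketch of what synonym-replacement augmentation with Snorkel can look like. It assumes nltk's WordNet corpus is available (`nltk.download("wordnet")`); the actual transformation function is up to you:

```python
# Rough sketch: Snorkel transformation function that swaps one random
# word for a WordNet synonym, applied to a toy DataFrame.
import random

import pandas as pd
from nltk.corpus import wordnet
from snorkel.augmentation import PandasTFApplier, RandomPolicy, transformation_function

@transformation_function()
def swap_synonym(x):
    """Replace one random word in x.text with a WordNet synonym, if any."""
    words = x.text.split()
    if not words:
        return None
    i = random.randrange(len(words))
    lemmas = {l.name().replace("_", " ") for s in wordnet.synsets(words[i]) for l in s.lemmas()}
    lemmas.discard(words[i])
    if not lemmas:
        return None  # no synonym found; Snorkel skips this augmentation
    words[i] = random.choice(sorted(lemmas))
    x.text = " ".join(words)
    return x

df = pd.DataFrame({"text": ["the quick brown fox"], "label": ["A"]})
policy = RandomPolicy(1, sequence_length=1, n_per_original=2, keep_original=True)
df_augmented = PandasTFApplier([swap_synonym], policy).apply(df)  # labels carry over
```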

Regarding your concern about balancing your dataset, there are a few factors that can affect the outcome: the size of your dataset and overfitting, how distinctive the features are at classifying the samples, the presence of outliers, etc. Just because you have 10k samples of cancer patients and 5k of healthy people doesn't necessarily mean your predictions will come out in a 2:1 ratio on a real-life dataset. That's because the model isn't necessarily memorizing the distribution of each class, but rather how the features map to the prediction of each class.

So in your example, if each class has distinctive words that often distinguish it from the others, you'd want to provide samples containing those words in the other classes as well, to make sure you're not overfitting each class on those words.

Hope this helps!

  • I am definitely going to be getting more than 800 rows, hopefully a couple thousand more. The data augmentation is interesting: so essentially classified text would have some of its words augmented, e.g. replaced with synonyms, and that augmented text would be given the same class? – ssal Aug 27 '20 at 06:40
  • I still don't get what to do about the unbalanced data, because while the model does learn "how the features result in the prediction of the class" like you stated, part of Bayes' theorem is the basic probability of said class occurring, without consideration of the features. That is multiplied by the conditional probability: given that the text is the class in question, the probability that the word is present. The product is divided by the basic probability of the word occurring in all texts. So the distribution does actually impact the probability & classification. – ssal Aug 27 '20 at 06:44
  • To follow your example with the cancer/healthy ratio of 2:1: if I wanted to figure out whether a piece of text was of the cancer class given the words it contains, part of the calculation would be the prior probability of the cancer class, which is 10/15, and that goes into the product with the conditional probability of each feature/word itself. If I had a totally different distribution, say 3k cancer data points and 20k healthy, then the prior would be 3/23. This can definitely make a difference. – ssal Aug 27 '20 at 06:48
  • Correct, the prior probability of the classes in the training set will indeed affect an NB-based model (sorry, I had generalized my answer to other models!). The type of model you use is one of the factors to consider when balancing your dataset (e.g. random forest handles class imbalance better than, say, NB). If you know the true distribution of the classes, augment in a way that follows the true distribution and you'll get the best result. – Richie Youm Aug 27 '20 at 19:13

When training on an imbalanced training set, the variances of your classifier's parameter estimates grow large. The more skewed your prior class distribution (A, B, C) is, the larger this problem becomes.

When possible, it is therefore recommended to train on a balanced training set (the same number of 'A', 'B', and 'C' cases). The correction for the actual prior class distribution can take place afterwards, via the standard correction formula for posterior probabilities, sketched below.
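A minimal sketch of that correction with sklearn's MultinomialNB (placeholder data; the idea is to train with a uniform prior, then multiply the posteriors by the true class priors and renormalize):

```python
# Train as if classes were balanced (uniform prior), then re-weight the
# posteriors by the assumed true class priors. Data below is placeholder.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["example text a", "example text b", "example text c"]
labels = np.array(["A", "B", "C"])

X = CountVectorizer().fit_transform(texts)
clf = MultinomialNB(fit_prior=False)  # uniform class prior
clf.fit(X, labels)

true_priors = np.array([564, 159, 82]) / 805  # must follow clf.classes_ order

corrected = clf.predict_proba(X) * true_priors     # re-weight by true priors
corrected /= corrected.sum(axis=1, keepdims=True)  # renormalize each row
predictions = clf.classes_[corrected.argmax(axis=1)]
```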

The subsets of cases you take from the different classes must be selected at random from your complete data set, to avoid any selection bias.
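For example, one simple way to draw such a balanced random subset with pandas (the file path and column names are placeholders):

```python
# Randomly sample 82 rows per class (the size of the smallest class in
# the question). "emails.csv" and the column names are placeholders.
import pandas as pd

df = pd.read_csv("emails.csv")  # columns: "text", "label"
balanced = df.groupby("label").sample(n=82, random_state=42)
```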

– Flying Dutchman