I have used sklearn to build a basic multiclass naive Bayes text classifier. I have 3 classes and around 800 rows of data: Class A has 564 rows, Class B has 159, and Class C has 82. The classes are clearly unbalanced, and I understand this can affect accuracy, because Bayes' theorem uses the probability of a word occurring in a text given that the text belongs to a specific class in order to work out the probability that a text belongs to that class given that it contains the word. This was my first attempt, and I plan to collect more data. As you might imagine, Class A was the easiest to collect and Class C the hardest.
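For reference, this is the form of the rule I have in mind, for a class $c$ and a text made of words $w_1, \dots, w_n$. The notation is my own, and the second line assumes the usual naive independence between words given the class:

```latex
P(c \mid w) = \frac{P(w \mid c)\, P(c)}{P(w)}
\qquad\text{and}\qquad
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

Here $P(c)$ is the class prior and $P(w_i \mid c)$ is the per-class word likelihood.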
However, I am confused about how to approach building and improving this model, and about how balanced the per-class data should be. If I collected perfectly balanced data, say 1000 rows per class, or undersampled the data I already have, wouldn't that affect accuracy as well? In reality, Class C is definitely less likely to occur than A and B, and the class proportions in my data are roughly similar to the real probability of a text belonging to each class (although this varies from person to person). Bayes' theorem also uses the prior probability of a text belonging to a class when calculating the probability of that class given that the text contains a word. So wouldn't a balanced dataset with an equal number of rows per class decrease accuracy, because the probability of each class occurring in production is no longer taken into account? The prior would effectively be constant and identical for all classes. On the other hand, making the classes equal does remove the bias in word probabilities caused by the unbalanced data.
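To make the question about priors concrete, here is a minimal sketch of the three setups I am weighing, using sklearn's `MultinomialNB`. The toy corpus, the labels, and the explicit prior values `[0.70, 0.20, 0.10]` are hypothetical placeholders (the priors are just round numbers loosely based on my class counts), not my real data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for my real texts (4 x A, 2 x B, 1 x C)
texts = [
    "meeting schedule project update",
    "project deadline meeting notes",
    "schedule update for the project",
    "meeting notes and project plan",
    "invoice payment due",
    "payment receipt invoice",
    "password reset request",
]
labels = ["A", "A", "A", "A", "B", "B", "C"]

X = CountVectorizer().fit_transform(texts)

# Option 1: priors estimated from class frequencies in the training data (default)
nb_empirical = MultinomialNB(fit_prior=True).fit(X, labels)

# Option 2: uniform priors, roughly what a perfectly balanced training set implies
nb_uniform = MultinomialNB(fit_prior=False).fit(X, labels)

# Option 3: explicit priors in class order A, B, C (assumed values, for illustration)
nb_custom = MultinomialNB(class_prior=[0.70, 0.20, 0.10]).fit(X, labels)

# class_log_prior_ stores log P(class); exponentiate to read it as a probability
empirical_priors = np.exp(nb_empirical.class_log_prior_)
uniform_priors = np.exp(nb_uniform.class_log_prior_)
custom_priors = np.exp(nb_custom.class_log_prior_)

print("empirical:", empirical_priors.round(3))
print("uniform:  ", uniform_priors.round(3))
print("custom:   ", custom_priors.round(3))
```

In this sketch the training rows are the same for all three fits, so only the class prior changes between them; the word counts per class are untouched.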
So I am unsure how to build this model effectively. With unbalanced data, I feel that words common in Class C are perceived by the model as more likely to occur in an email of Class A, when in reality they are probably more common in C and the skewed data is creating this bias. On the other hand, balancing the classes ignores the actual probability of a text belonging to each class. That said, I have no way of calculating a universal class probability that is accurate for all individuals. Does that mean balancing the classes has less of a negative effect on accuracy? Any guidance is greatly appreciated. I am quite new to this.
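For reference, the undersampling I have in mind would look roughly like the sketch below. It uses only placeholder labels with the class counts from my dataset (564, 159, 82); the texts themselves are omitted, and the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Labels only, with the class counts from my dataset
labels = np.array(["A"] * 564 + ["B"] * 159 + ["C"] * 82)

classes, counts = np.unique(labels, return_counts=True)
n_keep = counts.min()  # size of the smallest class

# Randomly keep n_keep row indices per class, without replacement
keep_idx = np.concatenate([
    rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
    for c in classes
])
balanced_labels = labels[keep_idx]

# Class proportions before and after undersampling
original_priors = counts / counts.sum()
balanced_counts = np.unique(balanced_labels, return_counts=True)[1]
balanced_priors = balanced_counts / balanced_counts.sum()

print("original proportions:", original_priors.round(3))
print("balanced proportions:", balanced_priors.round(3))
print("rows kept:", len(balanced_labels), "of", len(labels))
```

The same `keep_idx` would be used to select the matching rows of text, which means most of the Class A and Class B rows are discarded.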