Thanks for taking the time to read my question!
So I am running an experiment to see if I can predict whether an individual has been diagnosed with depression (or at least says they have been) based on the words (or tokens) they use in their tweets. I found 139 users who at some point tweeted "I have been diagnosed with depression" or some variant of this phrase in an earnest context (i.e. not joking or sarcastic; human annotators who were native speakers of the tweet's language judged whether each such tweet was genuine).
I then collected the entire public timeline of each of these users, giving me a "depressed user tweet corpus" of about 17,000 tweets.
Next I collected about 4,000 random "control" users, and from their timelines built a "control tweet corpus" of about 800,000 tweets.
Then I combined the two corpora into one big dataframe, which looks like this (a sketch of the concatenation follows the sample):
,class,tweet
0,depressed,tweet text .. *
1,depressed,tweet text.
2,depressed,@ tweet text
3,depressed,저 tweet text
4,depressed,@ tweet text
5,depressed,@ tweet text
6,depressed,@ tweet text ?
7,depressed,@ tweet text ?
8,depressed,tweet text *
9,depressed,@ tweet text ?
10,depressed,@ tweet text
11,depressed,tweet text *
12,depressed,#tweet text
13,depressed,
14,depressed,tweet text !
15,depressed,tweet text
16,depressed,tweet text. .
17,depressed,tweet text
...
150595,control,@tweet text?
150596,control,"@ tweet text."
150597,control,@ tweet text.
150598,control,"@ tweet text. *"
150599,control,"@tweet text?"
150600,control,"@ tweet text?"
150601,control,@ tweet text?
150602,control,@ tweet text.
150603,control,@tweet text~
150604,control,@ tweet text.
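For context, this is roughly how the two corpora were combined (a minimal sketch; depressed_df and control_df are hypothetical names for the two per-group dataframes):
import pandas as pd

# Hypothetical sketch: label each corpus and stack them into a single dataframe
depressed_df['class'] = 'depressed'
control_df['class'] = 'control'
tweet_corpus = pd.concat([depressed_df, control_df], ignore_index=True)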
Then I trained a multinomial Naive Bayes classifier on bag-of-words counts, using CountVectorizer and MultinomialNB from scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Turn the raw tweet text into a sparse matrix of token counts
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(tweet_corpus['tweet'].values)

# Fit the classifier, using the 'depressed'/'control' labels as targets
classifier = MultinomialNB()
targets = tweet_corpus['class'].values
classifier.fit(counts, targets)
# -> MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Unfortunately, after running a 6-fold cross-validation test, the results suck and I am trying to figure out why:
Total tweets classified: 613952
Score: 0.0
Confusion matrix:
[[596070 743]
[ 17139 0]]
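For reference, the cross-validation looks roughly like this (a simplified sketch, not my exact code; I am assuming here that the score reported above is the F1 score for the 'depressed' class):
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import confusion_matrix, f1_score

# Sketch of the 6-fold cross-validation; details (shuffling, scoring metric) may differ from the actual run
kf = KFold(n_splits=6, shuffle=True, random_state=42)
predictions = cross_val_predict(classifier, counts, targets, cv=kf)

print('Total tweets classified:', len(targets))
print('Score:', f1_score(targets, predictions, pos_label='depressed'))
print('Confusion matrix:')
print(confusion_matrix(targets, predictions))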
So, I didn't correctly predict a single depressed person's tweet! My initial thought is that I have not properly normalized the counts against the size of the control group, so even tokens that appear relatively more often in the depressed user corpus are swamped by the much larger control tweet corpus. I was under the impression that .fit() handled this already, so maybe I am on the wrong track here? If not, any suggestions on the most efficient way to normalize the data between two groups of such disparate size (e.g. something like the downsampling sketch below)?
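For concreteness, by "normalize" I mean something along the lines of balancing the two classes before training, for example this hypothetical downsampling of the control tweets to match the size of the depressed corpus:
import pandas as pd

# Hypothetical sketch: keep all depressed tweets, randomly sample an equal number of control tweets
depressed = tweet_corpus[tweet_corpus['class'] == 'depressed']
control = tweet_corpus[tweet_corpus['class'] == 'control'].sample(n=len(depressed), random_state=42)
balanced_corpus = pd.concat([depressed, control], ignore_index=True)
Is something like this the right idea, or does MultinomialNB's fitted class prior already account for the imbalance?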