Thanks for taking the time to read my question!
So I am running an experiment to see if I can predict whether an individual has been diagnosed with depression (or at least says they have been) based on the words (or tokens) they use in their tweets. I found 139 users who at some point tweeted "I have been diagnosed with depression" or some variant of this phrase in an earnest context (i.e. not joking or sarcastic; human annotators who were native speakers of the tweet's language judged whether each such tweet was genuine).
I then collected the entire public timeline of each of these users, giving me a "depressed user tweet corpus" of about 17,000 tweets.
Next I collected about 4,000 random "control" users, and from their timelines built a "control tweet corpus" of about 800,000 tweets.
Then I combined the two corpora into one big dataframe, which looks like this (a sketch of the concatenation follows the sample):
,class,tweet
0,depressed,tweet text .. *
1,depressed,tweet text.
2,depressed,@ tweet text
3,depressed,저 tweet text
4,depressed,@ tweet text
5,depressed,@ tweet text
6,depressed,@ tweet text ?
7,depressed,@ tweet text ?
8,depressed,tweet text *
9,depressed,@ tweet text ?
10,depressed,@ tweet text
11,depressed,tweet text *
12,depressed,#tweet text
13,depressed,
14,depressed,tweet text !
15,depressed,tweet text
16,depressed,tweet text. .
17,depressed,tweet text
...
150595,control,@tweet text?
150596,control,"@ tweet text."
150597,control,@ tweet text.
150598,control,"@ tweet text. *"
150599,control,"@tweet text?"
150600,control,"@ tweet text?"
150601,control,@ tweet text?
150602,control,@ tweet text.
150603,control,@tweet text~
150604,control,@ tweet text.
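For context, this is roughly how the two corpora were combined (a minimal sketch; depressed_df and control_df are hypothetical names for the two per-group dataframes):
import pandas as pd

# Hypothetical sketch: label each corpus and stack them into a single dataframe
depressed_df['class'] = 'depressed'
control_df['class'] = 'control'
tweet_corpus = pd.concat([depressed_df, control_df], ignore_index=True)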
Then I trained a multinomial Naive Bayes classifier on bag-of-words counts, using CountVectorizer and MultinomialNB from scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Turn the raw tweet text into a sparse matrix of token counts
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(tweet_corpus['tweet'].values)

# Fit the classifier, using the 'depressed'/'control' labels as targets
classifier = MultinomialNB()
targets = tweet_corpus['class'].values
classifier.fit(counts, targets)
# -> MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Unfortunately, after running a 6-fold cross-validation test, the results suck and I am trying to figure out why:
Total tweets classified: 613952
Score: 0.0
Confusion matrix:
[[596070 743]
[ 17139 0]]
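For reference, the cross-validation looks roughly like this (a simplified sketch, not my exact code; I am assuming here that the score reported above is the F1 score for the 'depressed' class):
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import confusion_matrix, f1_score

# Sketch of the 6-fold cross-validation; details (shuffling, scoring metric) may differ from the actual run
kf = KFold(n_splits=6, shuffle=True, random_state=42)
predictions = cross_val_predict(classifier, counts, targets, cv=kf)

print('Total tweets classified:', len(targets))
print('Score:', f1_score(targets, predictions, pos_label='depressed'))
print('Confusion matrix:')
print(confusion_matrix(targets, predictions))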
So, I didn't correctly predict a single depressed person's tweet! My initial thought is that I have not properly normalized the counts against the size of the control group, so even tokens that appear relatively more often in the depressed user corpus are swamped by the much larger control tweet corpus. I was under the impression that .fit() handled this already, so maybe I am on the wrong track here? If not, any suggestions on the most efficient way to normalize the data between two groups of such disparate size (e.g. something like the downsampling sketch below)?
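For concreteness, by "normalize" I mean something along the lines of balancing the two classes before training, for example this hypothetical downsampling of the control tweets to match the size of the depressed corpus:
import pandas as pd

# Hypothetical sketch: keep all depressed tweets, randomly sample an equal number of control tweets
depressed = tweet_corpus[tweet_corpus['class'] == 'depressed']
control = tweet_corpus[tweet_corpus['class'] == 'control'].sample(n=len(depressed), random_state=42)
balanced_corpus = pd.concat([depressed, control], ignore_index=True)
Is something like this the right idea, or does MultinomialNB's fitted class prior already account for the imbalance?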