1

I am interested in any tips on how to train a classifier with a very limited positive set and a large negative set.

I have about 40 positive examples (quite lengthy articles about a particular topic) and about 19,000 negative samples (most drawn from the scikit-learn newsgroups dataset). I also have about 1,000,000 tweets that I could work with as additional negative examples for the topic I am trying to train on. Is the size of the negative set versus the positive set going to negatively influence training a classifier?

I would like to use cross-validation in scikit-learn. Do I need to break this into train / dev / test sets? I know there are some pre-built utilities in scikit-learn. Any implementation examples that you recommend or have used previously would be helpful. Thanks!

JJSanDiego
  • Which type of classifier do you intend to use? – piman314 Feb 11 '16 at 17:22
  • I have a random forest classifier... it basically reads in a TSV file, creates a bag of words, then generates the vectorizer and random forest classifier. I would like to try this with SVM and Naïve Bayes, but have not implemented them yet. Thanks for the help. Any comments/guidance appreciated. Output should be 1 or 0 if the topic is detected. – JJSanDiego Feb 11 '16 at 17:24

2 Answers

1

The answer to your first question is yes; how much it affects your results depends on the algorithm. My advice would be to keep an eye on the per-class statistics such as recall and precision (found in classification_report).

  • For RandomForest() you can look at this thread, which discusses the sample_weight parameter. In general, sample_weight is what you're looking for in scikit-learn (see the sketch after this list).

  • For SVMs, have a look at either this example or this example.

  • For NB classifiers, this should be handled implicitly by Bayes' rule; however, in practice you may see poor performance.
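As a rough illustration, here is a minimal sketch of sample_weight with a Random Forest in scikit-learn. The names `texts` and `labels` are placeholders for your documents and their 1/0 topic labels, and the vectorizer settings are purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_sample_weight

# `texts` and `labels` are placeholders for the articles/newsgroup posts
# and their 1/0 topic labels.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0)

vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

# 'balanced' weights each sample inversely to its class frequency, so the
# ~40 positives are not drowned out by the ~19,000 negatives.
weights = compute_sample_weight('balanced', y_train)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train, sample_weight=weights)

# Per-class precision and recall, as suggested above.
print(classification_report(y_test, clf.predict(X_test)))
```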

For your second question, it's up for discussion. Personally, I break my data into a train/test split, perform cross-validation on the training set for parameter estimation, retrain on all the training data, and then evaluate on my test set. However, the amount of data you have may influence the way you split it (more data means more options).
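For instance, a minimal sketch of that workflow, assuming `X` and `y` are already a vectorized feature matrix and label vector (the parameter grid here is just illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Cross-validate on the training set only, to pick parameters.
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 20]}
search = GridSearchCV(RandomForestClassifier(class_weight='balanced'),
                      param_grid, cv=5, scoring='f1')
search.fit(X_train, y_train)

# GridSearchCV refits the best model on all the training data by default,
# so the held-out test set is only touched once, at the very end.
print('best params:', search.best_params_)
print('test score:', search.score(X_test, y_test))
```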

piman314
  • Awesome answer. Thank you! – JJSanDiego Feb 11 '16 at 19:25
  • Should I try to break up my articles to achieve more samples? What split of the positive/negative data would you recommend for training and test (80% / 20%)? – JJSanDiego Feb 12 '16 at 19:13
  • I don't know whether breaking up the articles is a good idea; it depends on their length. Do you think your tf-idf vector will describe half the document as well as the entire thing? Maybe look at some unsupervised learning techniques such as clustering to get a feel for that. If you intend to use this model to predict, I would always recommend a larger test set than 20%, ideally 50-66%. Can you mark my answer please so others can see it as the answer to your question? – piman314 Feb 13 '16 at 11:17
1

You could probably use Random Forest for your classification problem. There are basically three parameters for dealing with class imbalance: class weight, sample size, and cutoff.

Class weight: the higher the weight a class is given, the more its error rate is decreased.

Sample size: oversample the minority class to counter the class imbalance when drawing the sample for each tree (not sure if scikit-learn supports this; it used to be a parameter in R's randomForest).

Cutoff: if more than x% of trees vote for the minority class, classify the sample as the minority class. By default x is 1/2 in Random Forest for a two-class problem; you can set a lower threshold for the minority class.

Check out the section on balancing prediction error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
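A rough scikit-learn sketch of the first and third levers (class weight and cutoff), assuming `X_train`, `y_train`, and `X_test` already exist as a vectorized feature matrix and labels. scikit-learn has no built-in cutoff parameter, so the threshold below is applied manually to predict_proba, and the value 0.2 is purely illustrative:

```python
from sklearn.ensemble import RandomForestClassifier

# Lever 1: class weight. 'balanced_subsample' recomputes the weights on each
# tree's bootstrap sample, which is roughly the closest scikit-learn analogue
# to per-tree resampling.
clf = RandomForestClassifier(n_estimators=300,
                             class_weight='balanced_subsample',
                             random_state=0)
clf.fit(X_train, y_train)

# Lever 3: cutoff. Lower the voting threshold for the minority (positive)
# class, e.g. 0.2 instead of the default 0.5, so fewer positives are missed.
proba_positive = clf.predict_proba(X_test)[:, 1]
y_pred = (proba_positive >= 0.2).astype(int)
```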

For the second question, if you are using Random Forest you do not need to keep separate train/validation/test sets. Random Forest does not choose any parameters based on a validation set, so a validation set is unnecessary.

Also, during training of a Random Forest, the data for each individual tree is obtained by sampling with replacement from the training data, so each training sample is left out of roughly 1/3 of the trees. We can use the votes of those trees to compute the out-of-bag (OOB) estimate of the Random Forest's classification performance. Thus, with OOB accuracy you only need a training set, not validation or test data, to estimate performance on unseen data. See the out-of-bag error section at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm for further study.
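A small sketch of the OOB estimate, again assuming `X_train` and `y_train` are already prepared; `oob_score_` and `oob_decision_function_` are the standard RandomForestClassifier attributes for this in scikit-learn:

```python
from sklearn.ensemble import RandomForestClassifier

# With oob_score=True, each training sample is scored using only the trees
# that did not see it in their bootstrap sample.
clf = RandomForestClassifier(n_estimators=500, oob_score=True,
                             bootstrap=True, random_state=0)
clf.fit(X_train, y_train)

print('OOB accuracy estimate:', clf.oob_score_)
# Per-sample OOB class probabilities are also available:
print(clf.oob_decision_function_[:5])
```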

  • I disagree with your statement that Random Forests do not choose parameters based on a validation set. Depth of the tree and the number of features to use for each split are important to get right. The number of features is especially important in NLP applications, as tf-idf vectors often contain uninformative features. – piman314 Feb 13 '16 at 11:23