
I am trying to use Spark MLlib Logistic Regression (LR) and/or Random Forest (RF) classifiers to build a model that discriminates between two classes represented by sets whose cardinalities differ greatly.
One set has 150 000 000 negative instances and the other just 50 000 positive instances.

After training both the LR and RF classifiers with default parameters, I get very similar results from both; for example, on the following test set:

Test instances: 26842
Test positives = 433.0 
Test negatives = 26409.0

The classifier detects:

truePositives = 0.0  
trueNegatives = 26409.0  
falsePositives = 433.0  
falseNegatives = 0.0 
Precision = 0.9838685641904478
Recall = 0.9838685641904478

It looks like the classifier cannot detect any positive instance at all. Also, no matter how the data is split into train and test sets, the classifier reports exactly as many false positives as the number of positives the test set actually contains.

The LR classifier's default threshold is 0.5. Setting the threshold to 0.8 does not make any difference:

// Train with L-BFGS, then raise the decision threshold above the 0.5 default
val model = new LogisticRegressionWithLBFGS().run(training)
model.setThreshold(0.8)

Questions:

1) Please advise how to manipulate the classifier threshold to make the classifier more sensitive to the class with a tiny fraction of positive instances versus the class with a huge number of negative instances.

2) Are there any other MLlib classifiers suited to this problem?

3) What does the intercept parameter do in the Logistic Regression algorithm?

// Train with SGD, additionally fitting an intercept (bias) term
val model = new LogisticRegressionWithSGD().setIntercept(true).run(training)
  • You can try to perform a grid search over the parameters and cross-validate your model to see which model fits best (see the sketch after these comments). You should be careful about overfitting though! Concerning the intercept, it's the constant value that you add to the weight vector that can help you fit your function – eliasah Aug 03 '15 at 17:21
  • @eliasah It is unlikely to be helpful with an extremely skewed distribution like this one. – zero323 Aug 03 '15 at 17:32
  • @zork It should be `falsePositives = 0.0`, `falseNegatives = 433.0`, shouldn't it? – zero323 Aug 03 '15 at 17:35
  • Ideally it should be `truePositives = 433` and `falsePositives = falseNegatives = 0` – zork Aug 04 '15 at 10:20
  • @eliasah could you please give some links to algorithms and implementations of *a grid search over the parameters* that you are writing about? – zork Aug 04 '15 at 10:24
  • scikit-learn presents a good explanation of [grid search](http://scikit-learn.org/stable/modules/grid_search.html), which is essentially a search over estimator parameters. You can also read about it [here](https://en.wikipedia.org/wiki/Hyperparameter_optimization). I'm pointing you in these directions since I don't know what your data looks like, nor what it represents, which can be essential in some cases of the optimization process. – eliasah Aug 04 '15 at 11:44
  • Thanks! And does Logistic Regression have any hyperparameters to optimize? What are they? – zork Aug 04 '15 at 13:14
  • @zork - what was your best solution here? facing the same problem – Yaeli778 May 10 '16 at 13:04
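
A minimal sketch of the grid search and cross-validation eliasah describes, assuming the spark.ml API and a hypothetical DataFrame `trainingDF` with "label" and "features" columns; `regParam` and `elasticNetParam` are the main Logistic Regression hyperparameters to tune, and the candidate values below are illustrative, not recommendations:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// Grid over regularization strength and L1/L2 mixing
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

// Area under the precision-recall curve is more informative than
// accuracy on a data set this skewed.
val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderPR")

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEstimatorParamMaps(grid)
  .setEvaluator(evaluator)
  .setNumFolds(3)

val cvModel = cv.fit(trainingDF)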

1 Answer


Well, I think what you have here is a very unbalanced data set problem: 150 000 000 instances of Class1 versus 50 000 of Class2, i.e. Class2 is 3 000 times smaller.

So if you train a classifier that predicts Class1 for everything, you get 150 000 000 / 150 050 000 ≈ 0.999667 accuracy. By that measure, the best classifier is the one that labels everything Class1, and this is exactly what your model is learning here.

There are different ways to address these cases. In general you can down-sample the larger class or up-sample the smaller class; with random forests there are further options, for example sampling in a balanced (stratified) way, or adding class weights:

http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
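
A minimal down-sampling sketch, assuming the RDD-based MLlib API and that `data` is an RDD[LabeledPoint] with label 1.0 for the rare positive class (the 1:1 target ratio is illustrative):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def downsampleNegatives(data: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  val positives = data.filter(_.label == 1.0)
  val negatives = data.filter(_.label == 0.0)
  // Keep roughly one negative per positive
  val fraction = positives.count().toDouble / negatives.count()
  positives.union(negatives.sample(withReplacement = false, fraction, seed = 42L))
}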

Other methods exist as well, such as SMOTE (also based on sampling); for more details you can read here:

https://www3.nd.edu/~dial/papers/SPRINGER05.pdf

The threshold you can change for your logistic regression is the probability threshold; you can try playing with "probabilityCol" in the parameters of the logistic regression example here:

http://spark.apache.org/docs/latest/ml-guide.html
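
As a sketch of working with the probabilities directly (assuming the RDD-based model from the question and a `test` RDD[LabeledPoint]): clearing the threshold makes predict() return the raw probability, which you can then evaluate at every cutoff:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

model.clearThreshold() // predict() now returns P(class = 1) instead of 0.0/1.0

val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)

// Inspect the precision/recall trade-off to pick a cutoff that is
// more sensitive to the rare positive class.
metrics.precisionByThreshold().join(metrics.recallByThreshold())
  .sortByKey().collect()
  .foreach { case (t, (p, r)) => println(s"threshold=$t precision=$p recall=$r") }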

But a current problem with MLlib is that not all classifiers return a probability; I asked about this and it is on their roadmap.
