
I have an ML problem. I have a machine learning classification task where the classes are -1, 0, or 1. In reality, the correct classification is 0 the vast majority of the time; approximately 1% of the time, the answer is -1 or 1.

When training (I'm using auto_ml, but I think this is a general problem), I'm finding that my model decides it can get 99% accuracy by just predicting 0 every time.

Is this a known phenomenon? Is there anything I can do to work around it other than coming up with more classifications? Maybe something that splits the 0s into different classes.

Any advice, or pointers at what to read up on next are appreciated.

Thanks.

Ludo
  • You have just stumbled upon a class imbalance problem; google and start digging (it is a whole subfield)... – desertnaut Oct 04 '18 at 08:54

2 Answers


You should look deeper into your dataset. It seems your dataset is imbalanced. Possible solutions:

  • try to balance your dataset: add more data with labels 1 and -1, or reduce the number of rows with label 0;
  • if it's not possible to balance your dataset, try changing the approach. You can assume that the labels 1 and -1 are outliers and treat this as an outlier-detection problem. scikit-learn provides examples of how to deal with outliers;
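As a minimal sketch of the second suggestion (the data here is synthetic and the parameter choices are illustrative, not from the answer), scikit-learn's `IsolationForest` can flag the rare rows as anomalies against the dominant 0 class:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_common = rng.normal(0.0, 1.0, size=(990, 2))  # the ~99% "0" class
X_rare = rng.normal(5.0, 1.0, size=(10, 2))     # the ~1% rare events
X = np.vstack([X_common, X_rare])

# contamination ~ expected fraction of rare events (1% in the question)
clf = IsolationForest(contamination=0.01, random_state=0)
pred = clf.fit_predict(X)  # +1 = inlier, -1 = outlier
n_flagged = (pred == -1).sum()
```

The detector is fit without labels; the 1% contamination rate comes straight from the class frequencies described in the question.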
Danylo Baibak

Yeah, ML can be lazy ;-)

You could try including more of the rare cases in your training set. Though you use the word 'Event', which makes me wonder if you're doing some kind of time series analysis - is this some kind of recurrent net? If so, then training with more of the rare events might be unrealistic.
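Short of collecting more rare examples, two generic workarounds are reweighting the loss toward the rare classes and naively oversampling them. A hedged sketch with scikit-learn (the data and numbers are illustrative, not from auto_ml):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:5] = -1           # ~1% of rows get the rare labels
y[5:10] = 1
X[:5] -= 4.0         # shift the rare rows so they are learnable
X[5:10] += 4.0

# Option A: penalize mistakes on rare classes more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option B: naive oversampling - repeat the rare rows before fitting
rare = np.flatnonzero(y != 0)
idx = np.concatenate([np.arange(len(y)), np.tile(rare, 50)])
clf2 = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
```

With either option the model can no longer reach its best score by predicting 0 everywhere, which is the failure mode described in the question. Accuracy also stops being a useful metric here; precision/recall per class or a confusion matrix is more informative.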

aneccodeal