As the title says, I am trying to use AutoML on Google Cloud Platform to predict a rare outcome. For example, suppose I have 5 independent variables: age, living area, income, family size, and gender. I want to predict a rare event called "purchase". Purchases are very rare: out of every 10,000 data points, only 3-4 are purchases. Fortunately, I have far more than 10,000 data points (about 100 million).

I have tried using AutoML to find the best model, but because the outcome is so rare, the model predicts 0 purchases for every combination of these 5 variables. How can I solve this problem in AutoML?

Mustafa Kemal
atsang01
  • I think the issue here is with your data, not AutoML. The data where the "purchase" variable is missing is basically useless in this case, and the data where it is available is probably biased. AutoML does not do data processing or feature engineering, so you need to find out how you can improve your data. – Gray_Rhino Sep 27 '21 at 10:07
  • Oh sorry, maybe I should have been clearer above. "purchase" is not missing at all; it just rarely happens, i.e. there is only a 0.03% chance that a person will make a purchase. – atsang01 Sep 27 '21 at 10:31
  • So the purchase is binary, and only 0.03% of customers made a purchase? If you give me your independent variables and I always return 0, I will be right 99.97% of the time? If that is the case, I do not think you need ML for that. – Gray_Rhino Sep 28 '21 at 01:37

1 Answer

In Cloud AutoML, both the model predictions and the model evaluation metrics depend on the confidence threshold. The default threshold is 0.5, and you can change it in the “Evaluate” tab of the “Models” section. With so few positive examples, the predicted scores probably never reach 0.5, which would explain why every prediction comes out as 0. To evaluate your model, adjust the threshold and watch how precision and recall respond; the best value depends on your use case. Here are some example scenarios to learn how evaluation metrics can be used. In your case, recall should be prioritized so that fewer purchases are missed (fewer false negatives). This usually means lowering the threshold, at the cost of more false positives and lower precision.
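To make the threshold trade-off concrete, here is a minimal sketch in plain Python. It does not call any AutoML API; the labels and scores are hypothetical toy values, and `precision_recall` is a helper written only for this illustration.

```python
def precision_recall(labels, scores, threshold):
    """Return (precision, recall) when scores >= threshold count as positive."""
    tp = fp = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1  # purchase correctly flagged
        elif predicted_positive and y == 0:
            fp += 1  # non-purchase flagged by mistake
        elif not predicted_positive and y == 1:
            fn += 1  # purchase missed
    # By convention here, a metric with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 3 purchases (label 1) among 10 rows, all scored below 0.5.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.40, 0.20, 0.08, 0.15, 0.10, 0.07, 0.04, 0.03, 0.02, 0.01]

for threshold in (0.5, 0.25, 0.06):
    p, r = precision_recall(labels, scores, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

With these toy numbers, a threshold of 0.5 flags nothing at all, 0.25 catches one of the three purchases with no false positives, and 0.06 catches all three while flagging three non-purchases as well. Real scores will behave differently, but the direction of the trade-off is the same.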

Also, the model predicts with higher confidence when the training data contains a comparable number of examples from each class of the target variable. Since your training data is highly skewed, preprocess it before training by resampling, for example by undersampling the non-purchase rows or oversampling the purchase rows.
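One simple form of resampling is random undersampling of the majority class, applied before the data is handed to training. The sketch below uses only the Python standard library; the `undersample` helper, the `ratio` parameter, and the toy rows are assumptions made for illustration, not part of AutoML.

```python
import random


def undersample(rows, label_key="purchase", ratio=1.0, seed=0):
    """Keep every positive row and a random sample of the negative rows.

    ratio is the number of negatives kept per positive (hypothetical parameter).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    n_keep = min(len(negatives), int(len(positives) * ratio))
    balanced = positives + rng.sample(negatives, n_keep)
    rng.shuffle(balanced)  # avoid all positives sitting at the front
    return balanced


# Hypothetical toy data: 10,000 rows, 3 of which are purchases.
rows = [
    {"age": 20 + i % 50, "purchase": 1 if i in (17, 4203, 9000) else 0}
    for i in range(10_000)
]

balanced = undersample(rows, ratio=10)
print(len(balanced), sum(r["purchase"] for r in balanced))
```

At a 0.03% purchase rate, 100 million rows contain roughly 30,000 purchases, so undersampling the negatives still leaves a sizable training set. Two caveats apply: scores from a model trained on resampled data no longer reflect the true purchase rate, and evaluation should be done on data that keeps the original class balance.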

Vishal K
  • Thanks for your advice, Vishal! I will do my own research on maximising recall and on resampling. Thanks again for your help. – atsang01 Oct 04 '21 at 08:37