1

I am working on a logistic regression classifier built with Spark ML, and I wonder whether I should train on equal amounts of data for the true and false classes.

For example, if I want to classify people as male or female, is it OK to train a model on 100 male examples and 100 female examples?

The online population may be 40% male and 60% female, but this percentage is a forecast based on past data, so it can change (for example, to 30% female and 70% male).

In this situation, what female/male ratio should I use in the training data? Is this related to overfitting?

If I train a model on 40% female + 60% male data, is it useless for classifying field data composed of 70% female + 30% male?

The Spark classification sample data has 43 false and 57 true labels: https://github.com/apache/spark/blob/master/data/mllib/sample_binary_classification_data.txt

What does the true/false ratio of the training data mean for logistic regression?

I am really not good at English, but hope you understand me.

Has QUIT--Anony-Mousse
Jihun No

1 Answer

3

It should not matter what ratio you use, as long as it is reasonable.

60:40, 30:70, or 50:50 are all fine. Just make sure the ratio is not too lopsided, like 99:1.

If the entire data set is 70:30 female:male, and you want to only use a subset of this dataset, going for a 60:40 female:male ratio will not kill you.
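As a rough illustration of taking a subset at a chosen ratio, here is a minimal sketch in plain Python (not Spark). The helper name `subsample_to_ratio`, the toy `people` list, and the subset size of 200 are all hypothetical, made up for this example only.

```python
import random

def subsample_to_ratio(rows, label_of, positive_label, positive_fraction, size, seed=0):
    """Draw `size` rows so that `positive_fraction` of them carry `positive_label`."""
    rng = random.Random(seed)
    positives = [r for r in rows if label_of(r) == positive_label]
    negatives = [r for r in rows if label_of(r) != positive_label]
    n_pos = round(size * positive_fraction)
    n_neg = size - n_pos
    if n_pos > len(positives) or n_neg > len(negatives):
        raise ValueError("not enough rows in one class for the requested subset")
    # Sample each class separately without replacement, then mix them.
    subset = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(subset)
    return subset

# Hypothetical full data set: 700 female and 300 male rows (70:30).
people = [("female", i) for i in range(700)] + [("male", i) for i in range(300)]

# Take a 200-row subset at a 60:40 female:male ratio.
subset = subsample_to_ratio(people, lambda r: r[0], "female", 0.6, 200)
n_female = sum(1 for r in subset if r[0] == "female")
print(n_female, len(subset) - n_female)
```

Sampling each class separately keeps the ratio exact, instead of leaving it to chance as a single random draw over the whole data set would.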

Consider the following example: your test data contains 99% males and 1% females.

Technically, you can classify all males correctly and ALL females incorrectly, and your algorithm would still show an error of only 1%. Seems pretty good, right? No, because your data is too lopsided: the 1% error hides the fact that the model never identifies a single female.

This low error is not a result of overfitting (high variance), but rather a result of a lopsided data set.

This is an extreme example, but you get the point.
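The 99:1 case can be checked with a few lines of plain Python. This is only a sketch: the labels are made up, and the "classifier" is assumed to simply predict the majority class every time.

```python
# Hypothetical test set: 99 males and 1 female.
actual = ["male"] * 99 + ["female"]

# Assumed degenerate classifier: always predict the majority class.
predicted = ["male"] * len(actual)

errors = sum(1 for a, p in zip(actual, predicted) if a != p)
error_rate = errors / len(actual)

def recall(label):
    """Fraction of rows with this true label that were predicted correctly."""
    total = sum(1 for a in actual if a == label)
    hits = sum(1 for a, p in zip(actual, predicted) if a == label and p == label)
    return hits / total

print(error_rate, recall("male"), recall("female"))
```

The overall error is 1%, yet recall for the female class is zero, which is why per-class metrics say more than a single error figure on lopsided data.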

Red Ghost