Logistic regression training data set true/false ratio

Question

I am working on a logistic regression classifier using Spark ML, and I wonder whether I should train on equal amounts of data for the true and false classes.

For example, if I want to classify people as male or female, is it OK to train a model with 100 male samples and 100 female samples?

The people online may be 40% male and 60% female, but this percentage is forecast from past data, so it can change (for example, to 30% female and 70% male).

In this situation, what female/male ratio should I use for training? Is this related to overfitting?

If I train a model with 40% female + 60% male data, is it useless for classifying field data composed of 70% female + 30% male?

The Spark classification sample data has 43 false and 57 true labels: https://github.com/apache/spark/blob/master/data/mllib/sample_binary_classification_data.txt

What does the true/false ratio of the training data mean in logistic regression?

I am really not good at English, but hope you understand me.

score 3 · Answer 1 · answered Oct 31 '15 at 16:52

It should not matter what ratio you use, as long as it is reasonable.

60:40, 30:70, 50:50, it's okay. Just make sure it's not too lopsided, like 99:1.

If the entire data set is 70:30 female:male, and you want to only use a subset of this dataset, going for a 60:40 female:male ratio will not kill you.
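To illustrate why a moderate mismatch between the training ratio and the field ratio is tolerable, here is a minimal sketch of a standard prior-shift correction. It assumes that only the class ratio changes between training and field data, while the features within each class look the same. The function name `adjust_for_new_ratio` and the numbers are made up for illustration; this is not part of Spark ML.

```python
def adjust_for_new_ratio(p, train_pos_rate, field_pos_rate):
    """Rescale a predicted positive-class probability p, produced by a model
    trained with positive-class share train_pos_rate, to a field population
    whose positive-class share is field_pos_rate.

    Assumption: only the class ratio differs between training and field data.
    """
    pos = p * field_pos_rate / train_pos_rate
    neg = (1.0 - p) * (1.0 - field_pos_rate) / (1.0 - train_pos_rate)
    return pos / (pos + neg)


# Hypothetical case from the question: trained on 40% female,
# deployed where 70% of people are female.
raw_probability = 0.5  # the model's P(female) for some person
adjusted = adjust_for_new_ratio(raw_probability, 0.4, 0.7)
print(round(adjusted, 3))
```

Under that assumption the correction only shifts the intercept of the logistic model by a constant, so the way the model ranks people stays the same. What may need revisiting when the ratio changes is the decision threshold.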

Consider the following example: your test data contains 99% males and 1% females.

Technically, you can classify all males correctly and ALL females incorrectly, and your algorithm would show an error of 1%. Seems pretty good, right? No: the classifier has learned nothing about females, and the low error only reflects how lopsided the data is.

This low error is not a result of overfitting (high variance), but rather a result of a lopsided data set.

This is an extreme example, but you get the point.
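The 99:1 example can be checked with a few lines of plain Python. This is only a sketch with made-up labels (0 for male, 1 for female) and a hypothetical "classifier" that always predicts male. It shows that the overall error looks good while the per-class recall exposes the problem.

```python
# Hypothetical test set: 99 males (label 0) and 1 female (label 1).
y_true = [0] * 99 + [1]

# A degenerate "classifier" that always predicts male.
y_pred = [0] * len(y_true)


def error_rate(truth, predicted):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(t != p for t, p in zip(truth, predicted))
    return wrong / len(truth)


def recall(truth, predicted, label):
    """Fraction of samples with the given true label that were predicted correctly."""
    predictions_for_label = [p for t, p in zip(truth, predicted) if t == label]
    if not predictions_for_label:
        return 0.0
    correct = sum(p == label for p in predictions_for_label)
    return correct / len(predictions_for_label)


overall_error = error_rate(y_true, y_pred)
male_recall = recall(y_true, y_pred, label=0)
female_recall = recall(y_true, y_pred, label=1)

print("overall error:", overall_error)
print("male recall:", male_recall)
print("female recall:", female_recall)
```

With a lopsided data set it is worth looking at per-class numbers like these, not only at the overall error.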

No worries Jihun, glad to have helped :) – Red Ghost, Nov 02 '15 at 00:07
