I am working on a logistic regression classifier built with Spark ML, and I wonder whether I should train it on equal amounts of data for the true and false classes.
For example, if I want to classify people as male or female, is it OK to train the model on 100 male samples and 100 female samples?
The people online may be 40% male and 60% female, but this ratio is forecast from past data, so it can change (for example, to 30% female and 70% male).
In this situation, what female/male ratio should I use in the training data? Is this related to overfitting?
If I train a model on 40% female + 60% male, is it useless for classifying field data composed of 70% female + 30% male?
The Spark classification sample data has 43 false and 57 true labels: https://github.com/apache/spark/blob/master/data/mllib/sample_binary_classification_data.txt
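To make clear what I mean by the true/false ratio, a count like this can be sketched in plain Python. This is only a minimal sketch: it assumes the file is in LIBSVM format with the label (0 or 1) as the first token on each line, and the `label_ratio` helper and the four sample lines are made up for illustration, not taken from the Spark file.

```python
from collections import Counter


def label_ratio(lines):
    """Return {label: fraction} for LIBSVM-style lines: "<label> <index>:<value> ..."."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label = float(line.split()[0])  # assumption: first token is the label
        counts[label] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical sample lines, not real rows from the Spark data file.
sample = [
    "0 1:0.5 3:1.2",
    "1 2:0.7",
    "1 1:0.1 4:2.0",
    "1 3:0.9",
]
ratios = label_ratio(sample)
print(ratios)
```

To check a real file, the same function could be called as `label_ratio(open(path))`, where `path` points to a local copy of the data.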
What does the true/false ratio of the training data mean in logistic regression?
My English is not very good, but I hope the question is clear.