
I’m participating in a hackathon where we are supposed to predict whether a user is interested in jobs, given features like gender, city, training hours, experience, current company, etc.

In the training set about 90% of users are not interested in jobs and only 10% are. But in the public test set they have released, the two classes are split 50/50, and my validation accuracy is not going above 55% while my training accuracy is 99%.

Both the training and test data have missing values, and I’m imputing them using an RBM.

My question is:

Is the validation accuracy terrible because of the class imbalance, or because the missing values are being imputed incorrectly?

    Are you balancing your training set before you start to train with it? – Tim Jul 20 '18 at 05:38
  • @TimH I’m feeding it as it is. 90% not interested and 10% interested. How do I balance my data? – Vikas NS Jul 20 '18 at 06:47
  • That can be an issue. Imagine your classifier predicts everyone as not interested: it would still get an accuracy of 90%, so accuracy alone is not a good performance measure. – words_of_wisdom Jul 20 '18 at 07:07

1 Answer


Explanation:

OK, I think you need to resample your data first. Your algorithm learns that most people are not interested in jobs, and that's true if we just look at the distribution of your training data (90% not interested, 10% interested). The algorithm effectively assumes the answer is always "not interested", and that's why you reach such a high accuracy on the training set.

In your test data the distribution changes to 50%:50%. Your algorithm still assumes everyone is "not interested" and fails to predict the interested ones, so your accuracy on the test set drops to roughly 50%.
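To make the failure mode concrete, here is a minimal sketch using toy arrays and scikit-learn's DummyClassifier (not your actual features): a model that always predicts the majority class already scores 90% on an imbalanced training set like yours but only 50% on a balanced test set.

```python
# Toy illustration: always predicting the majority class looks good on
# imbalanced training data but collapses to ~50% on a balanced test set.
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)

# Stand-ins for the real features; only the label distribution matters here.
X_train = rng.normal(size=(1000, 5))
y_train = np.array([0] * 900 + [1] * 100)   # 90% "not interested", 10% "interested"
X_test = rng.normal(size=(200, 5))
y_test = np.array([0] * 100 + [1] * 100)    # balanced, like the public test set

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("train accuracy:", baseline.score(X_train, y_train))  # 0.90
print("test accuracy: ", baseline.score(X_test, y_test))    # 0.50
```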

How to solve this problem:

Resample your training data so it matches the 50%:50% distribution of the test set. There are different resampling methods available, for example:

  • Under-Sampling
  • Over-Sampling
  • Synthetic Minority Over-Sampling Technique (SMOTE)

Under-sampling: downsamples the majority class by removing items. In your case you would end up with 10% interested and 10% not interested. The disadvantage is that you would learn on only 20% of the available training data.

Over-sampling: upsamples the minority class by duplicating points. Advantage: you use all of your data. Disadvantage: it can lead to overfitting.

SMOTE: a more sophisticated over-sampling method which adds synthetic samples instead of exact copies.

I would start with simple over-sampling and check whether this solves your problem.

In Python you can use the imbalanced-learn package, which implements all of the methods listed above.
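A rough sketch of how those samplers are used, with toy arrays standing in for your (already imputed) features and labels:

```python
# The three resampling options from imbalanced-learn on toy data shaped
# like this problem (90% / 10% class split).
from collections import Counter

import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))        # stand-in for the real, imputed features
y_train = np.array([0] * 900 + [1] * 100)   # 0 = not interested, 1 = interested
print("original:     ", Counter(y_train))

# Under-sampling: drop majority-class rows until the classes match.
X_u, y_u = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
print("under-sampled:", Counter(y_u))

# Over-sampling: duplicate minority-class rows until the classes match.
X_o, y_o = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
print("over-sampled: ", Counter(y_o))

# SMOTE: interpolate new synthetic minority samples instead of exact copies.
X_s, y_s = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("SMOTE:        ", Counter(y_s))
```

Note that only the training data is resampled; your validation/test split keeps its original distribution so the reported accuracy stays meaningful.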

    I agree with Tim here. You will also need to focus on how to avoid overfitting once you balance the dataset as it seems like you have overfitted to your training set as well. – words_of_wisdom Jul 20 '18 at 07:13
  • @TimH I tried SMOTE; my data was beefed up to 30k samples. Using an SVM I got an accuracy of 62%. What I noticed is that of the 15k test samples, 10k were predicted "no" and only 5k were predicted "yes", but it should be about 7.5k of each. – Vikas NS Jul 20 '18 at 14:57
  • Maybe we can decrease the threshold for classifying a sample as "interested"? Like if output > 0.3 it's interested and if output < 0.3 it's rejected? What do you think? – Vikas NS Jul 20 '18 at 14:59