My training data set contains 46071 examples from one class and 33606 examples from another class. Does this result in a skewed classifier? I am using SVM but don't want to use SVM's options to deal with skewed data.
1 Answer
A dataset is skewed if the classification categories are not approximately equally represented (I don't think there is a precise threshold).
Yours isn't a highly unbalanced dataset. Even so, it could introduce a bias toward the majority (and potentially uninteresting) class, especially if you evaluate classifiers with accuracy.
Skewed training sets can be managed in various ways. Two frequently used approaches are:
At the data level, some form of re-sampling (see the sketch after this list), such as
- random oversampling with replacement,
- random undersampling,
- directed oversampling (no new examples are created, the choice of samples to replace is informed rather than random),
- directed undersampling,
- oversampling with informed generation of new samples,
- combinations of the above techniques.
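For illustration, here is a minimal sketch of random undersampling and random oversampling with replacement using NumPy; the helper names `random_undersample` and `random_oversample` are just placeholders, not part of any library:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X, y):
    """Randomly drop examples so every class is reduced to the minority-class size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

def random_oversample(X, y):
    """Randomly duplicate examples (sampling with replacement) so every class
    is grown to the majority-class size."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_max, replace=True)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Toy data with roughly the same imbalance ratio as in the question (scaled down).
X = rng.normal(size=(461 + 336, 2))
y = np.array([0] * 461 + [1] * 336)

X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
print(np.bincount(y_under), np.bincount(y_over))  # [336 336] [461 461]
```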
At the algorithmic level, adjusting the costs of the various classes so as to counter the class imbalance.
Even if you would rather avoid SVM-specific options, with SVM you can adjust the class weighting scheme (see e.g. How should I teach machine learning algorithm using data with big disproportion of classes? (SVM)). You might prefer this to sub-sampling because there is no variability in the results due to the particular sub-sample used.
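As a concrete sketch of the class-weighting idea, assuming you are using scikit-learn's SVC (the 0/1 label encoding below is just an assumption for illustration):

```python
from sklearn.svm import SVC

# class_weight='balanced' rescales the penalty C for each class inversely
# proportional to its frequency, so errors on the minority class cost more.
clf = SVC(kernel="rbf", class_weight="balanced")

# Alternatively, set explicit per-class weights, e.g. roughly matching the
# 46071:33606 ratio from the question (class 1 being the minority class).
clf_manual = SVC(kernel="rbf", class_weight={0: 1.0, 1: 46071 / 33606})

# clf.fit(X_train, y_train)  # fit on your own training data
```

Because the full training set is used, repeated runs do not differ due to which sub-sample happened to be drawn.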
It's worth noting (from the Special Issue on Learning from Imbalanced Data Sets) that:
in certain domains (e.g. fraud detection) the class imbalance is intrinsic to the problem: there are typically very few cases of fraud compared to the large number of honest uses of the facilities.
However, class imbalances sometimes occur in domains that do not have an intrinsic imbalance.
This will happen when the data collection process is limited (e.g. due to economic or privacy reasons), thus creating artificial imbalances.
Conversely, in certain cases, the data abounds and it is for the scientist to decide which examples to select and in what quantity.
In addition, there can also be an imbalance in costs of making different errors, which could vary per case.
So it all depends on your data, really!
Further details: