I am training a GradientBoostingClassifier (GBC). It is a multi-class classifier with 14 output classes (labels 0 to 13). My issue is that I am not getting 100% accuracy when I predict on the training data. In fact, the mispredictions occur in the dominant classes. (My input is imbalanced, so I oversample it to create additional data.)
Here are the details. Input shape: (20744, 13) (with label encoding and min-max scaling applied to the output and input).
Class distribution (label, count) before oversampling:
[(0, 443), **(1, 6878)**, (2, 177), (3, 1255), (4, 311), (5, 172), (6, 1029), (7, 268), (8, 131), (9, 54), (10, 1159), (11, 340), (12, 1370), **(13, 7157)**]
Class distribution after oversampling with a random oversampler:
[(0, 7157), (1, 7157), (2, 7157), (3, 7157), (4, 7157), (5, 7157), (6, 7157), (7, 7157), (8, 7157), (9, 7157), (10, 7157), (11, 7157), (12, 7157), (13, 7157)]
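To make the oversampling step concrete, here is a minimal sketch of random oversampling using only the Python standard library. It is not the code of any particular library; the function name `random_oversample` and the seed are hypothetical, and only the class counts come from the post. It shows that random oversampling repeats existing rows until every class reaches the majority count, so no new feature vectors are created.

```python
import random
from collections import Counter

# Class counts copied from the distribution above (label -> count).
counts = {0: 443, 1: 6878, 2: 177, 3: 1255, 4: 311, 5: 172, 6: 1029,
          7: 268, 8: 131, 9: 54, 10: 1159, 11: 340, 12: 1370, 13: 7157}

def random_oversample(labels, seed=0):
    """Return row indices such that every class has as many rows as the largest class.

    Hypothetical helper: keeps every original row and draws the extra rows
    with replacement from the same class.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(rows) for rows in by_class.values())
    resampled = []
    for rows in by_class.values():
        resampled.extend(rows)                                      # originals
        resampled.extend(rng.choices(rows, k=target - len(rows)))   # duplicates
    return resampled

# Stand-in label vector with the same class counts as the post.
labels = [label for label, n in counts.items() for _ in range(n)]
indices = random_oversample(labels)
balanced = Counter(labels[i] for i in indices)

print(len(labels), len(indices))   # dataset size before and after oversampling
print(len(set(indices)))           # number of distinct original rows in the result
```

With these counts the result has 14 x 7157 rows, which matches the shapes reported in the post, while the number of distinct rows stays at the original dataset size.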
Final shapes after preprocessing:
Input shape X: (100198, 12)
Target Shape Y: (100198, 1)
Model: est = GradientBoostingClassifier(verbose=3, n_estimators=n_est, learning_rate=0.001, max_depth=24, min_samples_leaf=3, max_features=3)
Outputs:
ACC: 0.9632
Feature importance:
[0.09169515 0.01167983 0. 0. 0.11126567 0.14089752
0.12381927 0.10735138 0.1344401 0.13874134 0.08111774 0.058992 ]
Correctly classified test samples (confusion matrix below): 19303
[[1406 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 19 1024 4 32 4 5 24 5 0 0 24 8 48 211]
[ 0 0 1434 0 0 0 0 0 0 0 0 0 0 0]
[ 1 8 0 1423 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 1441 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 1430 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 1439 0 0 0 3 0 0 1]
[ 0 0 0 0 0 0 0 1453 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 1432 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 1445 0 0 0 0]
[ 0 2 0 0 0 0 0 0 0 0 1398 0 0 1]
[ 0 0 0 0 0 0 0 0 0 0 0 1411 0 0]
[ 0 5 0 1 0 0 0 0 0 0 0 0 1413 6]
[ 1 154 9 22 12 6 22 6 3 8 17 20 45 1154]]
Precision on Test data: 0.9632235528942116
Recall on Test data: 0.9632235528942116
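The test figures above can be re-derived from the confusion matrix itself. The sketch below uses plain Python; the matrix is copied from the post, and the assumption (not stated in the post) is that precision and recall were computed with micro averaging, which would explain why both are identical to the accuracy.

```python
# Test confusion matrix copied from the post (rows = true class, columns = predicted class).
cm = [
    [1406, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [19, 1024, 4, 32, 4, 5, 24, 5, 0, 0, 24, 8, 48, 211],
    [0, 0, 1434, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 8, 0, 1423, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1441, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1430, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1439, 0, 0, 0, 3, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1453, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1432, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1445, 0, 0, 0, 0],
    [0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1398, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1411, 0, 0],
    [0, 5, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1413, 6],
    [1, 154, 9, 22, 12, 6, 22, 6, 3, 8, 17, 20, 45, 1154],
]
n = len(cm)

correct = sum(cm[k][k] for k in range(n))      # diagonal = correct predictions
total = sum(sum(row) for row in cm)            # all test samples
accuracy = correct / total

# Micro averaging pools true positives over all classes before dividing.
row_sums = [sum(cm[k]) for k in range(n)]                        # TP + FN per class
col_sums = [sum(cm[i][k] for i in range(n)) for k in range(n)]   # TP + FP per class
micro_recall = correct / sum(row_sums)
micro_precision = correct / sum(col_sums)

# Per-class recall shows where the errors are concentrated.
recall = [cm[k][k] / row_sums[k] for k in range(n)]
worst_two = sorted(range(n), key=lambda k: recall[k])[:2]

print(correct, total, accuracy)
print(micro_precision, micro_recall)
print(worst_two, [round(recall[k], 3) for k in worst_two])
```

Because the row sums and the column sums of a confusion matrix both add up to the total sample count, micro-averaged precision and recall reduce to the accuracy in a single-label multi-class problem. The per-class recall is the more informative number here.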
**The problem I see is when I predict on the training data: I expect 100% accuracy, but my dominant classes are not predicted perfectly. Is there a reason?**
**ACC: 0.9982**
Correctly classified training samples (confusion matrix below): 80016
[[5751 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ **0 5699 2 2 1 0 1 3 3 2 0 2 2 32**]
[ 0 0 5723 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 5725 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 5716 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 5727 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 5714 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 5704 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 5725 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 5712 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 5756 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 5746 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 1 5731 0]
[ **0 4 5 5 5 2 9 8 2 16 6 19 10 5587**]]
Precision on Train data: 0.9982284987150378
Recall on Train data: 0.9982284987150378
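As a quick check on where the training errors sit, the sketch below tallies the misclassified training samples per true class. The numbers are copied from the training confusion matrix in the post; the compact layout (a diagonal list plus a dictionary of off-diagonal counts) is only a convenience chosen for this illustration.

```python
# Diagonal of the training confusion matrix (correct predictions per class).
diagonal = [5751, 5699, 5723, 5725, 5716, 5727, 5714, 5704,
            5725, 5712, 5756, 5746, 5731, 5587]

# Off-diagonal counts for the only rows that contain errors,
# stored as {true_class: {predicted_class: count}}.
errors = {
    1: {2: 2, 3: 2, 4: 1, 6: 1, 7: 3, 8: 3, 9: 2, 11: 2, 12: 2, 13: 32},
    12: {11: 1},
    13: {1: 4, 2: 5, 3: 5, 4: 5, 5: 2, 6: 9, 7: 8, 8: 2,
         9: 16, 10: 6, 11: 19, 12: 10},
}

correct = sum(diagonal)
wrong_per_class = {true: sum(preds.values()) for true, preds in errors.items()}
wrong = sum(wrong_per_class.values())
total = correct + wrong
accuracy = correct / total

print(correct, wrong, total)
print(wrong_per_class)       # misclassified samples grouped by true class
print(round(accuracy, 4))
```

All but one of the misclassified training samples have true class 1 or 13, the two classes that were dominant before oversampling and therefore received the fewest duplicated rows.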
Any idea what is going wrong?