Scenario :
I am working on a multiclass classification problem. I have a historical dataset of 2 million records spanning 180 classes, and I need to build a model that predicts the class accurately. I built a model with the HybridGradientBoostingTree algorithm, which gives decent accuracy of around 80-85%.

Note : I also tried other classification algorithms, but they did not perform as well.

Problem :
I tried upsampling, downsampling, and a combination of both using the imblearn library, but I still have problems at prediction time. The model reports good overall accuracy, yet many individual classes are predicted incorrectly.
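To make the gap between overall accuracy and per-class behaviour concrete, here is a minimal sketch in plain Python. The labels are toy values made up for illustration, not my real data, and `per_class_recall` is a hypothetical helper name. It shows how a model can score high accuracy while a rare class is never predicted correctly, which is why per-class recall (or its unweighted mean, macro recall) is probably a better yardstick here than accuracy.

```python
from collections import Counter, defaultdict

def per_class_recall(y_true, y_pred):
    """Return {class: recall} computed from two equal-length label sequences."""
    support = Counter(y_true)          # number of true samples per class
    hits = defaultdict(int)            # correctly predicted samples per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            hits[t] += 1
    return {c: hits[c] / n for c, n in support.items()}

# Toy labels (made up for illustration): 'A' dominates, 'S' is rare.
y_true = ["A"] * 90 + ["S"] * 10
y_pred = ["A"] * 100                   # a model that always predicts 'A'

recall = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_recall = sum(recall.values()) / len(recall)

print(f"accuracy={accuracy:.2f}, macro recall={macro_recall:.2f}")
print(recall)
```

In this toy case accuracy is dominated by the majority class, while macro recall weights every class equally and exposes the failure on 'S'.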

Question :

  1. What kind of sampling strategy should I apply to the dataset below (a sample of the full data) to get a model that predicts well across classes?

  2. Do I need to stack models, i.e. divide the dataset into three ranges, train one model per range, and stack their results?

Note: The dataset below contains no null values and no duplicates.

Sample dataset :

class   number of records 
A       12385
B       6932
C       3183
D       999
E       900
F       891
G       802
H       760
I       630
J       264
K       257
L       257
M       161
N       132
O       77
P       59
Q       31
R       18
S       8

Can you please share a sampling strategy for a dataset like this?
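For reference, this is the kind of per-class target I have in mind, as a rough sketch only. The `CAP` and `FLOOR` values are arbitrary assumptions, not tuned numbers. The idea is to undersample the largest classes down to a ceiling and oversample the smallest ones up to a floor, instead of forcing every class to the same size. As far as I know, imblearn samplers accept a dict mapping class to target count through `sampling_strategy`, so the two dicts could be passed to an undersampler and an oversampler in turn.

```python
# Class counts copied from the sample table above.
counts = {
    "A": 12385, "B": 6932, "C": 3183, "D": 999, "E": 900, "F": 891,
    "G": 802, "H": 760, "I": 630, "J": 264, "K": 257, "L": 257,
    "M": 161, "N": 132, "O": 77, "P": 59, "Q": 31, "R": 18, "S": 8,
}

CAP = 3000    # hypothetical ceiling for majority classes (assumption)
FLOOR = 500   # hypothetical floor for minority classes (assumption)

# Step 1: undersampling targets (never larger than the current count).
under_targets = {c: min(n, CAP) for c, n in counts.items()}

# Step 2: oversampling targets, applied to the output of step 1
# (never smaller than the count after undersampling).
over_targets = {c: max(n, FLOOR) for c, n in under_targets.items()}

ratio_before = max(counts.values()) / min(counts.values())
ratio_after = max(over_targets.values()) / min(over_targets.values())
print(f"imbalance ratio: {ratio_before:.0f}:1 -> {ratio_after:.0f}:1")
```

Classes already between the floor and the ceiling (D through I here) are left untouched, which keeps the amount of synthetic data smaller than full balancing would.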

Adding Code :

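# Sampling: SMOTE oversampling followed by Edited Nearest Neighbours cleaning
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

smote_enn = SMOTEENN(random_state=0, enn=EditedNearestNeighbours(kind_sel='mode'))
Xsample1, y_resampled1 = smote_enn.fit_resample(X, y)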

`Before SMOTE : Counter({'A': 12385, 'B': 6932, 'C': 3183, 'D': 3158,'E': 955 ... many more classes

After SMOTE : [('B', 11873), ('C', 12320), ('D', 12327), ('A', 10404), ('E', 12326)] ...many more classes`

# SAP HANA PAL classification algorithm
# n_estimators, learning_rate, max_depth: values selected after hyperparameter tuning
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification

rdt_params = dict(random_state=2,n_estimators=16,learning_rate=0.25,max_depth=30)
uc_rdt = UnifiedClassification(func = 'HybridGradientBoostingTree', **rdt_params)

uc_rdt.fit(data=final,
           key= col_id, 
           features = features,
           label='class',
           partition_method='stratified',
           stratified_column='class', 
           partition_random_state=2,
           training_percent=0.8, ntiles=2)

Accuracy: 0.906 ; AUC: 0.9962 ; KAPPA: 3.9813

  • Can you please show what you have tried so far and where you are having an issue (possibly with a code example) in order that others might be able to assist. – D.L Feb 23 '22 at 09:08
  • Thank you for the reply. I used [SAP](https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.06/en-US/hana_ml.algorithms.pal.html#hgbtc-label) classification algorithms because of our SAP environment setup. It would be really helpful if you could share a sampling strategy, idea, or blueprint for such an imbalanced dataset, or tell me whether I need a different approach to resolve the issue. – Makarand Rayate Feb 23 '22 at 12:03
