
My data is imbalanced, as shown below. Whenever I try ADASYN it raises an error. Do I need to set any particular parameters for it? Sometimes it runs for a long time and returns nothing, even after 40 minutes.

                     counts  percentage
Enquiry Assigned      91284   75.902382
Test Drive Provided   25274   21.015258
Test Drive Arranged    3434    2.855361
Booked                  266    0.221178
Test Ride Provided        7    0.005820

Please suggest how I can approach this in Python. Suggestions I have heard from others:

  1. Resample two classes at a time and iterate over the remaining classes.
  2. Would downsampling the class that makes up 75% of the data help?
  3. Is there a solution using skmultilearn?

Code:

from imblearn.over_sampling import ADASYN

def makeOverSamplesADASYN(X, y):
    # X: independent variables as a DataFrame
    # y: dependent variable as a pandas Series
    # `ratio` is omitted: it was deprecated in favour of sampling_strategy
    # and newer imbalanced-learn releases reject it.
    sm = ADASYN(sampling_strategy='all', random_state=None, n_neighbors=5)
    # Return the resampled data; names assigned inside the function
    # are not visible to the caller.
    return sm.fit_resample(X, y)

X_adassin, y_adassin = makeOverSamplesADASYN(X, data_dummyvar['Sales Stage'])

print(X_adassin.shape)
print(y_adassin.shape)

Output: this runs for a very long time and produces no result. Please suggest a fix.
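One way to check whether the long run is simply a matter of data size (120265 rows going through a nearest-neighbour search) is to time the resampler on a small stratified subsample first. The sketch below only builds such a subsample; the helper name `stratified_subsample_indices` is hypothetical, the labels are regenerated from the class counts in the table above, and the floor of 6 rows per class is an assumption chosen so that `n_neighbors=5` still has enough rows in the smallest class.

```python
import numpy as np

def stratified_subsample_indices(y, frac=0.1, min_per_class=6, seed=0):
    """Return sorted row indices keeping about `frac` of each class,
    but never fewer than `min_per_class` rows (or the whole class if smaller)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        n_keep = min(len(idx), max(min_per_class, int(round(frac * len(idx)))))
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

# Class counts copied from the table in the question
counts = {
    "Enquiry Assigned": 91284,
    "Test Drive Provided": 25274,
    "Test Drive Arranged": 3434,
    "Booked": 266,
    "Test Ride Provided": 7,
}
y = np.concatenate([np.full(n, label) for label, n in counts.items()])

idx = stratified_subsample_indices(y, frac=0.05)
y_small = y[idx]
labels, sizes = np.unique(y_small, return_counts=True)
subsample_counts = dict(zip(labels.tolist(), sizes.tolist()))
print(subsample_counts)
```

With the real data, the same indices could be applied to the feature matrix (for example `X.iloc[idx]`) before calling `fit_resample`, so that a failure or a slow run shows up in seconds rather than after 40 minutes.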

desertnaut
Ayyasamy

1 Answer


I first downsampled the majority class using the code below.

# data_dummyvar is my DataFrame, with shape (120265, 894)
import pandas as pd
from sklearn.utils import resample

df_majority = data_dummyvar[data_dummyvar['Sales Stage'] == 'Enquiry Assigned']
df_majority.shape

# Downsample the majority class
# replace=False: sample without replacement
# n_samples: roughly match the largest remaining class
# random_state: reproducible results
df_majority_downsampled = resample(df_majority, replace=False, n_samples=25289, random_state=123)
df_majority_downsampled.shape

# Everything that is not the majority class is kept unchanged
df_minority = data_dummyvar[data_dummyvar['Sales Stage'] != 'Enquiry Assigned']
df_minority['Sales Stage'].value_counts()

# Recombine and tabulate the new class distribution
df_first_scaling = pd.concat([df_majority_downsampled, df_minority], ignore_index=True)
g = df_first_scaling['Sales Stage']
df = pd.concat([g.value_counts(),
                g.value_counts(normalize=True).mul(100)], axis=1, keys=('counts', 'percentage'))
print(df)

The code above gives the following output:

                     counts  percentage
Enquiry Assigned      25289   46.598489
Test Drive Provided   25281   46.583748
Test Drive Arranged    3434    6.327621
Booked                  266    0.490142

The 'Enquiry Assigned' class is now downsampled.

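Now we need to run a SMOTE/ADASYN-type algorithm on df_first_scaling twice. Three other classes remain, and two of them (Test Drive Arranged and Booked) are still under-represented; sampling_strategy='minority' oversamples only the smallest class on each run.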

from imblearn.over_sampling import ADASYN

def makeOverSamplesADASYN(X, y):
    # X: independent variables as a DataFrame
    # y: dependent variable as a pandas Series
    # `ratio` is omitted: it was deprecated in favour of sampling_strategy
    # and newer imbalanced-learn releases reject it.
    sm = ADASYN(sampling_strategy='minority', random_state=None, n_neighbors=5)
    # Return the resampled data instead of writing to globals
    return sm.fit_resample(X, y)

X_adassin_1, y_adassin_1 = makeOverSamplesADASYN(X, df_first_scaling['Sales Stage'])  # function call

print(X_adassin_1.shape)
print(y_adassin_1.shape)

This prints the following shapes:

(79334, 893)
(79334,) 

Running the same function again on the updated data (X_adassin_1, y_adassin_1) gives a resampled set with shapes (101229, 893) and (101229,).

Ayyasamy