I am trying to use SMOTE in Python and would like to know whether there is a way to manually specify the number of minority samples.

Suppose we have 100 records of one class and 10 records of another. With ratio = 1 we get 100:100, and with ratio = 1/2 we get 100:50. But I am looking for a way to manually specify the number of instances to be generated for both classes.

    Ndf_class_0_records = trainData[trainData['DIED'] == 0]
    Ndf_class_1_records = trainData[trainData['DIED'] == 1]
    Ndf_class_0_record_counts = Ndf_class_0_records.DIED.value_counts()
    Ndf_class_1_record_counts = Ndf_class_1_records.DIED.value_counts()
    X_smote = trainData.drop("DIED", axis=1)
    y_smote = trainData["DIED"]
    smt = SMOTE(ratio={0:Ndf_class_0_record_counts, 1:Ndf_class_1_record_counts*2})
    X_smote_res, y_smote_res = smt.fit_sample(X_smote, y_smote)

In the above code, I am trying to manually specify the number of samples for each class, but I get the following error at the last line:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

desertnaut
  • Which SMOTE implementation are you using? Can you show your imports? For example, [imblearn.over_sampling.SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html#) – Myles Baker Jul 22 '19 at 16:57
  • Hi Baker, I am using the one you specified: `from imblearn.over_sampling import SMOTE` – Sindhura Bonthu Jul 24 '19 at 14:50

1 Answer

If I understand you and the documentation correctly, you are not passing numbers in the `ratio` dict. `value_counts()` returns a pandas Series, so each value in your dict is a Series object rather than an integer.

The accepted types for ratio are:

float, str, dict or callable, (default=’auto’)
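To see why a Series in the dict leads to this error, here is a minimal sketch using a toy frame. The data is made up for illustration; only the `DIED` column name comes from the question. It assumes the library compares the dict values in a boolean context somewhere, which is what the error message suggests.

```python
import pandas as pd

# Toy stand-in for trainData; the values are made up for illustration.
trainData = pd.DataFrame({"DIED": [0] * 10 + [1] * 3})

class_1 = trainData[trainData["DIED"] == 1]

as_series = class_1.DIED.value_counts()  # a pandas Series with one entry
as_int = len(class_1)                    # a plain Python int

print(type(as_series).__name__)  # Series
print(type(as_int).__name__)     # int

# Using a Series where a single True/False is expected raises the same ValueError.
try:
    if as_series * 2 > 5:
        pass
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

Multiplying the Series by 2 still gives a Series, so the `*2` in the original dict does not turn it into a number.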

Please try doing:

    from imblearn.over_sampling import SMOTE

    Ndf_class_0_records = trainData[trainData['DIED'] == 0]
    Ndf_class_1_records = trainData[trainData['DIED'] == 1]
    # Changed: len() returns a plain int, whereas value_counts() returns a Series
    Ndf_class_0_record_counts = len(Ndf_class_0_records)
    Ndf_class_1_record_counts = len(Ndf_class_1_records)
    X_smote = trainData.drop("DIED", axis=1)
    y_smote = trainData["DIED"]
    # Keep class 0 at its current size and double class 1
    smt = SMOTE(ratio={0: Ndf_class_0_record_counts, 1: Ndf_class_1_record_counts * 2})
    X_smote_res, y_smote_res = smt.fit_sample(X_smote, y_smote)

This should now work, please try!
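If you want to set the targets for several classes at once, the dict can be built from plain integer counts. The following is only a sketch: `target_counts` and its `multipliers` argument are hypothetical names and not part of imbalanced-learn, and the labels are made up to match the 100:10 example from the question.

```python
from collections import Counter

def target_counts(labels, multipliers):
    """Return {class_label: int target}, scaling each current count by a multiplier.

    Hypothetical helper, not part of imbalanced-learn.
    Classes without an entry in `multipliers` keep their current count.
    """
    current = Counter(labels)
    return {
        label: int(round(count * multipliers.get(label, 1)))
        for label, count in current.items()
    }

# Made-up labels: 100 records of class 0 and 10 of class 1.
y = [0] * 100 + [1] * 10

# Keep class 0 as is and double class 1.
targets = target_counts(y, {1: 2})
print(targets)  # {0: 100, 1: 20}

# The dict could then be passed on, e.g. SMOTE(ratio=targets) as in the code above.
```

With a pandas column you could pass `y_smote.tolist()` as `labels`. Also note that more recent imbalanced-learn releases name the parameter `sampling_strategy` and the method `fit_resample`, so check which version you have installed.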

Ankur Sinha