
I am using pipelines inside cross-validation with SMOTE (from the imblearn library) to evaluate an imbalanced dataset of fraud and non-fraud customers:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import roc_curve, auc, confusion_matrix, f1_score
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import make_pipeline  # imblearn's pipeline, so SMOTE resamples only the training folds

    RANDOM_STATE = 10  # fixed seed for reproducibility
    gbm0 = GradientBoostingClassifier(random_state=10)

    # note: in newer imbalanced-learn versions `ratio` became `sampling_strategy`
    # and kind='borderline1' moved to the separate BorderlineSMOTE class
    samplers = [['SMOTE', SMOTE(random_state=RANDOM_STATE, ratio=0.5, kind='borderline1')]]
    classifier = ['gbm', gbm0]
    pipelines = [
        ['{}-{}'.format(sampler[0], classifier[0]),
         make_pipeline(sampler[1], classifier[1])]
        for sampler in samplers
    ]
    stdsc = StandardScaler()
    cv = StratifiedKFold(n_splits=3)
    mean_tpr = 0.0
    mean_fpr = np.linspace(0, 1, 100)
    Xstd = stdsc.fit_transform(X)  # X, y: feature matrix and fraud / non-fraud labels, loaded elsewhere
    scores = []
    confusion = np.array([[0, 0], [0, 0]])
    for name, pipeline in pipelines:
        mean_tpr = 0.0
        mean_fpr = np.linspace(0, 1, 100)
        for tr, ts in cv.split(Xstd, y):
            xtrain = Xstd[tr]
            ytrain = y[tr]
            test = y[ts]
            xtest = Xstd[ts]
            pipeline.fit(xtrain, ytrain)
            probas_ = pipeline.predict_proba(xtest)
            # accumulate the ROC curve interpolated on a common FPR grid
            fpr, tpr, thresholds = roc_curve(test, probas_[:, 1])
            mean_tpr += np.interp(mean_fpr, fpr, tpr)
            mean_tpr[0] = 0.0
            roc_auc = auc(fpr, tpr)

            # accumulate the confusion matrix and the per-fold F1 score
            predictions = pipeline.predict(xtest)
            confusion += confusion_matrix(test, predictions)
            score = f1_score(test, predictions)
            scores.append(score)

        mean_tpr /= cv.get_n_splits(Xstd, y)
        mean_tpr[-1] = 1.0

I am able to get the confusion matrix and the ROC curve, but I also need the overall precision and recall across all folds. How should I go about doing that?

Edit:

I know that there is `classification_report` in scikit-learn, but how can I use it for the predictions made during cross-validation?
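One way I could imagine doing it (a sketch, assuming the `pipelines`, `Xstd`, `y` and `cv` defined above) would be to let `cross_val_predict` produce out-of-fold predictions and pass them to `classification_report` once, but I am not sure whether this is the right approach:

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import classification_report

    for name, pipeline in pipelines:
        # every sample is predicted by the model fitted on the folds it was not part of
        y_pred = cross_val_predict(pipeline, Xstd, y, cv=cv)
        print(name)
        print(classification_report(y, y_pred))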

  • There is actually a method named [classification_report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) in scikit. Maybe that can help. – Vivek Kumar Jun 14 '17 at 11:25
  • I know, I have tried it, but how would I get it to work for the total predictions in cross-validation? – Zubair Ahmed Jun 14 '17 at 11:31
  • I am not sure if I understand you correctly. You mean to calculate precision and recall like you calculated mean_tpr and mean_fpr? Then you can do it in the same way you calculate f1_score here. – Vivek Kumar Jun 14 '17 at 11:51
  • Yes, I need to calculate them like I do the f1 score. Help with the syntax is appreciated. – Zubair Ahmed Jun 14 '17 at 12:38
  • just use `recall_score(test, predictions)` and `precision_score(test, predictions)`. – Vivek Kumar Jun 14 '17 at 12:41
  • I am using a slightly modified version `precision, recall, fscore, support = score(test, predictions)` `recalls.append(recall)` `precisions.append(precision)` `scores.append(fscore)` then I am doing this `print('Score:', sum(scores) / len(scores))` `print('Recall:', sum(recalls) / len(recalls))` `print('Precision:', sum(precisions) / len(precisions))` – Zubair Ahmed Jun 14 '17 at 18:34

1 Answer


So I ended up using

 
    from sklearn.metrics import precision_recall_fscore_support as score

    scores = []
    recalls = []
    precisions = []

    # inside the fold loop, after `predictions` has been computed for the test fold
    precision, recall, fscore, support = score(test, predictions)
    scores.append(fscore)
    recalls.append(recall)
    precisions.append(precision)

followed by

    print('Score:', sum(scores) / len(scores))
    print('Recall:', sum(recalls) / len(recalls))
    print('Precision:', sum(precisions) / len(precisions))
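Note that `precision_recall_fscore_support` without an `average` argument returns one value per class, so the means printed above are per-class arrays rather than single numbers. To get one number for the fraud class only, something like this should work (a sketch, assuming the fraud class is labelled 1):

    # average='binary' reports only the positive class; support comes back as None
    precision, recall, fscore, support = score(test, predictions, average='binary', pos_label=1)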