
I am trying to predict which of two values appears in the column 'exit'. I have clean data (about 20 columns and 4k rows containing typical customer information such as 'sex', 'age', ...). In the training dataset, about 20% of customers are labelled '1'. I built two models, an SVM and a random forest, but both predict mostly '0' for the test dataset (almost every time). The recall of both models is 0. I attached the code where I think I may have made some stupid mistake. Any ideas why recall is so low while accuracy is around 80%?

import pandas as pd
import sklearn
from scipy import stats
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, make_scorer, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def ml_model():
    print('sklearn: %s' % sklearn.__version__)
    df = pd.read_csv('clean_data.csv')
    feat = df.drop(columns=['target'])
    label = df["target"]
    x_train, x_test, y_train, y_test = train_test_split(feat, label, test_size=0.3)
    # Scale the features (fitted on the training set)
    sc_x = StandardScaler()
    x_train = sc_x.fit_transform(x_train)

    # SVC with randomized hyper-parameter search, scored by ROC AUC
    support_vector_classifier = SVC(probability=True)
    rand_list = {"C": stats.uniform(0.1, 10),
                 "gamma": stats.uniform(0.1, 1)}
    auc = make_scorer(roc_auc_score)
    rand_search_svc = RandomizedSearchCV(support_vector_classifier, param_distributions=rand_list,
                                         n_iter=100, n_jobs=4, cv=3, random_state=42, scoring=auc)
    rand_search_svc.fit(x_train, y_train)
    support_vector_classifier = rand_search_svc.best_estimator_
    cross_val_svc = cross_val_score(estimator=support_vector_classifier, X=x_train, y=y_train, cv=10, n_jobs=-1)
    print("Cross Validation Accuracy for SVM: ", round(cross_val_svc.mean() * 100, 2), "%")
    predicted_y = support_vector_classifier.predict(x_test)
    tn, fp, fn, tp = confusion_matrix(y_test, predicted_y).ravel()
    precision_score = tp / (tp + fp)
    recall_score = tp / (tp + fn)
    print("Recall score SVC: ", recall_score)


    # Random forest with randomized hyper-parameter search
    random_forest_classifier = RandomForestClassifier()
    param_dist = {"max_depth": [3, None],
                  "max_features": sp_randint(1, 11),
                  "min_samples_split": sp_randint(2, 11),
                  "bootstrap": [True, False],
                  "criterion": ["gini", "entropy"]}
    rand_search_rf = RandomizedSearchCV(random_forest_classifier, param_distributions=param_dist,
                                        n_iter=100, cv=5)
    rand_search_rf.fit(x_train, y_train)
    random_forest_classifier = rand_search_rf.best_estimator_
    cross_val_rfc = cross_val_score(estimator=random_forest_classifier, X=x_train, y=y_train, cv=10, n_jobs=-1)
    print("Cross Validation Accuracy for RF: ", round(cross_val_rfc.mean() * 100, 2), "%")
    predicted_y = random_forest_classifier.predict(x_test)
    tn, fp, fn, tp = confusion_matrix(y_test, predicted_y).ravel()
    precision_score = tp / (tp + fp)
    recall_score = tp / (tp + fn)
    print("Recall score RF: ", recall_score)

    # Score the new data with whichever model had the better CV accuracy
    new_data = pd.read_csv('new_data.csv')
    new_data = cleaning_data_to_predict(new_data)  # my own helper, defined elsewhere
    if round(cross_val_svc.mean() * 100, 2) > round(cross_val_rfc.mean() * 100, 2):
        predictions = support_vector_classifier.predict(new_data)
        predictions_proba = support_vector_classifier.predict_proba(new_data)
    else:
        predictions = random_forest_classifier.predict(new_data)
        predictions_proba = random_forest_classifier.predict_proba(new_data)

    # Write each prediction and the probability of class 1 to a file
    with open("output.txt", "w") as f:
        for i in range(len(predictions)):
            print("id: ", i, "probability: ", predictions_proba[i][1], "exit: ", predictions[i], file=f)

2 Answers


If I have not missed it, you forgot to scale your test set, so you need to scale it as well. Note that you should only transform it, not fit the scaler again. See below.

x_test = sc_x.transform(x_test)
e_kapti
  • Where should I put it? At the place where x_test = sc_x.transform(x_test) goes? –  Nov 12 '19 at 21:35
  • Should I do the same with new_data (the dataset I want to make predictions on)? –  Nov 12 '19 at 21:46
  • Yes, all the data that this model will score needs to be scaled with the same scaler (sc_x), since all of the model's parameters were calculated from the scaled training data. – e_kapti Nov 13 '19 at 22:05
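
To make this concrete, here is a minimal sketch of the full scaling flow, reusing the variable names from the question (sc_x, x_train, x_test, new_data):

from sklearn.preprocessing import StandardScaler

sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)  # fit the scaler on the training set only
x_test = sc_x.transform(x_test)        # transform the test set with the already-fitted scaler
new_data = sc_x.transform(new_data)    # same for any new data the model will score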

I agree with @e_kapti. Also check the formulas for recall and accuracy; you might consider using the F1 score instead (https://en.wikipedia.org/wiki/F1_score).

Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

with TP, FP, TN, and FN being the number of true positives, false positives, true negatives, and false negatives, respectively.
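
Rather than computing these by hand, here is a quick sketch using scikit-learn's built-in metrics (assuming y_test and predicted_y from the question's code):

from sklearn.metrics import accuracy_score, f1_score, recall_score

# Each function compares the true labels against the predicted labels
print("Accuracy:", accuracy_score(y_test, predicted_y))
print("Recall:  ", recall_score(y_test, predicted_y))
print("F1 score:", f1_score(y_test, predicted_y))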

Bill Chen