I've been trying to use a model I built with the Decision Tree algorithm to make predictions on a DataFrame.

My model scores 0.96 on the test set. I then tried to use it to make predictions on the DataFrame of people who stayed, but got an error. The goal is to predict which of the people who are still with the company will leave in the future.

How can I achieve that goal?

Here is what I did:

  1. Read the DataFrame from my GitHub and split it into people who left and people who did not
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/bhaskoro-muthohar/DataScienceLearning/master/HR_comma_sep.csv')

leftdf = df[df['left']==1]
notleftdf = df[df['left']==0]
  2. Prepare the data for model generation
df.salary = df.salary.map({'low':0,'medium':1,'high':2})
df.salary
X = df.drop(['left','sales'],axis=1)
y = df['left']
  3. Split the data into train and test sets
import numpy as np
from sklearn.model_selection import train_test_split


#splitting the train and test sets
X_train, X_test, y_train, y_test= train_test_split(X,y,random_state=0, stratify=y)
  4. Train the model
from sklearn import tree
clftree = tree.DecisionTreeClassifier(max_depth=3)
clftree.fit(X_train,y_train)
  5. Evaluate the model
y_pred = clftree.predict(X_test)
print("Test set prediction:\n {}".format(y_pred))
print("Test set score: {:.2f}".format(clftree.score(X_test, y_test)))

The result is

Test set score: 0.96

  6. Then I tried to make a prediction using the DataFrame of people who have not yet left the company
X_new = notleftdf.drop(['left','sales'],axis=1)

#Map salary to 0,1,2
X_new.salary = X_new.salary.map({'low':0,'medium':1,'high':2})
X_new.salary
prediction_will_left = clftree.predict(X_new)
print("Prediction: {}".format(prediction_will_left))
print("Predicted target name: {}".format(
    notleftdf['left'][prediction_will_left]
))

The error I got is:

KeyError: "None of [Int64Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n            ...\n            0, 0, 0, 0, 0, 0, 1, 0, 0, 0],\n           dtype='int64', length=11428)] are in the [index]"

How can I solve it?

PS: The full script is linked here

ebuzz168
  • It is not clear what exactly you are trying to do. The error is obvious, that the value of index was not found. But please provide specific details as part of question in order to get quick help. – Supratim Haldar Oct 26 '19 at 11:42
  • @SupratimHaldar I'm sorry for not being clear, I have tried to use the model to make a prediction from DataFrame people who stay but got an error. The goal is to predict people who will leave the company in the future based on DataFrame who stay. – ebuzz168 Oct 26 '19 at 11:54
  • Please provide a [Minimal and Reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) (i.e. a bare minimum amount of code with which people here can reproduce the error that you're getting). – Xukrao Oct 26 '19 at 11:59
  • @Xukrao Already edited sir. – ebuzz168 Oct 26 '19 at 12:09

1 Answer

Maybe you're looking for something like this. (Self-contained script once you download the data file to the same directory.)

from sklearn import tree
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd


def process_df_for_ml(df):
    """
    Process a dataframe for model training/prediction use.

    Returns the feature DataFrame X and the target Series y.
    """

    df = df.copy()
    # Map salary to 0,1,2
    df.salary = df.salary.map({"low": 0, "medium": 1, "high": 2})
    # X is the df without the "left" and "sales" columns; y is the "left" column.
    X = df.drop(["left", "sales"], axis=1)
    y = df["left"]
    return (X, y)

# Read the CSV (reindex() with no arguments changes nothing, so it is omitted).
df = pd.read_csv("HR_comma_sep.csv")

# Train a decision tree.
X, y = process_df_for_ml(df)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)
clftree = tree.DecisionTreeClassifier(max_depth=3)
clftree.fit(X_train, y_train)

# Run the decision tree on people who haven't left yet.
notleftdf = df[df["left"] == 0].copy()
X, y = process_df_for_ml(notleftdf)
# Plug in a new column with ones and zeroes from the prediction.
notleftdf["will_leave"] = clftree.predict(X)
# Print those with the will-leave flag on.
print(notleftdf[notleftdf["will_leave"] == 1])
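
As for why the original code raised the `KeyError`: indexing a Series with an array of integers looks those integers up as index labels, and the 0/1 predictions are not labels in the filtered DataFrame's index. Here is a minimal sketch of the difference, assuming a toy DataFrame and a made-up prediction array (neither comes from the real data set); turning the predictions into a boolean mask selects the flagged rows instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in for notleftdf (assumed data, not the real HR file).
# The index labels do not start at 0, as happens after filtering a larger frame.
toy = pd.DataFrame(
    {"satisfaction_level": [0.9, 0.2, 0.7, 0.1], "left": [0, 0, 0, 0]},
    index=[2000, 2001, 2002, 2003],
)

# Made-up model output: one 0/1 flag per row, shaped like clftree.predict(X_new).
prediction = np.array([0, 1, 0, 1])

# toy["left"][prediction] would treat the values 0 and 1 as index labels,
# and no such labels exist in this index.

# Comparing against 1 gives a boolean mask, which selects rows by position.
mask = prediction == 1
will_leave = toy[mask]
print(will_leave)
```

This is the same selection the script above performs through the `will_leave` column, just without storing the flags in the DataFrame first.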
AKX
  • Thanks! Muchas gracias. But why do you prefer to define a function instead of writing the code directly? – ebuzz168 Oct 26 '19 at 12:36
  • Because you're doing the same thing twice: first for the training DF, then for the not-left DF. This'll come in handy when you're attempting to bring a model like this to production. – AKX Oct 26 '19 at 14:47
  • Could someone explain the meaning of `df = df.copy()`? – Julio Nobre Feb 20 '22 at 00:34
  • @JulioNobre It takes a copy of the dataframe and assigns it to the same name. We do this since we don't want to mutate the object passed in. – AKX Feb 20 '22 at 09:57
  • Hi @AKX. You're saying that overwriting **df** with a deep clone is intentional, in order to keep the **df** name (I guess, to avoid verbosity) while protecting the caller's original dataframe from being changed by the remaining operations inside the function. That makes sense, although, in my opinion, **df_clone = df.copy()** is a clearer choice, especially for beginners (like me). Anyway, thanks! – Julio Nobre Feb 21 '22 at 00:48
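
To illustrate the `df.copy()` point from the comments above, here is a minimal sketch, assuming two hypothetical helper functions and a tiny made-up salary column. One helper maps the column on the object it receives, the other copies first, so only the first changes the caller's DataFrame.

```python
import pandas as pd

SALARY_CODES = {"low": 0, "medium": 1, "high": 2}


def encode_salary_in_place(df):
    # Hypothetical helper: overwrites the column on the caller's DataFrame.
    df["salary"] = df["salary"].map(SALARY_CODES)
    return df


def encode_salary_on_copy(df):
    # Hypothetical helper: rebinds the local name to a copy,
    # so the caller's DataFrame is left untouched.
    df = df.copy()
    df["salary"] = df["salary"].map(SALARY_CODES)
    return df


# Two identical made-up inputs, one per helper.
original_a = pd.DataFrame({"salary": ["low", "high", "medium"]})
original_b = pd.DataFrame({"salary": ["low", "high", "medium"]})

encoded_a = encode_salary_on_copy(original_a)
encoded_b = encode_salary_in_place(original_b)

print(original_a["salary"].tolist())  # caller's data after the copying helper
print(original_b["salary"].tolist())  # caller's data after the in-place helper
```

Whether the copy is bound to `df` or to a new name such as `df_clone` is a readability choice; the caller's data is protected either way.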