I'm currently working with a dataframe that has both categorical and continuous features, and looks like this:
I want to run a logistic regression to predict the target value. The target is the "race" column, which takes one of six values: "A", "W", "B", "H", "N", or "O", standing for "Asian", "White", "Black", "Hispanic", "Native American", or "Other".
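In case the class balance is relevant to what follows, this is a quick way to look at how the six classes are distributed (just printing the raw counts):

print(df["race"].value_counts())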
I have turned all the features into dummy variables (except for the "race" column), in a new dataframe called "dummies". To train the model I use this code:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Features are everything except the target column
X = dummies.drop("race", axis=1)
y = dummies["race"]

# 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)
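The scores I mention below come from sklearn's classification report, produced with something along these lines (classification_report and confusion_matrix from sklearn.metrics):

print(metrics.classification_report(y_test, predictions))
print(metrics.confusion_matrix(y_test, predictions))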
I don't get any errors, but when I look at the classification report I get a perfect score of 1.00 for precision, recall, and f1-score, which seems a bit too good to be true... Am I doing something wrong?
This is the code I used to create the dummy variables:
# Columns to one-hot encode (get_dummies leaves the numeric ones unchanged)
cols = ["date", "armed", "age", "gender", "city", "state", "signs_of_mental_illness",
        "threat_level", "flee", "body_camera", "total_population"]

dummies = pd.get_dummies(df[cols], drop_first=True)
# Re-attach the rest of df, then drop the original (pre-encoding) columns
dummies = pd.concat([df, dummies], axis=1)
dummies.drop(cols, axis=1, inplace=True)
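For completeness, this is a sanity check I can run on the result, on the assumption that if something target-related slipped into the features it would show up as an unexpected column:

print(X.columns.tolist())
print(dummies.filter(like="race", axis=1).columns.tolist())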