I'm currently working with a dataframe that has both categorical and continuous features, and looks like this:

https://ibb.co/bKJwSQ

I want to run a logistic regression to predict the target value. The target value in this case is race, which can either be "A", "W", "B", "H", "N", or "O", standing for "Asian", "White", "Black", "Hispanic", "Native American", or "Other".

I have turned all the features into dummy variables (except for the "race" column) in a new dataframe called "dummies". To train the model I use this code:

from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = dummies.drop("race", axis=1)
y = dummies["race"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)

predictions = logmodel.predict(X_test)

I don't get any errors. However, when I look at the classification report I get a perfect score of 1.00 for precision, recall, and f1-score, which seems a bit too good to be true... Am I doing something wrong?
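For reference, the scores come from the classification report; a minimal sketch of the call (not shown in my snippet above, so this is assumed from the metrics import):

# Assumed call: prints per-class precision, recall, and f1 -- all 1.00 here
print(metrics.classification_report(y_test, predictions))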

This is the code I used to create the dummies:

dummies = pd.get_dummies(df[["date", "armed", "age", "gender", "city", "state", "signs_of_mental_illness", "threat_level", "flee", "body_camera", "total_population"]], drop_first=True)
dummies = pd.concat([df, dummies], axis=1)

dummies.drop(["date", "armed", "age", "gender", "city", "state", "signs_of_mental_illness", "threat_level", "flee", "body_camera", "total_population"], axis=1, inplace=True)

2 Answers

The reason you are getting a perfect classification score of 1.0 is that you are treating numerical data as categorical data. When you use pandas.get_dummies on all the columns of your dataframe, you convert the dates, ages, etc. (i.e. numerical data) into dummy indicator variables, which is incorrect. In doing so, you create a separate dummy column for every distinct age (and date) in your dataset. For your small dataset that is feasible, but in a real-world scenario it means that ages 1-100 alone would produce 100 dummy columns! The description of pandas.get_dummies is as follows:

Convert categorical variable into dummy/indicator variables

This is an incorrect way of preparing data for classification. I suggest you convert only the categorical variables using pandas.get_dummies() and then verify your results. As for why you get 100% accuracy: by turning even the numerical columns into dummies, you let columns like the date act as near-unique identifiers for each row, so the model can account for practically every scenario in the data (since your dataset is small this isn't much of an overload, but for a real-world scenario it is still incorrect).
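A minimal sketch of that suggestion, assuming a particular split of the columns into categorical and numerical (the two lists below are assumptions; verify them against your dataframe's dtypes):

import pandas as pd

# Hypothetical column split -- adjust to your actual data
categorical_cols = ["armed", "gender", "city", "state",
                    "signs_of_mental_illness", "threat_level",
                    "flee", "body_camera"]
numerical_cols = ["age", "total_population"]

# Dummy-encode only the categorical columns ...
encoded = pd.get_dummies(df[categorical_cols], drop_first=True)

# ... and carry the numerical columns and the target over unchanged
# ("date" is left out here; convert it to a numeric feature or drop it)
dummies = pd.concat([df[numerical_cols + ["race"]], encoded], axis=1)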

If you want to explore some other ways to encode your data, check out this link.

Your data contains numerical columns too; only once you account for that will you get correct results.
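If you'd rather keep the encoding inside scikit-learn, a ColumnTransformer with OneHotEncoder makes the categorical/numerical split explicit (a sketch, reusing the hypothetical column lists from the previous snippet):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns, pass the numerical ones through
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)
X_encoded = preprocess.fit_transform(df[categorical_cols + numerical_cols])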

Gambit1614

You should use LabelEncoder to translate categorical features to numbers: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
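A minimal sketch of LabelEncoder, shown on the target column (note that scikit-learn's documentation describes LabelEncoder as intended for target values rather than features):

from sklearn.preprocessing import LabelEncoder

# Map the race letters ("A", "B", "H", ...) to integers 0..5
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# le.inverse_transform(...) recovers the original letters from predictions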

Right now you are actually placing the target data (although in a different form) into the train and test data, which is why you get a perfect score: the model only has to translate the dummy columns back into a single column. It's 100% accurate, of course.
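A quick way to confirm that kind of leakage is to look for feature columns derived from the target (a sketch; the "race_" prefix is an assumption about how any such dummy columns would be named):

# Assumption: leaked dummy columns carry a "race_" prefix
leaky = [col for col in X.columns if col.startswith("race_")]
print(leaky)

# Dropping them before the train/test split removes the leak
X = X.drop(columns=leaky)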

Also, look here: Multi-Class Logistic Regression in SciKit Learn

CrazyElf