
I have a model that runs the following:

import pandas as pd
import numpy as np

# initialize list of lists 
data = [['tom', 10, 1, 'a'], ['tom', 15, 5, 'a'], ['tom', 14, 1, 'a'], ['tom', 15, 4, 'b'],
        ['tom', 18, 1, 'b'], ['tom', 15, 6, 'a'], ['tom', 17, 3, 'a'], ['tom', 14, 7, 'b'],
        ['tom', 16, 6, 'a'], ['tom', 22, 2, 'a'], ['matt', 10, 1, 'c'], ['matt', 15, 5, 'b'],
        ['matt', 14, 1, 'b'], ['matt', 15, 4, 'a'], ['matt', 18, 1, 'a'], ['matt', 15, 6, 'a'],
        ['matt', 17, 3, 'a'], ['matt', 14, 7, 'c'], ['matt', 16, 6, 'b'], ['matt', 10, 2, 'b']]

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns=['Name', 'Attempts', 'Score', 'Category'])

print(df.head(2))
  Name  Attempts  Score Category
0  tom        10      1        a
1  tom        15      5        a

Then I have created a dummy df to use in the model using the following code:

from sklearn.linear_model import LogisticRegression

df_dum = pd.get_dummies(df)
print(df_dum.head(2))
  Attempts  Score  Name_matt  Name_tom  Category_a  Category_b  Category_c
0        10      1          0         1           1           0           0
1        15      5          0         1           1           0           0

Then I have created the following model:

#Model

X = df_dum.drop('Score', axis=1)
y = df_dum['Score'].values

#Training Size
train_size = int(X.shape[0]*.7)
X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]


#Fit Model
model = LogisticRegression(max_iter=1000)
model.fit(X_train,y_train)


#Send predictions back to dataframe
Z = model.predict(X_test)
zz = model.predict_proba(X_test)

df.loc[train_size:, 'predictions'] = Z
dfpredictions = df.dropna(subset=['predictions'])

print(dfpredictions)
    Name  Attempts  Score Category  predictions
14  matt        18      1        a          1.0
15  matt        15      6        a          1.0
16  matt        17      3        a          1.0
17  matt        14      7        c          1.0
18  matt        16      6        b          1.0
19  matt        10      2        b          1.0

Now I have new data that I would like to predict on:

newdata = [['tom', 10, 'a'], ['tom', 15, 'a'], ['tom', 14, 'a']]

newdf = pd.DataFrame(newdata, columns=['Name', 'Attempts', 'Category'])

print(newdf)

  Name  Attempts Category
0  tom        10        a
1  tom        15        a
2  tom        14        a

Then I create dummies and run the prediction:

newpredict = pd.get_dummies(newdf)

predict = model.predict(newpredict)

Output:

ValueError: X has 3 features per sample; expecting 6

Which makes sense, because the new data contains no categories b or c and no name matt, so pd.get_dummies produces fewer columns.
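Inspecting the columns confirms the mismatch:

print(newpredict.columns.tolist())
# ['Attempts', 'Name_tom', 'Category_a']
print(X.columns.tolist())
# ['Attempts', 'Name_matt', 'Name_tom', 'Category_a', 'Category_b', 'Category_c']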

My question is: what is the best way to set this model up, given that my new data won't always have the full set of columns used in the original data? I get new data each day, so I'm not quite sure of the most efficient and error-free way to handle this.

This is example data; my real dataset has 2000 columns after running pd.get_dummies. Thanks very much!

  • https://stackoverflow.com/questions/51208115/error-predicting-x-has-n-features-per-sample-expecting-m – BENY Jun 07 '20 at 02:06
    I remember a lengthy discussion on this site, in which people concluded that it's better to use `sklearn` one-hot encoder for this exact reason. – Nicolas Gervais Jun 07 '20 at 02:08
    @NicolasGervais's got it. If you use the same OneHotEncoder object, it will remember how many total columns there were when you fit it for the first time. – BlueSkyz Jun 07 '20 at 03:23

1 Answer


Let me explain Nicolas's and BlueSkyz's recommendation in a bit more detail.

pd.get_dummies is useful when you are sure there will not be any new categories for a specific categorical variable in production/new data, e.g. Gender, Products, etc., based on your company's or database's internal data classification/consistency rules.

However, for the majority of machine learning tasks, where you can expect new categories in the future that were not seen during model training, sklearn.OneHotEncoder should be the standard choice. Its handle_unknown parameter can be set to 'ignore' to do just that: ignore new categories when applying the encoder to future data. From the documentation:

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
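As a minimal standalone illustration of that behaviour (toy data, not from the question):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore', sparse=False)  # sparse_output=False on scikit-learn >= 1.2
enc.fit([['a'], ['b']])                    # the encoder only knows 'a' and 'b'
print(enc.transform([['c']]))              # unseen 'c' -> [[0. 0.]]
print(enc.inverse_transform([[0., 0.]]))   # -> [[None]]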

The full flow based on OneHotEncoding for your example is as follows:

# Create a boolean mask of the categorical columns
categorical_feature_mask = df.dtypes == object
# Pull the categorical column names into a list for easy reference later on,
# in case you have more than a couple of categorical columns
categorical_cols = df.columns[categorical_feature_mask].tolist()

# Instantiate the OneHotEncoder object
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)  # sparse_output=False on scikit-learn >= 1.2
# Fit the encoder on the original data, then transform
ohe.fit(df[categorical_cols])
cat_ohe = ohe.transform(df[categorical_cols])

# Create a pandas DataFrame of the one-hot encoded columns
# (use get_feature_names_out() on scikit-learn >= 1.0)
ohe_df = pd.DataFrame(cat_ohe, columns=ohe.get_feature_names(input_features=categorical_cols))
# Concat with the original data and drop the original categorical columns
df_ohe = pd.concat([df, ohe_df], axis=1).drop(columns=categorical_cols)

# The following code is for your newdf, after training and testing on the original df
# Apply the already-fitted encoder to newdf
cat_ohe_new = ohe.transform(newdf[categorical_cols])
# Create a pandas DataFrame of the one-hot encoded columns
ohe_df_new = pd.DataFrame(cat_ohe_new, columns=ohe.get_feature_names(input_features=categorical_cols))
# Concat with the new data and drop the original categorical columns
df_ohe_new = pd.concat([newdf, ohe_df_new], axis=1).drop(columns=categorical_cols)

# predict on df_ohe_new
predict = model.predict(df_ohe_new)

Output (that you can assign back to newdf):

array([1, 1, 1])
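For example, mirroring what you did with the test set earlier:

newdf['predictions'] = predict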

However, if you really want to use pd.get_dummies only, then the following will work as well:

newpredict = newpredict.reindex(columns=df_dum.columns, fill_value=0).drop(columns=['Score'])
predict = model.predict(newpredict)

The above snippet makes sure that your new dummies df (newpredict) has the same columns as the original df_dum (with 0 values where a column is missing) and drops the 'Score' column. The output here is the same as above. It also ensures that any category present in the new data set but not in the original training data is removed, while keeping the order of the columns the same as in the original df.
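If you want a quick optional sanity check that the reindexed columns line up with what the model saw at fit time:

# columns should now match the training design matrix exactly
assert list(newpredict.columns) == list(X_train.columns)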

EDIT: One thing I forgot to add is that pd.get_dummies is usually much faster to execute than sklearn.OneHotEncoder.
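One more option worth mentioning (a sketch, not part of the flows above): you can bundle the encoder and the model into a single sklearn Pipeline via ColumnTransformer, so the encoding fitted on the training data is re-applied to any new data automatically, with none of the manual concat/drop bookkeeping:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

feature_cols = ['Name', 'Attempts', 'Category']  # everything except the target

pipe = Pipeline([
    ('encode', ColumnTransformer(
        [('ohe', OneHotEncoder(handle_unknown='ignore'), ['Name', 'Category'])],
        remainder='passthrough')),  # 'Attempts' passes through unchanged
    ('model', LogisticRegression(max_iter=1000)),
])

pipe.fit(df[feature_cols], df['Score'])
predict = pipe.predict(newdf[feature_cols])  # unseen categories encode as all zeros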

  • Glad to be of help @SOK. Let me know if you face any issues while implementing it. – finlytics-hub Jun 07 '20 at 09:22
  • thanks @finlytics-hub. The only issue I am having at the moment is a `TypeError: argument must be a string or number` at the `ohe.fit(dfmodel[categorical_cols])` stage. Any ideas on how to debug? – SOK Jun 07 '20 at 12:06
  • @SOK Can you check the output of `categorical_cols`? It should be like this: `['Name', 'Category']`. And I assume the `dfmodel` you are referring to in `ohe.fit` is exactly the same `df` that you defined in your original post? – finlytics-hub Jun 07 '20 at 12:13
  • Yes, sorry, `dfmodel` is the same as `df`. When I debug, it lists `categorical_cols` as you said, and in my example it has about 7 columns (which are all categorical), so I'm not too sure why. – SOK Jun 07 '20 at 12:17
  • (I am applying it to my bigger model, which has more columns than the above question.) – SOK Jun 07 '20 at 12:18
  • Hmm, that's strange. Unfortunately I won't be able to help much without access to your code and data. Maybe try it first on the smaller dataset from your original post and take it from there once you have a working example? – finlytics-hub Jun 07 '20 at 12:21
  • Yes, thanks, I will try that. In general, should any non-`int` or non-`float` column appear in the `categorical_cols` output? – SOK Jun 07 '20 at 12:25
  • `categorical_cols` should be a list of all the categorical variables in your df. In your original contrived df there were only 2 categorical variables, which I selected using `categorical_feature_mask = df.dtypes == object`, assuming that all variables of `object` type would be categorical. If you want, you can skip this step and assign the categorical variables manually, like this: `categorical_cols = ['Name', 'Category', 'xyz', 'abc']`. – finlytics-hub Jun 07 '20 at 12:28
  • Great, thanks. I just tried on a smaller `df` and could reproduce the error when one of the non-categorical columns has a non-`int` value, so I might try reassigning the non-categorical columns with `.astype(int)` or `.astype(float)`. Will let you know how I go! – SOK Jun 07 '20 at 12:32