I'm new to machine learning in Python with scikit-learn. I'm trying to build a model from a data frame with three columns: pets, owner and location.

import pandas
import joblib
from sklearn.tree import DecisionTreeClassifier
from collections import defaultdict
from sklearn import preprocessing 

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
                 'New_York']
})

Next, I encode the entire DataFrame with a label encoder.

le = preprocessing.LabelEncoder()
df_encoded = df.apply(le.fit_transform)
df_array=df_encoded.values

Now, I'm splitting the encoded array into Input set (Pets and Owner) and an Output set (location)

IpSet = df_array[:,0:2]
Opset = df_array[:,2:3]

Then I create a decision tree classifier and fit it on the input and output sets.

model = DecisionTreeClassifier()
model.fit(IpSet,Opset)

Now, I'm trying to predict the Location using the model for a new Dataframe. I'm using the same Label encoder as used earlier.

df_Predict = pandas.DataFrame({
    'pets': ['cat'], 
    'owner': ['Champ']})
df_encoded_Predict = df_Predict.apply(le.fit_transform)
predictions_train = model.predict(df_encoded_Predict)
print(le.inverse_transform(predictions_train)[:1])

With this, I expect to see the value 'San_Diego', but I get 'Champ' as the output and I'm not sure why.

Could someone help me through this?

ItsMeGokul
  • Don't `fit` transformers on your test data; only call `fit` or `fit_transform` on the training data. Then, at prediction time, call `transform` with the fitted transformer. – G. Anderson Jan 18 '22 at 17:48
  • Also, you should be using `le.fit_transform(df)` not `df.apply(...)` – G. Anderson Jan 18 '22 at 17:49
  • @G.Anderson, I don't think I'm following you. 1) Could you give the logic for fit_transform only for input. Should I convert the df into input and opset even before label encoding? 2) When I do le.fit_transform(df), it works only on a 1d array. I'm trying to label encode the entire input set. – ItsMeGokul Jan 18 '22 at 18:00
  • I think that you should refer to the API of LabelEncoder, so that you can know how to apply the fit labels on the test data. – Qiyu Zhong Jan 18 '22 at 22:35

1 Answer

The logic you are following is not correct.

    df_encoded = df.apply(le.fit_transform)

Here the same encoder (le) is refitted for every column, so by the time this line finishes executing, le holds only the classes of the last column, location.
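This can be checked by inspecting the encoder's classes_ attribute after the apply call. A small sketch using the question's data (the variable classes_after_apply is only there for illustration):

```python
import pandas
from sklearn import preprocessing

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                 'New_York']
})

le = preprocessing.LabelEncoder()
df_encoded = df.apply(le.fit_transform)

# Every fit_transform call refits le from scratch, so only the classes
# of the last column processed (location) are left in the encoder.
classes_after_apply = list(le.classes_)
print(classes_after_apply)
```

The pets and owner classes are gone at this point, so le can no longer encode or decode those columns.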

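To reuse an already fitted encoder, call its .transform() method rather than .fit_transform(). This also means keeping a separate fitted encoder for each column, since a single le cannot hold the classes of all three columns. The following line refits le on the prediction data, so it ends up knowing only 'Champ', which is why inverse_transform returns 'Champ':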

       df_encoded_Predict = df_Predict.apply(le.fit_transform)
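A minimal sketch of one way to apply this, assuming one LabelEncoder per column kept in a dict. The names encoders, feature_cols and predicted_location are illustrative and not from the question, and random_state=0 is only set to make the run repeatable:

```python
import pandas
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                 'New_York']
})

# Fit one encoder per column, on the training data only.
encoders = {col: preprocessing.LabelEncoder().fit(df[col]) for col in df.columns}
df_encoded = pandas.DataFrame(
    {col: encoders[col].transform(df[col]) for col in df.columns}
)

feature_cols = ['pets', 'owner']
model = DecisionTreeClassifier(random_state=0)
model.fit(df_encoded[feature_cols].values, df_encoded['location'].values)

# At prediction time, only transform with the already fitted encoders.
df_predict = pandas.DataFrame({'pets': ['cat'], 'owner': ['Champ']})
df_predict_encoded = pandas.DataFrame(
    {col: encoders[col].transform(df_predict[col]) for col in feature_cols}
)

predictions = model.predict(df_predict_encoded.values)

# Decode with the encoder that was fitted on the location column.
predicted_location = encoders['location'].inverse_transform(predictions)
print(predicted_location[0])
```

Note that LabelEncoder.transform raises a ValueError for a value it did not see during fit, so a pet or owner missing from the training data would need separate handling.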
sulhi