
A portion of my dataset looks like this (there are many other processor types in my actual data):

df.head(4)
 Processor Task Difficulty Time
  i3        34    3         6
  i7        34    3         4
  i3        50    1         6
  i5        25    2         5

I have created a regression model to predict Time when Processor, Task, and Difficulty are given as inputs.

I first applied label encoding to convert the categorical Processor column:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Processor'] = le.fit_transform(df['Processor'])


df.head(4)
 Processor Task Difficulty Time
  12        34    3         6
  8         34    3         4
  12        50    1         6
  2         25    2         5

This is my regression model

from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(random_state = 1)
rf_model.fit(features,target)

I want to predict Time for the input "i5", 20, 1.

How can I label-encode "i5" so that it maps to the same value (2) that it has in my encoded dataframe?

I tried this:

rf_model.predict([[le.fit_transform('i5'),20,1]])

However, the prediction I got was different from the one I get when i5 is entered directly as 2:

rf_model.predict([[2,20,1]])
StupidWolf
sebin

2 Answers


You can try it like this:

print(le.fit_transform(['i5']))
# [2]
Ehtisham Ahmed
  • I don't know why this is not working for me, I tried print(le.fit_transform(['i5'])) and print(le.fit_transform(['i7'])) both gave me the same output viz. # [0] – sebin Jan 06 '21 at 11:30
  • check your label classes `le.classes_ ` it Holds the label for each class. – Ehtisham Ahmed Jan 06 '21 at 12:14
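The behaviour sebin reports in the comment can be reproduced directly: calling fit_transform on a one-element list refits the encoder with that single label as its only class, so the result is always 0, no matter which label you pass. A minimal sketch:

```python
from sklearn.preprocessing import LabelEncoder

# Fitting on a one-element list makes that label the only known class,
# so it is always encoded as 0, whichever label it is.
le_i5 = LabelEncoder()
print(le_i5.fit_transform(['i5']))  # [0]
print(le_i5.classes_)               # ['i5']

le_i7 = LabelEncoder()
print(le_i7.fit_transform(['i7']))  # [0]
```

This is why the answer below recommends transform (which reuses the classes learned at fit time) rather than fit_transform.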

It doesn't work because you are using fit_transform, which refits the encoder and reassigns the categories instead of reusing the existing encoding. If you use le.transform instead, it should work. For example, with data like yours:

import numpy as np
import pandas as pd

np.random.seed(111)
df = pd.DataFrame({'Processor':np.random.choice(['i3','i5','i7'],50),
                  'Task':np.random.randint(25,50,50),
                  'Difficulty':np.random.randint(1,4,50),
                  'Time':np.random.randint(1,7,50)})

We make the target and features, then fit:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
features = df.iloc[:,:3].copy()
features['Processor'] = le.fit_transform(features['Processor'])
target = df['Time']

from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(random_state = 1)
rf_model.fit(features,target)

'i5' is encoded as 1, its index in le.classes_:

le.classes_
array(['i3', 'i5', 'i7'], dtype=object)

Check predictions:

rf_model.predict([[le.transform(['i5'])[0],20,1]])

array([3.975])

And:

rf_model.predict([[1,20,1]])

array([3.975])
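As a usage note (not part of the answer above): recent scikit-learn versions warn when a model fitted on a DataFrame receives a plain nested list at predict time. One way to avoid the warning is to build the prediction row as a one-row DataFrame with the same column names as the training frame, still reusing the fitted encoder via transform. A sketch, assuming the same synthetic data as above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor

# Same synthetic data as in the answer above
np.random.seed(111)
df = pd.DataFrame({'Processor': np.random.choice(['i3', 'i5', 'i7'], 50),
                   'Task': np.random.randint(25, 50, 50),
                   'Difficulty': np.random.randint(1, 4, 50),
                   'Time': np.random.randint(1, 7, 50)})

le = LabelEncoder()
features = df.iloc[:, :3].copy()
features['Processor'] = le.fit_transform(features['Processor'])
target = df['Time']

rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(features, target)

# Build the prediction row with the same column names as the
# training frame, reusing the fitted encoder via transform().
row = pd.DataFrame({'Processor': le.transform(['i5']),
                    'Task': [20],
                    'Difficulty': [1]})
print(rf_model.predict(row))
```

Keeping the feature names consistent between fit and predict also catches column-ordering mistakes that a bare list would silently accept.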
StupidWolf