Can encode categorical data in train set but not in the test set

Question

I need to encode the categorical values in my test set, but it throws `TypeError: argument must be a string or number`. I don't understand why, because the same step works on my train set. Both are feature sets from the same split, so they have the same columns and differ only in the number of rows. I tried using a separate LabelEncoder for each set, but that did not fix the error. Can someone help?

For reference, the categorical data is in column 8 (index 8 in the code below) of both the train and test feature sets.

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as ss

avo_sales = pd.read_csv('avocados.csv')

avo_sales.rename(columns = {'4046':'small PLU sold',
                            '4225':'large PLU sold',
                            '4770':'xlarge PLU sold'},
                 inplace= True)

avo_sales.columns = avo_sales.columns.str.replace(' ','')

x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'], axis=1))
y = np.array(avo_sales.TotalBags)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

impC = SimpleImputer(strategy='most_frequent')
X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()

imp = SimpleImputer(strategy='median')
X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])

le = LabelEncoder()
X_train[:,8] = le.fit_transform(X_train[:,8])
X_test[:,8] = le.fit_transform(X_test[:,8])

Edoardo Guerriero · Accepted Answer · 2020-03-13T11:24:59.893

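On the test set you should never call fit_transform, only transform. It also looks like you are not applying the preprocessing you did on the training data to your test data, which is a mistake too: since the imputers were never applied to X_test, column 8 there can still contain NaN mixed with strings, which is the likely cause of the TypeError.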
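As a minimal sketch of that pattern, the snippet below fits each preprocessing step on the training data and only calls transform on the test data. The toy arrays, the column layout (two numeric columns, one categorical) and all variable names are assumptions made for illustration; they are not the asker's avocado data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for X_train / X_test:
# columns 0-1 are numeric, column 2 is categorical, NaN marks missing values.
X_train = np.array([
    [1.0, 10.0, "conventional"],
    [2.0, np.nan, "organic"],
    [3.0, 30.0, np.nan],
    [4.0, 40.0, "conventional"],
], dtype=object)
X_test = np.array([
    [np.nan, 20.0, np.nan],
    [5.0, 50.0, "organic"],
], dtype=object)

# Numeric columns: learn the medians on train, reuse them on test.
imp_num = SimpleImputer(strategy="median")
train_num = imp_num.fit_transform(X_train[:, 0:2].astype(float))
test_num = imp_num.transform(X_test[:, 0:2].astype(float))

# Categorical column: learn the most frequent value on train, reuse it on test.
imp_cat = SimpleImputer(strategy="most_frequent")
train_cat = imp_cat.fit_transform(X_train[:, [2]]).ravel()
test_cat = imp_cat.transform(X_test[:, [2]]).ravel()

# Encoder: fit on the imputed train column, only transform the test column.
le = LabelEncoder()
train_codes = le.fit_transform(train_cat)
test_codes = le.transform(test_cat)

print(test_num)
print(test_codes)
```

Note that LabelEncoder.transform raises a ValueError if the test column contains a category that was not seen during fit. The scikit-learn documentation describes LabelEncoder as intended for target labels, so OrdinalEncoder may be a better fit for feature columns.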

EDIT

When you call fit_transform, for example with SimpleImputer(strategy='most_frequent') on your training data, you are calculating the most frequent value and using it to impute the rows that contain NaN. This is fine. If you call fit_transform on your test set, you are cheating, because you assume you have many instances from which to calculate the most frequent value (whereas you might be predicting a single instance). The right thing to do is to impute the missing data using the most frequent value you found on the training set, which is done by calling only transform. The same logic applies to every other fit_transform / transform pair in sklearn, for example when applying PCA or a CountVectorizer.
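To make the single-instance case concrete, here is a small sketch with made-up values ("a", "b") and hypothetical variable names. The imputer learns its statistic from the training column, and transform reuses that stored value on a one-row test set that has nothing of its own to learn from.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training column: "b" is the most frequent value.
train_col = np.array([["a"], ["b"], ["b"], [np.nan]], dtype=object)

# A single test instance whose value is missing.
single_test_row = np.array([[np.nan]], dtype=object)

imp = SimpleImputer(strategy="most_frequent")
imp.fit(train_col)  # the statistic is computed here, from train only

# The value learned on the training data is stored on the fitted imputer.
learned_value = imp.statistics_[0]

# transform fills the test row with the training statistic.
filled = imp.transform(single_test_row)

print(learned_value)
print(filled)
```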

Firstly, thank you for answering and explaining. Excuse me for asking, but could you explain why `On the test set you should never use fit_transform, but only transform`? It worked, but I still do not get why it fixed the issue. – random student, Mar 13 '20 at 11:16
