
I am trying to one-hot encode one column of my data frame; the remaining columns are label encoded. I am using the code below:

def OneHotEncoder(repair,field):
    oe=preprocessing.OneHotEncoder()
    oe.fit(repair[field])
    np.save('/Users/sayontimondal/Desktop/SKlearn Model/Encoders/'+str(field)+'_enc_classes.npy', oe.classes_)
    repair[field] = repair[field].map(lambda s: 'Other' if s not in oe.classes_ else s)
    repair[field]=oe.transform(repair[field]) 
    return repair[field]

But when I call the function on my data frame as: repair['SALES_ORG_ID']=OneHotEncoder(repair,'SALES_ORG_ID')

I get a ValueError: could not convert string to float: Other. I do not understand why this happens, as it works when I do the same thing with label encoders. Any idea what I am doing wrong?

I just want to save the encoder classes so that they can be reused on my validation set, so any other way of doing that would also be welcome.

CSBatchelor
sayo

2 Answers


Please show the complete stack trace along with some reproducible code and data so that we can check this. It looks like an easy problem to solve once you show what you are trying to do.

Other than that, there are multiple issues here:

1) OneHotEncoder cannot be used directly on strings. First you need to convert your string features to integers (for example with LabelEncoder).

2) One-hot encoding transforms your single column into multiple columns (one per unique value in it), so you cannot assign the result directly to a single column of your dataframe.

3) Even if you do manage to transform the data with OneHotEncoder, the result is returned as a sparse matrix by default, which cannot be assigned directly to a pandas dataframe column either.

4) You are assigning the same data to the same dataframe twice. Once inside the method, where you do this:

repair[field]=oe.transform(repair[field]) 

And then you call the method like this:

repair['SALES_ORG_ID']=OneHotEncoder(repair,'SALES_ORG_ID')

This is unnecessary.

5) You first fit (or try to fit) the encoder on all the data in the field, so oe.classes_ will already contain every unique category. After that, doing

repair[field] = repair[field].map(lambda s: 'Other' if s not in oe.classes_ else s)

does not make sense, because no value can be missing from the classes. Can you show how you are doing this successfully with LabelEncoder, as you said in the question? Even if you somehow get this to work, the next line:

repair[field]=oe.transform(repair[field]) 

will throw an error, because 'Other' is a string, which OneHotEncoder does not handle. You need to add the extra category ('Other' in this case) before fitting the data.

6) I would suggest saving the fitted transformers with joblib or pickle instead of saving their classes with numpy.
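To illustrate points 5 and 6 together, here is a minimal sketch, assuming LabelEncoder as in the rest of the question. The sample values and the temporary file location are made up for the example. 'Other' is included in the data before fitting, and the whole fitted encoder is saved with joblib rather than only its classes.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical training values; 'Other' is added BEFORE fitting so that
# it becomes one of the known classes.
train_values = np.array(["A100", "B200", "A100", "Other"])

le = LabelEncoder()
le.fit(train_values)

# Persist the whole fitted encoder (the path here is only an example).
encoder_path = os.path.join(tempfile.mkdtemp(), "SALES_ORG_ID_encoder.joblib")
joblib.dump(le, encoder_path)

# Later, e.g. when preparing the validation set:
restored = joblib.load(encoder_path)

# Map values not seen during fitting to 'Other', then transform.
validation_values = np.array(["B200", "C300"])
known = set(restored.classes_)
cleaned = np.array([v if v in known else "Other" for v in validation_values])
encoded = restored.transform(cleaned)
```

Because the restored object is the fitted encoder itself, there is no need to reassign classes_ by hand after loading.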

Note: As mentioned in the changelog, from the next version (0.20.0), OneHotEncoder will be able to handle strings in the data passed to it:

String or pandas Categorical columns can now be encoded with OneHotEncoder or OrdinalEncoder.
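As a rough sketch of what that could look like on scikit-learn 0.20 or later, the example below fits OneHotEncoder on a string column and expands the result into several dataframe columns (points 2 and 3 above). The sample data and the helper name one_hot_frame are made up for illustration. handle_unknown="ignore" is used here in place of the 'Other' category, so unseen values become an all-zero row.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data standing in for the real `repair` frame.
repair = pd.DataFrame({
    "SALES_ORG_ID": ["A100", "B200", "A100"],
    "PART_NO": ["p1", "p2", "p3"],
})

# The encoder expects 2D input, hence the double brackets.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(repair[["SALES_ORG_ID"]])


def one_hot_frame(df, field, encoder):
    """Return a dense DataFrame with one column per category of `field`."""
    dense = encoder.transform(df[[field]]).toarray()  # sparse -> dense
    columns = [f"{field}_{cat}" for cat in encoder.categories_[0]]
    return pd.DataFrame(dense, columns=columns, index=df.index)


# Replace the original column with the new one-hot columns.
encoded = one_hot_frame(repair, "SALES_ORG_ID", enc)
repair_encoded = pd.concat(
    [repair.drop(columns=["SALES_ORG_ID"]), encoded], axis=1
)

# A category not seen during fit ('C300') is encoded as all zeros.
validation = pd.DataFrame({"SALES_ORG_ID": ["B200", "C300"]})
validation_encoded = one_hot_frame(validation, "SALES_ORG_ID", enc)
```

The same fitted enc object can be saved and reloaded for the validation set, so the column layout stays identical between training and validation.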

Vivek Kumar

I have concatenated "Other" to my existing data frame like this:

Other = pd.DataFrame([['Other','Other','Other','Other','Other']], columns = ['CONFIG_CD','COMPONENT_CD_ISSUE_CD','SOLD_TO_SHIP_TO','SALES_ORG_ID','PART_NO'])
repair = pd.concat([repair,Other])

After this I am doing the following for the label encoder, which works perfectly:

#label encoder
def labelHotEncoder(repair,field):
    le = preprocessing.LabelEncoder()
    le.classes_= np.load('/Users/sayontimondal/Desktop/SKlearn Model/Encoders/'+str(field)+'_enc_classes.npy')
    #np.save('/Users/sayontimondal/Desktop/SKlearn Model/Encoders/'+str(field)+'_enc_classes.npy', le.classes_)
    repair[field] = repair[field].map(lambda s: 'Other' if s not in le.classes_ else s)
    repair[field]=le.transform(repair[field])
    return repair[field]

and then calling the function as below:

repair['CONFIG_CD']=labelHotEncoder(repair,'CONFIG_CD')
repair['COMPONENT_CD_ISSUE_CD']=labelHotEncoder(repair,'COMPONENT_CD_ISSUE_CD')
repair['SOLD_TO_SHIP_TO']=labelHotEncoder(repair,'SOLD_TO_SHIP_TO')
repair['PART_NO']=labelHotEncoder(repair,'PART_NO')

sayo