I am trying to one-hot encode a dataset using LabelEncoder and OneHotEncoder from sklearn, by first label-encoding each column and then one-hot encoding the result. NOTE: I am purposely setting row 1 of the dataframe to nan for the two columns, so that the LabelEncoder sees nan while fitting and won't complain about a missing value later.

Here is the code:

training_data.dropna(axis=1,how='any',inplace=True)
print training_data.shape
rows = [1]
training_data.loc[rows, endocing_columns] = float("nan")


print training_data.loc[1].mail_category 
print training_data.loc[1].mail_type 
for col in endocing_columns:
    label_encoder=LabelEncoder()
    oneHot_encoder=OneHotEncoder(sparse=False)
    label_encoder.fit(training_data[col])
    temp_col = pd.DataFrame(label_encoder.transform(training_data[col]))

    oneHot_encoder.fit(temp_col)
    temp = oneHot_encoder.transform(temp_col)
    print training_data.shape
    temp=pd.DataFrame(temp)
    training_data[col].value_counts().index])
    # For side-by-side concatenation the index values must match,
    # so set temp's index to that of the training_data data frame
    temp=temp.set_index(training_data.index.values)
    # add the new one-hot encoded variables to the training data frame
    training_data=pd.concat([training_data,temp],axis=1)
    training_data.drop(col, axis=1, inplace=True)

    print label_encoder.classes_
    temp_col = pd.DataFrame(label_encoder.transform(test_data[col]))
    temp = oneHot_encoder.transform(temp_col)

Here is the output of the code (notice that nan appears in the label encoder's printed classes):

(478192, 46)
nan
nan
(478192, 46)
[nan 'mail_category_1' 'mail_category_10' 'mail_category_11'
 'mail_category_12' 'mail_category_13' 'mail_category_14'
 'mail_category_15' 'mail_category_16' 'mail_category_17'
 'mail_category_18' 'mail_category_2' 'mail_category_3' 'mail_category_4'
 'mail_category_5' 'mail_category_6' 'mail_category_7' 'mail_category_8'
 'mail_category_9']
Traceback (most recent call last):
  File "basic_analysis.py", line 46, in <module>
    temp_col = pd.DataFrame(label_encoder.transform(test_data[col]))
  File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/label.py", line 148, in transform
    raise ValueError("y contains new labels: %s" % str(diff))
ValueError: y contains new labels: [nan]
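
A possible explanation, sketched below in plain Python rather than with sklearn itself: a float NaN never compares equal to another NaN, so any equality-based check for "unseen" labels will report the NaN in the test data as new, even though a NaN was present during fitting. Whether this is exactly what happens inside this sklearn version is an assumption. The helper names (`replace_nan`, `fit_classes`, `transform`) and the `"missing"` placeholder are hypothetical and only illustrate the idea of swapping NaN for an ordinary string before encoding.

```python
import math

def replace_nan(values, placeholder="missing"):
    """Return a copy of values with every float NaN replaced by a string label."""
    return [placeholder if isinstance(v, float) and math.isnan(v) else v
            for v in values]

def fit_classes(values):
    """Collect the sorted distinct labels, like the classes a label encoder learns."""
    return sorted(set(values))

def transform(values, classes):
    """Map each label to its integer position in classes."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values]

# Two separate NaN objects, as would come from two different data frames.
nan_train = float("nan")
nan_test = float("nan")

train = [nan_train, "mail_category_1", "mail_category_2"]
test = ["mail_category_2", nan_test]

# An equality-based membership check treats the test NaN as an unseen label,
# because NaN == NaN is always False.
unseen = [v for v in test if not any(v == c for c in train)]

# Replacing NaN with a normal string first makes the labels comparable.
classes = fit_classes(replace_nan(train))
codes = transform(replace_nan(test), classes)

print(len(unseen))   # number of test labels reported as unseen
print(classes)
print(codes)
```

With a pandas data frame, the equivalent step would presumably be something like `training_data[col].fillna("missing")` applied to both the training and the test column before calling `label_encoder.fit`, though I have not verified this against this dataset.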
silent_dev