I'm working on an assignment where we use scikit-learn's OneHotEncoder to one-hot encode all the categorical columns. Here is a sample of the data and the code I used to transform it:

       grade sub_grade  short_emp  emp_length_num home_ownership        term
0          B        B2          0              11           RENT   36 months
1          C        C4          1               1           RENT   60 months
2          C        C5          0              11           RENT   36 months
3          C        C1          0              11           RENT   36 months
4          A        A4          0               4           RENT   36 months
5          E        E1          0              10           RENT   36 months

Code:

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categorical_features='all', handle_unknown='error', n_values='auto', sparse=True)
encoder.fit(lending_club)

The error I'm receiving is on the term column:

ValueError: could not convert string to float: ' 36 months'
macshaggy

2 Answers

OneHotEncoder does not support string features (in scikit-learn versions before 0.20). You have to convert them to integers first, for example with LabelEncoder. Another option is to use LabelBinarizer on each string column.

See How to do Onehotencoding in Sklearn Pipeline.
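
A minimal sketch of the LabelEncoder route, assuming a small toy frame that mirrors the grade and term columns from the question (the frame and the variable names int_coded and label_encoders are illustrative, not from the original post):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy frame mirroring two string columns from the question (illustrative values).
lending_club = pd.DataFrame({
    "grade": ["B", "C", "C", "C", "A", "E"],
    "term": [" 36 months", " 60 months", " 36 months",
             " 36 months", " 36 months", " 36 months"],
})

# LabelEncoder handles one column at a time, so keep one fitted encoder per column.
label_encoders = {}
int_coded = pd.DataFrame(index=lending_club.index)
for col in ["grade", "term"]:
    le = LabelEncoder()
    int_coded[col] = le.fit_transform(lending_club[col])
    label_encoders[col] = le

# One-hot encode the integer codes; the result is a sparse matrix by default.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(int_coded).toarray()
print(one_hot.shape)
```

Keeping the fitted LabelEncoder objects around lets you apply the same string-to-integer mapping to new data later.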

dukebody
  • Another question: I'm getting a ValueError when I try to pass the DataFrame with a selection of more than one string column, such as lending_club['grade', 'term']. Should I split the DataFrame into two frames? Or use the DataMapper to split the string data from the numeric data? – macshaggy Feb 16 '17 at 02:02
  • Can you create a new question in SO with all the info needed, please? – dukebody Feb 17 '17 at 09:39

scikit-learn's OneHotEncoder supports string features as of version 0.20.0.
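
A minimal sketch of what that looks like, assuming scikit-learn 0.20 or later and a toy frame mirroring the grade and term columns from the question (the data values are illustrative). Note the double brackets, which select several columns as a DataFrame:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame mirroring two string columns from the question (illustrative values).
lending_club = pd.DataFrame({
    "grade": ["B", "C", "C", "C", "A", "E"],
    "term": [" 36 months", " 60 months", " 36 months",
             " 36 months", " 36 months", " 36 months"],
})

# With scikit-learn >= 0.20 the string columns can be passed in directly.
encoder = OneHotEncoder(handle_unknown="error")
one_hot = encoder.fit_transform(lending_club[["grade", "term"]]).toarray()

# The categories found for each column are stored on the fitted encoder.
print(encoder.categories_)
```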

NiYanchun