89

Given is a simple CSV file:

A,B,C
Hello,Hi,0
Hola,Bueno,1

Obviously the real dataset is far more complex than this, but this one reproduces the error. I'm attempting to build a random forest classifier for it, like so:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

cols = ['A','B','C']
col_types = {'A': str, 'B': str, 'C': int}
test = pd.read_csv('test.csv', dtype=col_types)

train_y = test['C'] == 1
train_x = test[cols]

clf_rf = RandomForestClassifier(n_estimators=50)
clf_rf.fit(train_x, train_y)

But I just get this traceback when invoking fit():

ValueError: could not convert string to float: 'Bueno'

scikit-learn version is 0.16.1.

nilkn
  • How about converting the string column to a categorical type, e.g. `df['zipcode'] = df['zipcode'].astype('category')` – LeMarque Jan 27 '20 at 14:44

8 Answers

105

You have to do some encoding before using fit(). As was noted, fit() does not accept strings, but you can solve this.

There are several classes that can be used :

  • LabelEncoder: turns your strings into incremental integer values
  • OneHotEncoder: uses the one-of-K scheme to transform your strings into binary columns

Personally, I posted almost the same question on Stack Overflow some time ago. I wanted a scalable solution but didn't get any answer. I selected OneHotEncoder, which binarizes all the strings. It is quite effective, but if you have a lot of different strings the matrix will grow very quickly and a lot of memory will be required.
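For reference, here is a minimal sketch of both encoders applied to the question's file (the column names are taken from the question; exact OneHotEncoder behaviour varies by scikit-learn version):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

test = pd.read_csv('test.csv', dtype={'A': str, 'B': str, 'C': int})

# LabelEncoder: fit one encoder per column, since each column has its own categories
encoded = test.copy()
for col in ['A', 'B']:
    encoded[col] = LabelEncoder().fit_transform(test[col])  # e.g. 'Hello' -> 0, 'Hola' -> 1

# OneHotEncoder on the integer codes (older releases such as the asker's 0.16.1
# only accept integers here; versions >= 0.20 also accept strings directly)
onehot = OneHotEncoder().fit_transform(encoded[['A', 'B']])  # sparse one-of-K matrix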

RPresle
  • Thanks. I eventually found a solution using DictVectorizer. I'm kind of surprised there isn't better documentation on dealing with issues like this. I'd upvote if I had enough karma here. – nilkn May 27 '15 at 13:44
  • For a decision tree, is having a label encoder okay? Wouldn't it judge 1 < 2 < 3 and so on? – haneulkim Aug 30 '21 at 09:19
  • @haneulkim It would. This is not the correct answer. And one-hot encoding is also suboptimal because the random forest training algorithm won't know to split between different sets of categories where both sets have cardinality > 1 (it can only split on one category vs. the rest), so it won't split on those features optimally. There is no solution in sklearn as of this comment. See here: stackoverflow.com/a/24715300/6238166 And here: github.com/scikit-learn/scikit-learn/pull/12866 – Joe Silverstein May 09 '23 at 02:57
22

LabelEncoding worked for me (basically you have to encode your data feature-wise; myData is a 2D array of string dtype):

import numpy as np
from sklearn import preprocessing

myData = np.genfromtxt(filecsv, delimiter=",", dtype="|a20", skip_header=1)

le = preprocessing.LabelEncoder()
for i in range(myData.shape[1]):  # encode each feature column separately
    myData[:, i] = le.fit_transform(myData[:, i])
SinOfWrath
19

I had a similar issue and found that pandas.get_dummies() solved the problem. Specifically, it splits out columns of categorical data into sets of boolean columns, one new column for each unique value in each input column. In your case, you would replace train_x = test[cols] with:

train_x = pandas.get_dummies(test[cols])

This transforms the train_x DataFrame into the following form, which RandomForestClassifier can accept:

   C  A_Hello  A_Hola  B_Bueno  B_Hi
0  0        1       0        0     1
1  1        0       1        1     0
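
One caveat worth adding (not in the original answer): at prediction time the new data must produce the same dummy columns as train_x. A common way to guarantee that, assuming a hypothetical new_data frame with the same raw columns, is to reindex against the training columns:

# Categories unseen during training simply become all-zero dummy columns
new_x = pandas.get_dummies(new_data[cols]).reindex(columns=train_x.columns, fill_value=0)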
pittsburgh137
11

You can't pass str to your model's fit() method, as mentioned here:

The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.

Try transforming your data to float and give LabelEncoder a try.

farhawa
  • Huh, how is it that there are examples which clearly use string data? I guess they're outdated or something? – nilkn May 21 '15 at 21:57
  • For instance: http://nbviewer.ipython.org/github/ofermend/IPython-notebooks/blob/master/blog-part-1.ipynb – nilkn May 21 '15 at 21:59
  • So what is the canonical way of dealing with this? There's no way I'm the first person to try to do this with scikit-learn. – nilkn May 21 '15 at 22:37
11

You may not pass str to fit this kind of classifier.

For example, if you have a feature column named 'grade' which has 3 different grades:

A, B, and C.

You have to transform those strings "A", "B", and "C" into a matrix with an encoder, like the following:

A = [1,0,0]

B = [0,1,0]

C = [0,0,1]

because the strings have no numerical meaning for the classifier.

In scikit-learn, OneHotEncoder and LabelEncoder are available in the preprocessing module. However, in older versions OneHotEncoder does not support fit_transform() on strings, and "ValueError: could not convert string to float" may happen during transform.

You can use LabelEncoder to convert from str to continuous numerical values. Then you are able to transform with OneHotEncoder as you wish.
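
A minimal sketch of that two-step chain, reusing the grade example from this answer (the sample values are just for illustration):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

grades = np.array(["A", "B", "C", "B"])

codes = LabelEncoder().fit_transform(grades)  # "A", "B", "C" -> 0, 1, 2
matrix = OneHotEncoder().fit_transform(codes.reshape(-1, 1))
print(matrix.toarray())  # rows: [1,0,0], [0,1,0], [0,0,1], [0,1,0]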

In a Pandas DataFrame, I have to encode all the columns which are categorized as dtype: object. The following code works for me, and I hope this will help you.

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
for column_name in train_data.columns:
    if train_data[column_name].dtype == object:
        train_data[column_name] = le.fit_transform(train_data[column_name])
jo nova
7

Well, there are important differences between how OneHot Encoding and Label Encoding work:

  • Label Encoding will basically switch your string variables to int. In this case, the 1st class found will be coded as 1, the 2nd as 2, and so on. But this encoding creates an issue.

Let's take the example of a variable Animal = ["Dog", "Cat", "Turtle"].

If you use Label Encoder on it, Animal will be [1, 2, 3]. If you pass this to your machine learning model, it will interpret Dog as closer to Cat than to Turtle (because the distance between 1 and 2 is lower than the distance between 1 and 3).

Label encoding is actually excellent when you have an ordinal variable.

For example, if you have a variable Age = ["Child", "Teenager", "Young Adult", "Adult", "Old"],

then using Label Encoding is perfect: Child is closer to Teenager than it is to Young Adult. You have a natural order on your variables.

  • OneHot Encoding (also done by pd.get_dummies) is the best solution when you have no natural order between your variables.

Let's take back the previous example of Animal = ["Dog", "Cat", "Turtle"].

It will create as many variables as there are classes encountered. In my example, it will create 3 binary variables: Dog, Cat, and Turtle. Then if you have Animal = "Dog", encoding will make it Dog = 1, Cat = 0, Turtle = 0.

Then you can give this to your model, and it will never interpret that Dog is closer to Cat than to Turtle.

But there are also cons to OneHot Encoding. If you have a categorical variable with 50 different classes,

e.g.: Dog, Cat, Turtle, Fish, Monkey, ...

then it will create 50 binary variables, which can cause complexity issues. In this case, you can create your own classes and manually regroup values,

e.g.: regroup Turtle, Fish, Dolphin, and Shark into a single class called Sea Animals, and then apply a OneHot Encoding (see the sketch below).
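
Here is a sketch of both points (the Age ordering and the Sea Animals regrouping come from this answer; the pandas calls are one way to do it, not the only one):

import pandas as pd

# Ordinal case: make the natural order explicit with an ordered categorical
order = ["Child", "Teenager", "Young Adult", "Adult", "Old"]
age = pd.Series(["Teenager", "Old", "Child"])
age_codes = pd.Categorical(age, categories=order, ordered=True).codes  # [1, 4, 0]

# High-cardinality case: regroup related classes by hand, then one-hot encode
animal = pd.Series(["Dog", "Cat", "Turtle", "Fish", "Dolphin", "Shark"])
sea_animals = {"Turtle", "Fish", "Dolphin", "Shark"}
regrouped = animal.where(~animal.isin(sea_animals), "Sea Animal")
dummies = pd.get_dummies(regrouped)  # columns: Cat, Dog, Sea Animal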

Adept
0

Since your input is strings, you are getting the value error. Use CountVectorizer: it will convert the data set into a sparse matrix you can train your ML algorithm on, and you will get the result.
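
To make that concrete, here is a minimal sketch (the toy strings and labels are placeholders, not from the question):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for a free-text column and its labels
text_train = ["Hello Hi", "Hola Bueno", "Hello Bueno"]
y_train = [0, 1, 1]

vectorizer = CountVectorizer()
x_counts = vectorizer.fit_transform(text_train)  # sparse bag-of-words matrix

clf = RandomForestClassifier(n_estimators=50)
clf.fit(x_counts, y_train)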

raghu
  • Hi raghu. You could try to improve this answer by providing sample code, or sample input-output. This can help the person who is asking the question how to understand your answer, which is ultimately what an answer is supposed to do. – dsapalo Feb 01 '20 at 05:39
  • after splitting the data into test and train: `count_vectorizer = CountVectorizer(); X_count = count_vectorizer.fit_transform(x_train); neigh = KNeighborsClassifier(n_neighbors=1, weights='uniform', algorithm='brute'); neigh.fit(X_count, y_train_bow)` – raghu Feb 02 '20 at 12:57
-1

Indeed, a one-hot encoder will work just fine here. Convert any string and numerical categorical variables you want into 1's and 0's this way, and random forest should not complain.
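
Applied to the question's file, that could look like the following sketch (note that the label column C is left out of the features here, unlike in the question's code):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

test = pd.read_csv('test.csv', dtype={'A': str, 'B': str, 'C': int})
train_y = test['C'] == 1
train_x = pd.get_dummies(test[['A', 'B']])  # string columns -> 0/1 dummy columns

clf = RandomForestClassifier(n_estimators=50)
clf.fit(train_x, train_y)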