Categorical & Numerical Features - Categorical Target - Scikit Learn - Python

Question

I have a data set containing both categorical and numerical columns, and my target column is also categorical. I am using the scikit-learn library in Python 3.4. I know that scikit-learn needs all categorical values to be transformed to numerical values before applying any machine learning approach.

How should I transform my categorical columns to numerical values? I tried a lot of things, but I keep getting different errors such as "'str' object has no ..." and "'numpy.ndarray' object has no attribute 'items'".

Here is an example of my data:
 UserID  LocationID   AmountPaid    ServiceID   Target
 29876      IS345       23.9876      FRDG        JFD
 29877      IS712       135.98       WERS        KOI

My dataset is saved in a CSV file, here is the little code I wrote to give you an idea about what I want to do:

#reading my csv file
import pandas as pd

data_dir = 'C:/Users/davtalab/Desktop/data/'
train_file = data_dir + 'train.csv'
train = pd.read_csv( train_file )

#numeric columns:
x_numeric_cols = train['AmountPaid']

#Categorical columns (a list of column names, not one concatenated string):
categorical_cols = ['UserID', 'LocationID', 'ServiceID']
x_cat_cols = train[categorical_cols].values   # .values replaces the deprecated as_matrix()

#target column:
y_target = train['Target'].values

I need x_cat_cols to be converted to numeric values and then added to x_numeric_cols, so that I have my complete input (x) values.

Then I need to convert my target column into numeric values as well and use that as my final target (y) column.

Then I want to do a Random Forest using these two complete sets as:

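from sklearn.ensemble import RandomForestClassifier as RF
#n_trees, max_features, verbose and n_jobs are defined elsewhere
rf = RF(n_estimators=n_trees, max_features=max_features, verbose=verbose, n_jobs=n_jobs)
rf.fit( x_train, y_train )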

Thanks for your help!

score 4 · Answer 1 · answered May 17 '15 at 15:15

For target, you can use sklearn's LabelEncoder. This will give you a converter from string labels to numeric ones (and also a reverse mapping). Example in the link.
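A minimal sketch of that target encoding, assuming the two labels from the question's sample rows (JFD, KOI); the variable names are only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values, taken from the sample rows in the question
y_raw = ['JFD', 'KOI', 'JFD']

encoder = LabelEncoder()
y = encoder.fit_transform(y_raw)        # classes are sorted, so JFD -> 0 and KOI -> 1
y_back = encoder.inverse_transform(y)   # reverse mapping back to the string labels
```

The fitted encoder keeps the mapping in encoder.classes_, so the same object can decode predictions later.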

As for features, learning algorithms in general expect (or work best with) numeric, ordered data. Mapping categories to arbitrary integers would imply an order that does not exist, so the best option is to use OneHotEncoder to convert the categorical features. This will generate a new binary feature for each category, denoting on/off for that category. Again, usage example in the link.
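A rough sketch of the one-hot step on the question's columns. It assumes scikit-learn 0.20 or later (where OneHotEncoder accepts string columns directly); the third row is invented so that the example has repeated categories.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Two sample rows from the question plus one made-up row (assumption)
train = pd.DataFrame({
    'UserID': [29876, 29877, 29876],
    'LocationID': ['IS345', 'IS712', 'IS345'],
    'AmountPaid': [23.9876, 135.98, 10.0],
    'ServiceID': ['FRDG', 'WERS', 'WERS'],
})

categorical_cols = ['UserID', 'LocationID', 'ServiceID']

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
x_cat = encoder.fit_transform(train[categorical_cols]).toarray()

# Append the numeric column unchanged to the one-hot block
x = np.hstack([x_cat, train[['AmountPaid']].values])
```

Each categorical column here has two distinct values, so x_cat has six binary columns and x has seven columns in total.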

Ando Saabas

For the classification target, you actually don't need to use any transformation. All the classifiers can deal with arbitrary labels. – Andreas Mueller May 18 '15 at 16:12
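As a small illustration of that comment, the sketch below fits a random forest directly on string labels. The feature values are made up; only the label names come from the question.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy numeric features and string labels (made-up values for illustration)
X = [[0.0, 1.0], [1.0, 0.0], [0.1, 0.9], [0.9, 0.2]]
y = ['JFD', 'KOI', 'JFD', 'KOI']

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)                        # string labels are accepted directly

pred = clf.predict([[0.05, 0.95]])   # predictions come back as the original strings
```

The classifier stores the distinct labels in clf.classes_ and handles the mapping internally.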

score 0 · Accepted Answer · answered Jun 14 '17 at 18:09

The errors came from the way I enumerated the data. If I print the result (using another sample) you will see:

>>> import pandas as pd
>>> train = pd.DataFrame({'a' : ['a', 'b', 'a'], 'd' : ['e', 'e', 'f'],
...                       'b' : [0, 1, 1], 'c' : ['b', 'c', 'b']})
>>> samples = [dict(enumerate(sample)) for sample in train]
>>> samples
[{0: 'a'}, {0: 'b'}, {0: 'c'}, {0: 'd'}]

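This is a list of dicts built from the column names, because iterating over a DataFrame yields its column labels rather than its rows. We should do this instead: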

>>> train_as_dicts = [dict(r.items()) for _, r in train.iterrows()]
>>> train_as_dicts
[{'a': 'a', 'c': 'b', 'b': 0, 'd': 'e'},
 {'a': 'b', 'c': 'c', 'b': 1, 'd': 'e'},
 {'a': 'a', 'c': 'b', 'b': 1, 'd': 'f'}]

Now we need to vectorize the dicts:

>>> from sklearn.feature_extraction import DictVectorizer

>>> vectorizer = DictVectorizer()
>>> vectorized_sparse = vectorizer.fit_transform(train_as_dicts)
>>> vectorized_sparse
<3x7 sparse matrix of type '<type 'numpy.float64'>'
    with 12 stored elements in Compressed Sparse Row format>

>>> vectorized_array = vectorized_sparse.toarray()
>>> vectorized_array
array([[ 1.,  0.,  0.,  1.,  0.,  1.,  0.],
       [ 0.,  1.,  1.,  0.,  1.,  1.,  0.],
       [ 1.,  0.,  1.,  1.,  0.,  0.,  1.]])
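
To get the meaning of each column, ask the vectorizer (scikit-learn 1.0 and later provide get_feature_names_out() for this; get_feature_names() was removed in 1.2):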

>>> vectorizer.get_feature_names()
['a=a', 'a=b', 'b', 'c=b', 'c=c', 'd=e', 'd=f']
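
Applied to the columns from the question, the same approach might look like the sketch below. The rows beyond the two shown in the question are invented, and to_dict(orient='records') is swapped in as a shorter way to build the list of dicts. UserID is cast to str on the assumption that it is an identifier, so that DictVectorizer one-hot encodes it instead of treating it as a number.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

# Sample rows in the shape of the question's data (rows 3 and 4 are made up)
train = pd.DataFrame({
    'UserID': [29876, 29877, 29878, 29879],
    'LocationID': ['IS345', 'IS712', 'IS345', 'IS712'],
    'AmountPaid': [23.9876, 135.98, 19.5, 140.0],
    'ServiceID': ['FRDG', 'WERS', 'FRDG', 'WERS'],
    'Target': ['JFD', 'KOI', 'JFD', 'KOI'],
})

features = train.drop(columns='Target')
# String values are one-hot encoded; numeric values are passed through as-is
features['UserID'] = features['UserID'].astype(str)

records = features.to_dict(orient='records')   # one dict per row
vectorizer = DictVectorizer(sparse=False)
x_train = vectorizer.fit_transform(records)
y_train = train['Target'].values                # string labels, no encoding needed

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(x_train, y_train)
```

At prediction time, new rows would go through vectorizer.transform (not fit_transform) so that the column layout matches the training matrix.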
