Machine learning OneHotEncoding in Python

Question

I am new to machine learning and scikit-learn. I was going through the documentation and tried OneHotEncoder() with some sample data. Can someone please explain what encoder.feature_indices_ represents and how I get the output of Encoded_Vector as [[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]? Any help is appreciated. Thanks!

>>> from sklearn import preprocessing
>>> encoder = preprocessing.OneHotEncoder()
>>> encoder.fit([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4,3]])
    OneHotEncoder(categorical_features='all', dtype=<type 'numpy.float64'>,
   handle_unknown='error', n_values='auto', sparse=True)
>>> encoder.n_values_
array([ 3,  4,  6, 13])    
>>> encoder.feature_indices_
array([ 0,  3,  7, 13, 26])
>>> vector_encoded = encoder.transform([[2,3,5,3]]).toarray()
>>> print "\nEncoded_Vector =",vector_encoded
Encoded_Vector = [[ 0.  0.  1.  0.  1.  0.  0.  0.  1.  1.  0.]]
>>>

My understanding so far is

Input

0 2 1 12

1 3 5 3

2 3 2 12

1 2 4 3

This is 4 columns/features and 4 rows. Each column has a different number of unique entries. If I run:

encoder.n_values_

It gives: array([ 3, 4, 6, 13])

So categories for each feature are:

feature 1 can take 3 values : 0 1 2

feature 2 can take 4 values : 0 1 2 3

feature 3 can take 6 values : 0 1 2 3 4 5

feature 4 can take 13 values : 0 1 2 3 4 5 6 7 8 9 10 11 12

score 0 · Answer 1 · answered Oct 10 '17 at 18:21

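Even though n_values_ reports 3, 4, 6 and 13 values for your features, the training data you provided ([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]]) does not cover all of those values. n_values_ is simply the largest value in each column plus one, and feature_indices_ is the running total of those sizes ([0, 3, 7, 13, 26]), so it marks where each feature's block would start in a 26-column layout. With n_values='auto', the encoder then keeps only the columns for values it actually saw during fit.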
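To see where the numbers in the question come from, here is a rough pure-Python sketch. It assumes the legacy behavior visible in the question's output (n_values='auto'); the variable names are mine and this is not scikit-learn's actual code, only the same arithmetic.

```python
from itertools import accumulate

# Training rows from the question.
X = [[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]]
columns = list(zip(*X))  # transpose rows into columns

# Same numbers as n_values_: largest value in each column, plus one.
n_values = [max(col) + 1 for col in columns]

# Same numbers as feature_indices_: running total of n_values, starting at 0.
feature_indices = [0] + list(accumulate(n_values))

# Columns that are kept: block offset + each value seen during fit.
active = [offset + value
          for offset, col in zip(feature_indices, columns)
          for value in sorted(set(col))]

print(n_values)         # [3, 4, 6, 13]
print(feature_indices)  # [0, 3, 7, 13, 26]
print(len(active))      # 11
```

So the 26 possible columns shrink to 11, which is the width of the encoded vector.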

Your example basically says that:

feature 1 can take 3 values (0,1,2)
feature 2 can take 2 values (2,3)
feature 3 can take 4 values (1,2,4,5)
feature 4 can take 2 values (3,12)
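
If you want to check these counts yourself, a small sketch like the following should do it. It uses plain Python rather than any scikit-learn attribute, and the name `observed` is just for illustration.

```python
# Training rows from the question.
X = [[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]]

# zip(*X) transposes rows into columns; keep the sorted unique values of each.
observed = [sorted(set(column)) for column in zip(*X)]

for number, values in enumerate(observed, start=1):
    print("feature %d can take %d values %s" % (number, len(values), values))
```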

This ends up with a total of 11 values (3 + 2 + 4 + 2). Thus, the output from the OneHotEncoder ([[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]) has 11 values, and it can be split into 4 sections:

[0. 0. 1.] is the encoding for feature 1
[0. 1.] is the encoding for feature 2
[0. 0. 0. 1.] is the encoding for feature 3
[1. 0.] is the encoding for feature 4

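The position of the "1." within each section tells you which of the observed values the feature took. For example, [0. 1.] for feature 2 means the second observed value, which is 3 (try to match the example before encoding and after encoding).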
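
As a sanity check, you can rebuild the encoded vector by hand. This is only a sketch of the idea, not how scikit-learn implements it, and the helper `one_hot` is a made-up name.

```python
# Training rows and the sample to encode, both from the question.
X = [[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]]
sample = [2, 3, 5, 3]

# Sorted unique values seen in each column during "fit".
categories = [sorted(set(column)) for column in zip(*X)]

def one_hot(row, categories):
    """Concatenate one block per feature, with a 1.0 at the value's position."""
    encoded = []
    for value, seen in zip(row, categories):
        block = [0.0] * len(seen)
        block[seen.index(value)] = 1.0  # ValueError if the value was never seen
        encoded.extend(block)
    return encoded

encoded = one_hot(sample, categories)
print(encoded)
# [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0]
```

The printed list matches Encoded_Vector from the question, section by section.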
