
I am new to machine learning and scikit-learn. I was going through the documentation and tried OneHotEncoder() on some sample data. Can someone please explain what encoder.feature_indices_ represents, and how I end up with Encoded_Vector as [[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]? Any help is appreciated. Thanks!

>>> from sklearn import preprocessing
>>> encoder = preprocessing.OneHotEncoder()
>>> encoder.fit([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4,3]])
OneHotEncoder(categorical_features='all', dtype=<type 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> encoder.n_values_
array([ 3,  4,  6, 13])    
>>> encoder.feature_indices_
array([ 0,  3,  7, 13, 26])
>>> vector_encoded = encoder.transform([[2,3,5,3]]).toarray()
>>> print "\nEncoded_Vector =",vector_encoded
Encoded_Vector = [[ 0.  0.  1.  0.  1.  0.  0.  0.  1.  1.  0.]]
>>>

My understanding so far is:

Input:

0 2 1 12

1 3 5 3

2 3 2 12

1 2 4 3

This is 4 columns (features) and 4 rows. Each column has a different number of unique values. If I run:

encoder.n_values_

It gives: array([ 3, 4, 6, 13])

So categories for each feature are:

feature 1 can take 3 values : 0 1 2

feature 2 can take 4 values : 0 1 2 3

feature 3 can take 6 values : 0 1 2 3 4 5

feature 4 can take 13 values : 0 1 2 3 4 5 6 7 8 9 10 11 12


1 Answer


Even though you said that your features can take 3, 4, 6 and 13 values respectively, the sample data you provided ([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]]) does not cover all of those values.

Your example basically says that:

  • feature 1 can take 3 values (0,1,2)
  • feature 2 can take 2 values (2,3)
  • feature 3 can take 4 values (1,2,4,5)
  • feature 4 can take 2 values (3,12)
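To double-check the list above, here is a minimal sketch in plain Python (no scikit-learn needed) that collects the distinct values seen in each column of the sample data. The variable names `data`, `seen` and `total_columns` are just illustrative choices of mine.

```python
# Sample data from the question: 4 rows, 4 features
data = [[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]]

# zip(*data) transposes the rows into columns (one tuple per feature)
seen = [sorted(set(column)) for column in zip(*data)]

for i, values in enumerate(seen, start=1):
    print("feature", i, "takes", len(values), "values:", values)

# Number of one-hot columns if only the observed values get a column
total_columns = sum(len(values) for values in seen)
print("total encoded columns:", total_columns)
```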

These counts add up to 3 + 2 + 4 + 2 = 11 values. Thus, the output of the OneHotEncoder ([[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]) has 11 entries, and it can be split into 4 sections:

  • [0. 0. 1.] is the encoding for feature 1
  • [0. 1.] is the encoding for feature 2
  • [0. 0. 0. 1.] is the encoding for feature 3
  • [1. 0.] is the encoding for feature 4

The position of the "1." within each section tells you the value of that feature (try matching the example before and after encoding).
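As for feature_indices_: as far as I understand the old encoder with n_values='auto', n_values_ is the largest value in each column plus one, and feature_indices_ is the running total of those numbers, i.e. the offset at which each feature's block would start if every possible value got its own column (26 columns here). Only the columns that actually occurred during fit are kept, which leaves 11. The sketch below imitates that logic in plain Python; it is my reconstruction for illustration, not the library's code, and the names `active` and `encode` are made up.

```python
data = [[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]]
columns = list(zip(*data))  # one tuple per feature

# Like n_values_: largest value in each column, plus one -> [3, 4, 6, 13]
n_values = [max(col) + 1 for col in columns]

# Like feature_indices_: running total of n_values -> [0, 3, 7, 13, 26]
feature_indices = [0]
for n in n_values:
    feature_indices.append(feature_indices[-1] + n)

# Of the 26 possible slots, keep only those that occurred in the fitted data
active = sorted({feature_indices[i] + v
                 for i, col in enumerate(columns) for v in col})

def encode(row):
    """One-hot encode a row, restricted to the active slots (hypothetical helper)."""
    hot = {feature_indices[i] + v for i, v in enumerate(row)}
    return [1.0 if slot in hot else 0.0 for slot in active]

encoded = encode([2, 3, 5, 3])
print(encoded)
```

The printed list should line up with the Encoded_Vector from the question. Note that this sketch does not handle values that were never seen during fit.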
