I am new to the machine learning library scikit-learn. I was going through the documentation and tried OneHotEncoder() with some sample data. Can someone please explain what encoder.feature_indices_ represents, and how transform() produces Encoded_Vector = [[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]? Any help is appreciated. Thanks!
>>> from sklearn import preprocessing
>>> encoder = preprocessing.OneHotEncoder()
>>> encoder.fit([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4,3]])
OneHotEncoder(categorical_features='all', dtype=<type 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> encoder.n_values_
array([ 3, 4, 6, 13])
>>> encoder.feature_indices_
array([ 0, 3, 7, 13, 26])
>>> vector_encoded = encoder.transform([[2,3,5,3]]).toarray()
>>> print "\nEncoded_Vector =",vector_encoded
Encoded_Vector = [[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
>>>
My understanding so far is as follows.
Input:
0 2 1 12
1 3 5 3
2 3 2 12
1 2 4 3
This is 4 rows and 4 columns (features). Each column has a different number of unique values. If I run:
encoder.n_values_
it gives: array([ 3, 4, 6, 13])
So the categories for each feature are:
feature 1 can take 3 values : 0 1 2
feature 2 can take 4 values : 0 1 2 3
feature 3 can take 6 values : 0 1 2 3 4 5
feature 4 can take 13 values : 0 1 2 3 4 5 6 7 8 9 10 11 12
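
To make the question concrete, the numbers above can be reproduced with plain NumPy. This is only a sketch of what the encoder appears to be doing, not scikit-learn's actual implementation. It assumes that n_values_ is the per-column maximum plus one, that feature_indices_ is the running total of those sizes (the start offset of each column's block), and that the transformed output keeps only the columns for values actually seen during fit. The names n_values, feature_indices and encode are hypothetical helpers introduced here for illustration.

```python
import numpy as np

X = np.array([[0, 2, 1, 12],
              [1, 3, 5, 3],
              [2, 3, 2, 12],
              [1, 2, 4, 3]])

# Assumption: each column is treated as ranging over 0..max, so its size is max + 1.
n_values = X.max(axis=0) + 1                      # 3, 4, 6, 13

# Assumption: cumulative offsets mark where each column's block starts.
feature_indices = np.concatenate(([0], np.cumsum(n_values)))  # 0, 3, 7, 13, 26

def encode(row):
    """One-hot encode a single row, keeping only values seen in X (hypothetical helper)."""
    full = np.zeros(feature_indices[-1])
    # Position of each value = start offset of its column + the value itself.
    full[feature_indices[:-1] + np.asarray(row)] = 1.0
    # Assumption: columns for values never seen during fit are dropped.
    seen = np.unique((X + feature_indices[:-1]).ravel())
    return full[seen]

encoded = encode([2, 3, 5, 3])
print(encoded)
```

Under these assumptions, the 26 possible positions shrink to 11 because the four columns contain only 3, 2, 4 and 2 distinct values respectively, which would account for the length of Encoded_Vector.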