Let's start by writing down what you would expect (assuming you know what one-hot encoding means).
unencoded
f0 f1 f2
0, 0, 3
1, 1, 0
0, 2, 1
1, 0, 2
encoded
|f0 |  |  f1  |  |   f2   |
1, 0,  1, 0, 0,  0, 0, 0, 1
0, 1,  0, 1, 0,  1, 0, 0, 0
1, 0,  0, 0, 1,  0, 1, 0, 0
0, 1,  1, 0, 0,  0, 0, 1, 0
To get encoded, fit the encoder on unencoded (and then call enc.transform on the same data):
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
This assumes you use the default n_values='auto'. With n_values='auto' you're specifying that the values each feature (each column of unencoded) can take on are inferred from the columns of the data handed to fit.
That brings us to enc.n_values_
from the docs:
Number of values per feature.
enc.n_values_
array([2, 3, 4])
The above means that f0 (column 1) can take on 2 values (0, 1), f1 can take on 3 values (0, 1, 2), and f2 can take on 4 values (0, 1, 2, 3).
Indeed, these are the values taken by the features f0, f1, f2 in the unencoded feature matrix.
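To make the inference step concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: the helper name infer_n_values is made up for illustration, and it assumes every feature holds non-negative integers starting at 0, so that the count is the largest value in the column plus one.

```python
# Same data that was handed to enc.fit above.
X = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]


def infer_n_values(rows):
    """Hypothetical helper: number of values per column, taken as max + 1.

    Assumes each column contains non-negative integers starting at 0.
    """
    columns = zip(*rows)  # transpose rows into columns
    return [max(column) + 1 for column in columns]


n_values = infer_n_values(X)
print(n_values)  # [2, 3, 4]
```

For this data the result matches the enc.n_values_ array shown above.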
Then:
enc.feature_indices_
array([0, 2, 5, 9])
from the docs:
Indices to feature ranges. Feature i in the original data is mapped to
features from feature_indices_[i] to feature_indices_[i+1] (and then
potentially masked by active_features_ afterwards)
This gives the range of positions (in the encoded space) that each of the features f0, f1, f2 occupies.
f0: [0, 1], f1: [2, 3, 4], f2: [5, 6, 7, 8]
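The array [0, 2, 5, 9] is just a running total of the per-feature value counts with a leading 0. A small sketch in plain Python, using the f0, f1, f2 naming from the unencoded table (the helper name feature_offsets is hypothetical, not part of scikit-learn):

```python
from itertools import accumulate

# Per-feature value counts, as reported by enc.n_values_ above.
n_values = [2, 3, 4]


def feature_offsets(counts):
    """Hypothetical helper: start position of each feature, plus the total width."""
    return [0] + list(accumulate(counts))


feature_indices = feature_offsets(n_values)
print(feature_indices)  # [0, 2, 5, 9]

# Feature i occupies positions feature_indices[i] .. feature_indices[i + 1] - 1.
for i in range(len(n_values)):
    start, stop = feature_indices[i], feature_indices[i + 1]
    print(f"f{i}: {list(range(start, stop))}")
# f0: [0, 1]
# f1: [2, 3, 4]
# f2: [5, 6, 7, 8]
```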
Mapping the vector [0, 1, 1] into one-hot encoded space (under the mapping we got from enc.fit) gives:
1, 0, 0, 1, 0, 0, 1, 0, 0
How?
The first element, 0, belongs to f0, so it maps to position 0 (if the element were 1 instead of 0, it would map to position 1).
The next element, 1, maps to position 3, because f1 starts at position 2 and 1 is the second possible value f1 can take on.
Finally, the third element, 1, maps to position 6, since it is the second possible value f2 can take on and f2 starts at position 5.
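The three steps above boil down to "start position of the feature plus the element's value". A minimal sketch in plain Python, assuming integer features that start at 0 (the function name one_hot is made up for illustration and is not the scikit-learn API):

```python
# Start positions per feature, as in enc.feature_indices_ above.
feature_indices = [0, 2, 5, 9]


def one_hot(row, offsets):
    """Hypothetical helper: set one position per feature to 1.

    Assumes row[i] is an integer in range(offsets[i + 1] - offsets[i]).
    """
    encoded = [0] * offsets[-1]  # total width of the encoded space
    for i, value in enumerate(row):
        encoded[offsets[i] + value] = 1
    return encoded


print(one_hot([0, 1, 1], feature_indices))  # [1, 0, 0, 1, 0, 0, 1, 0, 0]
```

Applying the same function to each row of the unencoded matrix reproduces the encoded matrix at the top.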
Hope that clears up some stuff.