What is the default rule used by sklearn's OrdinalEncoder to determine the order of the categories when categories='auto'?
Is it just sorted lexicographically? I couldn't find it in the docs.
Interesting question, let's try it out:
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
# Create a simple dataset
data = np.array([['Medium'], ['High'], ['Low'], ['High'], ['Medium'], ['Low']])
# Create an instance of OrdinalEncoder
encoder = OrdinalEncoder(categories='auto')
# Fit and transform the data
encoded_data = encoder.fit_transform(data)
print("Encoded Data:", encoded_data)
print("\nCategories:", encoder.categories_)
# Check if the categories are sorted alphabetically
if (np.array(np.unique(data)) == encoder.categories_[0]).all():
    print("\nThe categories are sorted alphabetically.")
# Check if the categories are sorted by the order they appear in the input data
elif (np.array([x for i, x in enumerate(data[:, 0]) if x not in data[:, 0][:i]]) == encoder.categories_[0]).all():
    print("\nThe categories are sorted by the order they appear in the input data.")
else:
    print("\nThe ordering of categories doesn't match either alphabetical or input order.")
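Note that np.unique() returns the sorted unique elements of an array, so comparing against it tests for sorted order. For strings that sort is by Unicode code point, so it is case-sensitive: uppercase letters come before lowercase ones.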
Output:
Encoded Data: [[2.]
 [0.]
 [1.]
 [0.]
 [2.]
 [1.]]

Categories: [array(['High', 'Low', 'Medium'], dtype='<U6')]

The categories are sorted alphabetically.
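So with categories='auto' the order is simply the sorted order of the unique values, which here puts 'Medium' above 'High'. If you need a meaningful order, you can pass it explicitly. Below is a minimal sketch of both points: the mixed-case fruit labels are made up for illustration (they are not from the question), and the explicit Low < Medium < High order is my assumption about what you'd want for this data.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical mixed-case labels, only to show that the default sort is case-sensitive
mixed = np.array([['apple'], ['Banana'], ['cherry']])
auto_encoder = OrdinalEncoder(categories='auto').fit(mixed)
auto_categories = auto_encoder.categories_[0].tolist()
print(auto_categories)  # ['Banana', 'apple', 'cherry']

# Same data as above, but with an explicit order (assumed: Low < Medium < High).
# categories takes one list per feature/column.
data = np.array([['Medium'], ['High'], ['Low'], ['High'], ['Medium'], ['Low']])
ordered_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
ordered_codes = ordered_encoder.fit_transform(data).ravel().tolist()
print(ordered_codes)  # [1.0, 2.0, 0.0, 2.0, 1.0, 0.0]
```

With the explicit list, the integer codes follow the position in the list you passed rather than the sorted order.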