What is the default rule used by sklearn's OrdinalEncoder to determine the order of the categories when categories='auto'?

Is it just sorted lexicographically? I couldn't find it in the docs.

nivniv

1 Answer

Interesting question, let's try it out:

from sklearn.preprocessing import OrdinalEncoder
import numpy as np

# Create a simple dataset
data = np.array([['Medium'], ['High'], ['Low'], ['High'], ['Medium'], ['Low']])

# Create an instance of OrdinalEncoder
encoder = OrdinalEncoder(categories='auto')

# Fit and transform the data
encoded_data = encoder.fit_transform(data)

print("Encoded Data:", encoded_data)

print("\nCategories:", encoder.categories_)

# Check if the categories are sorted alphabetically
if (np.array(np.unique(data)) == encoder.categories_[0]).all():
    print("\nThe categories are sorted alphabetically.")
# Check if the categories are sorted by the order they appear in the input data
elif (np.array([x for i, x in enumerate(data[:, 0]) if x not in data[:, 0][:i]]) == encoder.categories_[0]).all():
    print("\nThe categories are sorted by the order they appear in the input data.")
else:
    print("\nThe ordering of categories doesn't match either alphabetical or input order.")

Note that np.unique() returns the unique elements of an array in sorted order, so it serves as the reference for the first check.

Output:

Encoded Data: [[2.]
 [0.]
 [1.]
 [0.]
 [2.]
 [1.]]

Categories: [array(['High', 'Low', 'Medium'], dtype='<U6')]

The categories are sorted alphabetically.
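
So with categories='auto' the learned categories come out in sorted order. Two practical consequences, as a minimal sketch: string sorting is by Unicode code point, so uppercase letters sort before lowercase ones, and if you want a meaningful order such as Low < Medium < High you can pass it explicitly through the categories parameter. The mixed-case fruit labels are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Illustrative mixed-case labels: 'B' sorts before 'a' and 'c' by code point.
mixed = np.array([['apple'], ['Banana'], ['cherry']])
auto_encoder = OrdinalEncoder(categories='auto').fit(mixed)
auto_categories = auto_encoder.categories_[0].tolist()
print("Auto categories:", auto_categories)

# Same data as above, but with an explicit order: one list per feature column.
data = np.array([['Medium'], ['High'], ['Low'], ['High'], ['Medium'], ['Low']])
ordered_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
ordered_codes = ordered_encoder.fit_transform(data).ravel().tolist()
print("Ordered codes:", ordered_codes)
```

With the explicit list, each value is encoded as its position in that list rather than its position in sorted order.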
DataJanitor