I have a dataframe, X_train. One of its columns, locale, has the unique values ['Regional', 'Local', 'National'].

I am trying to make this column into an Ordered Categorical variable, with the correct order being from smallest to largest: ['Local', 'Regional', 'National'] = [0, 1, 2]

However, it is not working. I have seen the other threads about similar problems, but their solutions are not working for me. I'm using factorize, but I'm open to customizing the order of LabelEncoder too, if that option exists now.

This is my code:

print(X_train['locale'][:10])
cat = pd.Categorical(X_train['locale'], categories=['Local', 'Regional', 'National'])
codes, uniques = pd.factorize(cat)
print(codes[:10])

Output (should be all 2's if every value is National):

[screenshot of the printed output]

X_train dataframe:

{'id': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
 'date': {0: Timestamp('2013-01-01 00:00:00'),
  1: Timestamp('2013-01-01 00:00:00'),
  2: Timestamp('2013-01-01 00:00:00'),
  3: Timestamp('2013-01-01 00:00:00'),
  4: Timestamp('2013-01-01 00:00:00')},
 'store_nbr': {0: '1', 1: '1', 2: '1', 3: '1', 4: '1'},
 'family': {0: 'AUTOMOTIVE',
  1: 'BABY CARE',
  2: 'BEAUTY',
  3: 'BEVERAGES',
  4: 'BOOKS'},
 'sales': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
 'onpromotion': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
 'city': {0: 'Quito', 1: 'Quito', 2: 'Quito', 3: 'Quito', 4: 'Quito'},
 'state': {0: 'Pichincha',
  1: 'Pichincha',
  2: 'Pichincha',
  3: 'Pichincha',
  4: 'Pichincha'},
 'store_type': {0: 'D', 1: 'D', 2: 'D', 3: 'D', 4: 'D'},
 'cluster': {0: '13', 1: '13', 2: '13', 3: '13', 4: '13'},
 'dcoilwtico': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
 'transactions': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
 'holiday_type': {0: 'Holiday',
  1: 'Holiday',
  2: 'Holiday',
  3: 'Holiday',
  4: 'Holiday'},
 'locale': {0: 'National',
  1: 'National',
  2: 'National',
  3: 'National',
  4: 'National'},
 'locale_name': {0: 'Ecuador',
  1: 'Ecuador',
  2: 'Ecuador',
  3: 'Ecuador',
  4: 'Ecuador'},
 'description': {0: 'Primer dia del ano',
  1: 'Primer dia del ano',
  2: 'Primer dia del ano',
  3: 'Primer dia del ano',
  4: 'Primer dia del ano'},
 'transferred': {0: False, 1: False, 2: False, 3: False, 4: False},
 'year': {0: '2013', 1: '2013', 2: '2013', 3: '2013', 4: '2013'},
 'month': {0: '1', 1: '1', 2: '1', 3: '1', 4: '1'},
 'week': {0: '1', 1: '1', 2: '1', 3: '1', 4: '1'},
 'quarter': {0: '1', 1: '1', 2: '1', 3: '1', 4: '1'},
 'day_of_week': {0: 'Tuesday',
  1: 'Tuesday',
  2: 'Tuesday',
  3: 'Tuesday',
  4: 'Tuesday'}}
Nick ODell
Katsu

1 Answer

Use

print(cat.codes)

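instead. cat.codes holds each value's position in the categories list you passed to pd.Categorical. pd.factorize() re-factorizes the values: it ignores that category order and assigns new integer codes in order of first appearance, so the labels can come out in a different order than the one you specified.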
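
To make the difference concrete, here is a minimal sketch. The sample Series is made up for illustration and stands in for X_train['locale']; ordered=True is an addition that was not in the question's code, included on the assumption that an ordered categorical is what is wanted.

```python
import pandas as pd

# Hypothetical sample data standing in for X_train['locale'].
locale = pd.Series(['National', 'National', 'Regional', 'Local', 'National'])

# ordered=True (an addition, not in the original code) marks
# Local < Regional < National as a real ordering.
cat = pd.Categorical(
    locale,
    categories=['Local', 'Regional', 'National'],
    ordered=True,
)

# .codes gives each value's position in the `categories` list.
codes = cat.codes
print(codes.tolist())  # [2, 2, 1, 0, 2]

# pd.factorize numbers the values in order of first appearance instead,
# so 'National' becomes 0 here because it shows up first.
refactorized, uniques = pd.factorize(cat)
print(refactorized.tolist())  # [0, 0, 1, 2, 0]
```

To keep the encoded values, something like X_train['locale_code'] = cat.codes should work; the column name locale_code is just a placeholder.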

More documentation

Nick ODell