
Suppose one of the features in my data set is a score stored as a categorical string, like:

Score
X1c
X3a
X1a
X2b
X4
X1a
X1b
X4

Here X1a is the weakest, followed by X1b, X1c, X2a, X2b ... X4, with X4 being the strongest. How can I encode it as integers such that X1a gets the lowest integer and X4 the highest? I'm planning to use a random forest classifier. Also, the training set is a separate data set, so this encoding must stay the same for new data sets.

Priya
bloodynri
  • What have you tried to achieve this task? – xyzjayne Jul 18 '18 at 16:29
  • [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html), which scikit-learn provides for this very thing, orders the data alphabetically, as you need. – Vivek Kumar Jul 19 '18 at 09:22

1 Answer


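You can try using rank with method='dense'. It ranks the strings in lexicographic order, which matches the desired order here. Note that the ranks depend on which values are present in the column (X2a is absent from the sample, so X2b gets 4.0):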

df['Score_int'] = df.Score.rank(method='dense')

Output:

  Score  Score_int
0   X1c        3.0
1   X3a        5.0
2   X1a        1.0
3   X2b        4.0
4    X4        6.0
5   X1a        1.0
6   X1b        2.0
7    X4        6.0
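
Because the question asks for an encoding that stays the same across data sets, a fixed, explicit ordering may be safer than ranks computed from whatever values happen to be present. The following is only a minimal sketch of that idea using an ordered pandas categorical. The `LEVELS` list is an assumption made for illustration: the question elides the full set of scores, so the real list has to be filled in by hand.

```python
import pandas as pd

# Hypothetical, illustrative ordering from weakest to strongest.
# The question does not give the full list, so replace this with the real one.
LEVELS = ["X1a", "X1b", "X1c", "X2a", "X2b", "X3a", "X4"]

# An ordered categorical dtype fixes the string -> integer mapping up front.
score_type = pd.CategoricalDtype(categories=LEVELS, ordered=True)

df = pd.DataFrame(
    {"Score": ["X1c", "X3a", "X1a", "X2b", "X4", "X1a", "X1b", "X4"]}
)

# .cat.codes is the position of each value in LEVELS (0 for the weakest).
df["Score_int"] = df["Score"].astype(score_type).cat.codes

# A separate data set gets the same integers because it reuses score_type.
# Values missing from LEVELS are encoded as -1.
new_scores = pd.Series(["X4", "X1a", "Z9"])
new_codes = new_scores.astype(score_type).cat.codes
```

With this approach X2b maps to the same integer whether or not X2a appears in a given data set, and a code of -1 flags a score that was not listed in `LEVELS`.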
Scott Boston