
I'm trying to encode data from a CSV file. The TA in my class recommended LabelEncoder from sklearn. One column is named education_level, and I need to encode it in the order "High, Medium, Low". But LabelEncoder.fit_transform sorts the labels by default (for strings, by character code), which means it encodes them in the order "High, Low, Medium".

I couldn't find a way to encode the column with a self-defined order. My code is attached below.

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# load train.csv
df = pd.read_csv('./train.csv')
objfeatures = df.select_dtypes(include="object").columns
le = preprocessing.LabelEncoder()

# Use Label Encoder
# TODO 
# Any Better Way to encode the data? How to deal with missing values
for feat in objfeatures:
    df[feat] = le.fit_transform(df[feat].astype(str))

1 Answer


You should use OrdinalEncoder and define the categories for each column as a list of arrays, one array per column (see its help page):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'education_level': ['High', 'Medium', 'Low', 'Medium'],
                   'var': ['a', 'b', 'c', 'b']})

Define the order for the first column, followed by the order for the second column:

oe = OrdinalEncoder(categories=[['High','Medium','Low'],['c','b','a']])

df

  education_level var
0            High   a
1          Medium   b
2             Low   c
3          Medium   b

oe.fit_transform(df)
 
array([[0., 2.],
       [1., 1.],
       [2., 0.],
       [1., 1.]])
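
The second column above is only there to show that each column gets its own order. If just one column needs a custom order, a minimal sketch (assuming the same toy DataFrame as above, not your real train.csv) is to pass a single list of categories and select the column with double brackets, so the encoder still receives 2D input:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data matching the example above (an assumption, not the real train.csv).
df = pd.DataFrame({'education_level': ['High', 'Medium', 'Low', 'Medium'],
                   'var': ['a', 'b', 'c', 'b']})

# One list of categories per encoded column; only one column is encoded here.
oe = OrdinalEncoder(categories=[['High', 'Medium', 'Low']])

# Double brackets select a one-column DataFrame (2D), which the encoder expects.
encoded = oe.fit_transform(df[['education_level']])

# Flatten the (n, 1) float array and store it back as integers.
df['education_level'] = encoded.ravel().astype(int)
print(df)
```

Here High maps to 0, Medium to 1 and Low to 2, and the var column is left untouched.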
StupidWolf
  • Sorry for the late reply. What does the 2nd column stand for? Is it there to make the mapping a one-to-one function? I read the help page but still couldn't figure it out. So in my case, do I need to copy "education_level" into a 2D array and then encode it? Is there no other method that just transforms the level to an int? Thank you very much! – ExcitedMail Nov 23 '21 at 03:33