I have a dataset, of which I have attached an image.

The sets of unique values in Origin and Dest are the same. When label encoding those columns, I expected a value such as ATL to get the same encoding in 'Origin' and 'Dest', but it turns out that the following code:

label_encoder = LabelEncoder()
flight_f['UniqueCarrier'] = label_encoder.fit_transform(flight_f['UniqueCarrier'])
flight_f['Origin'] = label_encoder.fit_transform(flight_f['Origin'])
flight_f['Dest'] = label_encoder.fit_transform(flight_f['Dest'])

gives a particular value different encodings in the two columns. And this is just the training set. I think that in the test set I might get different values too, which will hamper the predictive analysis.

Can anyone suggest a solution, please?

Utkarsh A

2 Answers

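Instead of fitting the label encoder separately on each column like that, you probably want to fit it once on the values of both columns together (this assumes pandas is imported as pd):

le = LabelEncoder().fit(pd.concat([df['Origin'], df['Dest']]))

Every call to fit_transform refits the encoder and throws away the previous mapping, so calling it once per column can give the same value a different code in each column. That's why, instead of using fit_transform on each column, you are probably better off calling fit once and then using transform.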

Here is an example:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit once on the training and test values together;
# a second call to fit would discard the first mapping
l_train = [1, 2, 3, 4, 5]
l_test = [6, 7, 8]
le.fit(l_train + l_test)

le.transform(l_train)
# array([0, 1, 2, 3, 4])
le.transform([2, 3, 4, 5, 6, 7])
# array([1, 2, 3, 4, 5, 6])
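
To see why a single fit matters, here is a small sketch (the airport codes are made-up sample values, not taken from the asker's data). It shows that a second call to fit replaces the classes learned by the first one, and that transform then rejects labels from the first fit with a ValueError:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# First fit: classes_ holds the sorted unique labels seen in this call.
le.fit(["ATL", "DFW", "PIT"])
first_classes = le.classes_.tolist()

# Second fit: the encoder is refitted from scratch, the first mapping is gone.
le.fit(["CLE", "RDU"])
second_classes = le.classes_.tolist()

# A label the encoder is not currently fitted on cannot be transformed.
try:
    le.transform(["ATL"])
    unseen_ok = True
except ValueError:
    unseen_ok = False
```

So the encoder only ever remembers the data from its most recent fit, which is the reason to fit it on all the values at once.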
Felix Filipi

I think what you need is "stack()", which lets you fit a single encoder on the values of both columns:

from sklearn.preprocessing import LabelEncoder
import pandas as pd
label_encoder = LabelEncoder()
df = pd.DataFrame(data=[[8, "ATL", "DFW"], 
                        [9, "PIT", "ATL"],
                        [1, "DFW", "ATL"],
                        [5, "RDU", "CLE"]], columns=["Month", "Origin", "Dest"])

df
Month  Origin  Dest
8      ATL     DFW
9      PIT     ATL
1      DFW     ATL
5      RDU     CLE

label_encoder.fit(df[['Origin','Dest']].stack().unique())

df['Origin_encode'] = label_encoder.transform(df['Origin'])
df['Dest_encode'] = label_encoder.transform(df['Dest'])
df
Month  Origin  Dest  Origin_encode  Dest_encode
8      ATL     DFW   0              2
9      PIT     ATL   3              0
1      DFW     ATL   2              0
5      RDU     CLE   4              1
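
For the test-set worry in the question, one possible approach is to reuse the encoder fitted on the training data and check for airports it has never seen before calling transform. This is only a rough sketch: the train and test frames, the SFO value and the variable names are all hypothetical, and skipping rows with unseen airports is just one way to deal with them.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and test frames, for illustration only.
train = pd.DataFrame({"Origin": ["ATL", "PIT", "DFW"],
                      "Dest": ["DFW", "ATL", "CLE"]})
test = pd.DataFrame({"Origin": ["CLE", "ATL"],
                     "Dest": ["PIT", "SFO"]})

cols = ["Origin", "Dest"]

# Fit one encoder on the union of both training columns.
encoder = LabelEncoder().fit(train[cols].stack().unique())
known = encoder.classes_.tolist()

# Airports that appear in the test set but not in the training set.
unseen = sorted(set(test[cols].stack()) - set(known))

# Keep only the test rows where every airport is known to the encoder.
mask = test[cols].isin(known).all(axis=1)
encoded = test.loc[mask, cols].copy()
for col in cols:
    encoded[col] = encoder.transform(encoded[col])
```

Because the encoder is fitted once and only transform is called afterwards, a given airport gets the same code in Origin and Dest, in both the training and the test data.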
Jinhang Jiang