Considering multiple independent categorical features in a data set, we want to encode multiple variables in each category. Should the dummy variables be different in each category? or is it reasonable to start the dummies in each category from 0? Consider the following example:
Distance_Group .... Airlines_with_HIGHEST_fare .......... dummies_1
============ .... ======================= ......... ========
G .......................... Atlantic Airways ................................. 0
A .......................... Bahamas Air ...................................... 1
B .......................... Bahamas Air ...................................... 1
C .......................... Jet Blue ............................................ 2
A .......................... United Airline ..................................... 3
Distance_group .... Airlines_with_LOWEST_fare ......... dummies_2
============ ....====================== ..........========
F ........................... Jet Blue .......................................... 0
E ........................... United Airline .................................. 1
A ........................... Lufthansa ........................................ 2
G .......................... Georgia Airways .............................. 3
Starting each category from 0, in first category, Jet Blue is corresponding to dummy variable: 2, in second one it is corresponding to dummy variable: 0.
Is this the right encoding for the two categories?
In case the query is needed for clarifying the example:
This Python query loops over all unique type categories while counting up.
map_dict1 = {}
for token, value in enumerate(Data['Airlines_with_HIGHEST_fare'].unique()):
map_dict1[value] = token
Data['Airlines_with_HIGHEST_fare'].replace(map_dict1, inplace=True)
The same logic also applies for the Airlines with lowest fare category for encoding airlines.
I am trying to cluster the airline fares, based on some numerical features like: Distance_Group, # passengers, etc. The above example is the two categorical features (= name of Airlines). All these features are input cells of a neural network, that's why they should be numerical. Because Neural Networks do not accept categorical variables.