
Given a data set with several independent categorical features, each taking several values, we want to encode every feature numerically. Should the dummy variables be distinct across features, or is it reasonable to start the dummies for each feature from 0? Consider the following example:

Distance_Group   Airlines_with_HIGHEST_fare   dummies_1
==============   ==========================   =========
G                Atlantic Airways             0
A                Bahamas Air                  1
B                Bahamas Air                  1
C                Jet Blue                     2
A                United Airline               3

Distance_Group   Airlines_with_LOWEST_fare    dummies_2
==============   =========================    =========
F                Jet Blue                     0
E                United Airline               1
A                Lufthansa                    2
G                Georgia Airways              3

When each feature starts from 0, Jet Blue corresponds to dummy value 2 in the first feature and to dummy value 0 in the second.

Is this the right encoding for the two features?

In case the code helps clarify the example:

This Python snippet enumerates the unique values of the column and assigns each one an integer, counting up from 0.

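    # Data is an existing pandas DataFrame.
    # Map each unique airline name to an integer, in order of first appearance.
    map_dict1 = {}
    for token, value in enumerate(Data['Airlines_with_HIGHEST_fare'].unique()):
        map_dict1[value] = token

    # Replace the names with their codes, assigning the result back to the column.
    Data['Airlines_with_HIGHEST_fare'] = Data['Airlines_with_HIGHEST_fare'].map(map_dict1)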

The same logic is applied to the Airlines_with_LOWEST_fare column to encode its airlines.
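To make the per-column encoding concrete, here is a minimal sketch that reproduces the two tables. It uses `pd.factorize`, swapped in here as an equivalent of the enumerate loop (codes are assigned in order of first appearance); the two Series are hypothetical stand-ins that hold only the values shown in the tables.

```python
import pandas as pd

# Hypothetical stand-ins for the two columns, using the values from the tables.
highest = pd.Series(
    ["Atlantic Airways", "Bahamas Air", "Bahamas Air", "Jet Blue", "United Airline"]
)
lowest = pd.Series(["Jet Blue", "United Airline", "Lufthansa", "Georgia Airways"])

# factorize assigns integer codes in order of first appearance, per column.
high_codes, high_names = pd.factorize(highest)
low_codes, low_names = pd.factorize(lowest)

# Name -> code lookup for each column.
high_map = {name: code for code, name in enumerate(high_names)}
low_map = {name: code for code, name in enumerate(low_names)}

print(high_codes.tolist())
print(low_codes.tolist())
print(high_map["Jet Blue"], low_map["Jet Blue"])
```

Because each column is encoded on its own, the same airline can receive a different integer in each column, which is exactly the situation the question asks about.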

I am trying to cluster the airline fares based on some numerical features such as Distance_Group, number of passengers, etc. The example above shows the two categorical features (the airline names). All of these features are inputs to a neural network, so they must be numerical, because neural networks do not accept categorical variables directly.
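Since the goal is numeric inputs for a network, one commonly used alternative to integer codes is one-hot (dummy) encoding, where every airline in every feature gets its own 0/1 column. The following is only a sketch under stated assumptions: the DataFrame `df` and the way its rows are paired are hypothetical, and the prefixes `high` and `low` are illustrative names, not part of the original data.

```python
import pandas as pd

# Hypothetical frame: the row pairing is made up for illustration only.
df = pd.DataFrame({
    "Airlines_with_HIGHEST_fare": [
        "Atlantic Airways", "Bahamas Air", "Jet Blue", "United Airline"
    ],
    "Airlines_with_LOWEST_fare": [
        "Jet Blue", "United Airline", "Lufthansa", "Georgia Airways"
    ],
})

# One 0/1 column per (feature, airline) pair; the prefixes keep the two
# features apart even when the same airline appears in both.
encoded = pd.get_dummies(
    df,
    columns=["Airlines_with_HIGHEST_fare", "Airlines_with_LOWEST_fare"],
    prefix=["high", "low"],
    dtype=int,
)

print(encoded.columns.tolist())
```

With this representation, Jet Blue appears as two separate columns (`high_Jet Blue` and `low_Jet Blue`), so the question of whether the integer codes of the two features overlap does not arise.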

  • We need some more context--what analysis on the data are you trying to perform? With which toolkit? With what algorithm? – nanofarad Sep 01 '19 at 22:50
  • I am working with airline data, to classify airline_ticket_fare, based on some features like different types of airports and Distance_Miles etc. I am using Keras package in Python. Here the two categorical variables that I am trying to encode are names of airlines: "airlines_with_lowest_fare" and "airlines_with_highest_fare", in the same Origin-Destination group. – Neela Rahimi Sep 01 '19 at 23:31
  • Imagine in "lowest_fare", Jetblue (1st row) is 0, United_Airline(2nd row) is 1. Then, if we start encoding each category from 0, in "highest_fare", 1st row: Lufthansa is 0 and so on... Do I need to start the first category from 0 to n and the second category from n+1 to the end? – Neela Rahimi Sep 01 '19 at 23:56
  • Please [edit] the question to include these details, as well as the Keras classifiers/code you intend to use. – nanofarad Sep 02 '19 at 02:49
