How to create dummy variables for prediction from user input (only one record)?

Question

I am trying to create a web application for predicting airline delays. I have trained my model offline on my computer, and am now trying to build a Flask app that makes predictions based on user input. For simplicity, let's say my model has 3 categorical variables: UNIQUE_CARRIER, ORIGIN and DEST. During training, I create dummy variables for all 3 using pandas:

df = pd.concat([df, pd.get_dummies(df['UNIQUE_CARRIER'], drop_first=True, prefix="UNIQUE_CARRIER")], axis=1)
df = pd.concat([df, pd.get_dummies(df['ORIGIN'], drop_first=True, prefix="ORIGIN")], axis=1)
df = pd.concat([df, pd.get_dummies(df['DEST'], drop_first=True, prefix="DEST")], axis=1)
df.drop(['UNIQUE_CARRIER', 'ORIGIN', 'DEST'], axis=1, inplace=True)

So now my feature vector has 297 entries (assuming there are 100 unique carriers and 100 different airports in my data). I saved my model using pickle, and am now trying to predict based on user input. The user input takes the form of 3 variables (origin, destination, carrier).

Obviously I cannot call pd.get_dummies on each user input, because each of the three fields would contain only 1 unique value. What is the most efficient way to convert the user input into the feature vector for my model?

Can you post what the head of `df` looks like after the processing shown above? — Alex, Jan 08 '17 at 20:37
I would suggest using [scikit-learn's `OneHotEncoder`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) instead of `get_dummies`. With this method you will build an object that can be used to transform new data. — Alex, Jan 08 '17 at 20:48
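
A minimal sketch of what this comment suggests, assuming scikit-learn is installed and using made-up carrier and airport codes. The encoder is fitted once on the training data, could be pickled together with the model, and is then reused on each single-row user input. Setting `handle_unknown="ignore"` makes values not seen during fitting encode as all zeros instead of raising. Unlike `drop_first=True`, this sketch keeps every category, so the model would have to be trained on the encoder's output rather than on the `get_dummies` columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training data; the codes are illustrative only.
train = pd.DataFrame({
    "UNIQUE_CARRIER": ["AA", "DL", "UA"],
    "ORIGIN": ["ATL", "JFK", "LAX"],
    "DEST": ["JFK", "LAX", "ATL"],
})

# Fit once at training time; unseen categories become all-zero blocks.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# At prediction time, wrap the single user record in a one-row DataFrame
# with the same column names and order used for fitting.
user_input = pd.DataFrame(
    [{"UNIQUE_CARRIER": "DL", "ORIGIN": "SFO", "DEST": "LAX"}],
    columns=train.columns,
)
features = encoder.transform(user_input).toarray()
print(features.shape)
```

Here "SFO" was never seen as an origin, so its three origin columns stay at zero while the carrier and destination each set one column.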

score 1 · Accepted Answer · answered Jan 08 '17 at 22:30

Since you are using pandas dummies and hence dense vectors, a good way to create a new vector would be to create a dict of terms:vector_index and then populate a zeros vector according to it, something along the lines of the following:

index_dict = dict(zip(df.columns,range(df.shape[1])))

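Now, when you have a new flight:

import numpy as np

# Assumes df holds only the feature columns, in the order the model was trained on
new_vector = np.zeros(len(index_dict))
# Column names carry the get_dummies prefix, e.g. "ORIGIN_" + origin
for key in ("ORIGIN_" + origin, "DEST_" + destination, "UNIQUE_CARRIER_" + carrier):
    try:
        new_vector[index_dict[key]] = 1
    except KeyError:
        # Value unseen in training, or the category removed by drop_first: leave zeros
        pass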
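
For reference, the same idea can be wrapped in a small helper. This is only a sketch under stated assumptions: `build_index` and `encode_flight` are hypothetical names, `TRAIN_COLUMNS` stands in for the feature columns of the training `df` (which would need to be saved alongside the pickled model), and the carrier and airport codes are made up for illustration.

```python
import numpy as np

# Stand-in for list(df.columns) after get_dummies; in the real app this list
# would be saved next to the pickled model. The codes here are illustrative.
TRAIN_COLUMNS = [
    "UNIQUE_CARRIER_DL", "UNIQUE_CARRIER_UA",
    "ORIGIN_JFK", "ORIGIN_LAX",
    "DEST_JFK", "DEST_LAX",
]


def build_index(columns):
    """Map each training column name to its position in the feature vector."""
    return {name: i for i, name in enumerate(columns)}


def encode_flight(index, carrier, origin, dest):
    """Return a 1 x n_features array for one user-supplied flight."""
    vector = np.zeros(len(index))
    for key in (f"UNIQUE_CARRIER_{carrier}", f"ORIGIN_{origin}", f"DEST_{dest}"):
        position = index.get(key)
        # A missing key means the value was unseen in training or was the
        # baseline category removed by drop_first; its dummies stay at zero.
        if position is not None:
            vector[position] = 1
    # Many estimators (scikit-learn's, for example) expect 2-D input.
    return vector.reshape(1, -1)


index = build_index(TRAIN_COLUMNS)
row = encode_flight(index, carrier="UA", origin="ATL", dest="LAX")
print(row)
```

The resulting row can then be passed to the unpickled model for prediction, assuming the model was trained on columns in exactly this order.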
