I am a data science beginner, and I have now built a model in Python.
In the data cleaning part, I had to drop some columns, add new columns, hash some columns into new columns, and convert some columns to numeric.
For example (not from any real project):
Before (original data columns): Name (str), City (str), State (str), Status (str), Gender (str), Salary (float)
After: City_hash (int), State_hash (int), City_State_hash (int, City and State combined), Status (int, the target variable), Gender (int), Salary (float)
The model's name is my_model. I want to test it now by passing a numpy array to the model.
The steps are below:
import numpy as np
features = np.array([[xx, xx, xx, xx, xx, ...]])  # each xx is a value to pass
# use the inputs to predict the output
prediction = my_model.predict(features)
print("Prediction: {}".format(prediction))
I just want to clarify what to put in features. Should the values be in the order of my "After" data, i.e. City_hash, State_hash, City_State_hash, etc.?
If yes, what about the hashed values? For example, the state of California (CA) is no longer "CA" but a hashed value. Do I have to pass the hashed value?
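To make the question concrete, here is a minimal sketch of what the input preparation might look like. It is not the actual cleaning code: `stable_hash`, `make_features`, `FEATURE_ORDER`, the bucket count, and the gender encoding are all hypothetical names and choices made up for illustration. It assumes the model expects the "After" columns without the target (Status), in the order listed above, and that raw values go through the same transformations that were used on the training data.

```python
import hashlib
import numpy as np

# Hypothetical column order: assumed to be the "After" columns minus the
# target (Status), in the same order the model saw during training.
FEATURE_ORDER = ["City_hash", "State_hash", "City_State_hash", "Gender", "Salary"]


def stable_hash(text, buckets=1000):
    """Deterministic stand-in for whatever hash was used during cleaning.

    This is an assumption for illustration; the real input must be hashed
    with the exact function and parameters used on the training data.
    """
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def make_features(city, state, gender, salary):
    """Turn one raw record into a 1 x n array in FEATURE_ORDER."""
    row = {
        "City_hash": stable_hash(city),
        "State_hash": stable_hash(state),
        "City_State_hash": stable_hash(city + state),
        "Gender": 1 if gender == "M" else 0,  # hypothetical encoding
        "Salary": float(salary),
    }
    return np.array([[row[name] for name in FEATURE_ORDER]], dtype=float)


features = make_features("Los Angeles", "CA", "F", 55000.0)
print(features.shape)  # (1, 5)
```

In this sketch the raw string "CA" is never passed to the model directly; it is hashed first, and the resulting array would then be handed to `my_model.predict(features)`.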
Thanks for any info ...
If you are interested, the model I actually created and want to test is here: https://www.kaggle.com/josephramon/sba-xgboost-model