
I am a data science beginner, and I have now built a Python data model.

In the data cleaning part, I had to drop some columns, add new columns, hash some columns into new columns, and change some columns to numeric.

For example (not from any real project):

Before (Original Data columns): Name (str), City (str), State (str), Status (str), Gender (str), Salary (float)

After: City_hash (int), State_hash (int), City_State_hash (int) - a combined City + State, Status (int) - target variable, Gender (int), Salary (float)
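
To make it concrete, here is a rough sketch of the kind of cleaning I mean, using pandas and a made-up stable_hash helper (not my actual code):

import hashlib
import pandas as pd

def stable_hash(value):
    # deterministic hash of a string, truncated to a 32-bit int
    return int(hashlib.md5(str(value).encode("utf-8")).hexdigest(), 16) % (2 ** 32)

df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "City": ["Fresno", "Austin"],
    "State": ["CA", "TX"],
    "Status": ["1", "0"],
    "Gender": ["F", "M"],
    "Salary": [72000.0, 65000.0],
})

df["City_hash"] = df["City"].apply(stable_hash)
df["State_hash"] = df["State"].apply(stable_hash)
df["City_State_hash"] = (df["City"] + "_" + df["State"]).apply(stable_hash)
df["Status"] = pd.to_numeric(df["Status"])        # target variable
df["Gender"] = df["Gender"].map({"M": 0, "F": 1})
df = df.drop(columns=["Name", "City", "State"])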

The model's name is my_model. I want to test it now by passing a numpy array to the model.

The steps are below:

import numpy as np

features = np.array([[xx, xx, xx, xx, xx, ...]])  # where xx are the values to pass

# using the inputs to predict the output
prediction = my_model.predict(features)
print("Prediction: {}".format(prediction))

I just wanted to clarify what to put under features. Should the values be in the order of my "After" data, i.e. City_hash, State_hash, City_State_hash, etc.?

If yes, what about the hashed values? For example, the state of California is no longer 'CA' but a hashed value. Do I have to pass the hashed value?

Thanks for any info ...

The model I actually created and want to test, if interested: https://www.kaggle.com/josephramon/sba-xgboost-model

J R

1 Answer


OK, I think I answered my own question through more testing. I should pass the input data in the same column order as the training dataset, but with the target variable dropped.

For hashed columns, I needed to change my hash code to be reproducible in production. So if a user enters 'CA' for the state of California, my code should hash it exactly as it was hashed in the modeling data preparation.
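
Here is a rough sketch of what I mean, assuming the same made-up stable_hash helper from the data preparation step (Python's built-in hash() is salted per process, so it is not suitable for this):

import hashlib
import numpy as np

def stable_hash(value):
    # must be identical to the function used at data-prep time
    return int(hashlib.md5(str(value).encode("utf-8")).hexdigest(), 16) % (2 ** 32)

city, state, gender, salary = "Fresno", "CA", "F", 72000.0

# same column order as the training data, with the target (Status) dropped
features = np.array([[
    stable_hash(city),
    stable_hash(state),
    stable_hash(city + "_" + state),
    1 if gender == "F" else 0,   # same Gender encoding as in training
    salary,
]])

prediction = my_model.predict(features)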

The same goes for one-hot encoded columns and other encodings: I just have to be able to reproduce them exactly at prediction time.
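
For example, with scikit-learn (not necessarily what my notebook uses), the fitted encoder can be saved next to the model and reused at prediction time:

import joblib
from sklearn.preprocessing import OneHotEncoder

# at training time: fit the encoder on the training data and save it
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["CA"], ["TX"], ["NY"]])
joblib.dump(encoder, "state_encoder.joblib")

# at prediction time: load the same fitted encoder and transform the new input
encoder = joblib.load("state_encoder.joblib")
encoded = encoder.transform([["CA"]]).toarray()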

I also modified the model creation to split the data into three sets - train:validation:test - instead of just train:validation. I can now use the test data, which the model has never seen, for testing and evaluating the metrics.
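
A quick sketch of that split with scikit-learn's train_test_split (the ratios here are just an example):

import numpy as np
from sklearn.model_selection import train_test_split

# X and y stand in for the cleaned feature matrix and the Status target
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# hold out 20% as a test set, then split the rest 75/25 into train/validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)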

J R