1

Suppose I have a training data as below:

Age:12   Height:150   Weight:100     Gender:M
Age:15   Height:145   Weight:80      Gender:F
Age:17   Height:147   Weight:110     Gender:F
Age:11   Height:144   Weight:130     Gender:M

After I train my data and get the model, if I need to pass one test observation for prediction, do I need to send data with column names as below?

Age: 13   Height:142  Weight :90  

I some cases I have seen people sending test data in an array without the column names. I am not sure how algorithms work.

Note: I am using python scikit-learn and my training data is a dataFrame. So I am not sure whether my test data should also be in dataFrame format

R. Max
  • 6,624
  • 1
  • 27
  • 34
Gopal K
  • 11
  • 3

1 Answers1

0

Are you predicting gender?

If so, then yes. Your input are records with the columns: Age, Height and Weight.

Otherwise, you will be predicting on a record with a missing Gender value. You could get a KeyError if your model does not allow missing fields/columns.

I am not sure whether my test data should also be in dataFrame format

In short: yes.

Usually you do this:

# X is your input data, the format depends on how your model (pre)process the data.
# It could be a numeric matrix, a list of dict's, a list of strings, etc.
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Fit and validate.
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

So your train and test data are in the same format, or at least in compatible format (i.e.: a pandas dataframe is compatible with a list of dict's).

R. Max
  • 6,624
  • 1
  • 27
  • 34
  • Thanks Rolando.. Consider that I have trained the model with enough data using train_test_split or K-fold CV and done with the evaluation. Now I need to send only one test observation for prediction (say I am passing data from a front end tool), I need to find a way to put the input data from the user into corresponding column names as a dataFrame and pass it to the predict function right?? – Gopal K Jan 16 '16 at 15:43
  • @GopalK Without seeing your code is hard to tell whether you strictly need to pass a dataframe or not. But yes, the input for `predict` is an array of observations in the same format you passed to `fit`. – R. Max Jan 16 '16 at 19:49