I am given a test set without the response variable. I have already built the model and need to predict the response variable in the testing set.

I am having trouble formatting the test design matrix so that it is compatible with the model.

I am using the patsy library to construct the matrix.

I want to do something like this, except the code below does not work:

X = dmatrices('Response ~ var1 + var2', test, return_type = 'dataframe')

What is the right approach? Thanks.

anticavity123

1 Answer

If you used patsy to fit the model in the first place, then you should tell it "hey, you know how you built my first design matrix? build me another the same way":

from patsy import dmatrices, dmatrix

# Set up training data
train_Y, train_X = dmatrices("Response ~ ...", train, return_type="dataframe")
# Save patsy's record of how it built this matrix:
design_info = train_X.design_info
# Re-use it to build the test matrix (no response column is needed in 'test')
test_X = dmatrix(design_info, test, return_type="dataframe")

Alternatively, you could build a new matrix from scratch:

# Use 'dmatrix' and leave out the left-hand-side of the formula
test_X = dmatrix("~ ...", test, return_type="dataframe")

The first approach is better if you can do it. For example, suppose you have a categorical variable that you're letting patsy encode for you. And suppose that there are 10 categories that show up in your training set, but only 5 of them occur in your test set. If you use the first approach, then patsy will remember what the 10 categories were, and generate a test matrix with 10 columns (some of them all-zeros). If you use the second approach, then patsy will generate a training matrix with 10 columns and a test matrix with 5 columns, and then your model code is probably going to crash because the matrix isn't the shape it expects.
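To make the categorical case concrete, here is a small sketch. The data frames and the `group` column are made up for illustration (they are not from the question), and it assumes a patsy version in which `dmatrix` accepts a `DesignInfo` object, as in the snippet above. It uses 4 training categories and 2 test categories rather than 10 and 5, and patsy's default treatment coding, so the column counts include the intercept:

```python
import pandas as pd
from patsy import dmatrices, dmatrix

# Hypothetical data: 4 categories in training, only 2 of them in the test set
train = pd.DataFrame({
    "Response": [1.0, 2.0, 3.0, 4.0],
    "group": ["a", "b", "c", "d"],
})
test = pd.DataFrame({"group": ["a", "b"]})

train_Y, train_X = dmatrices("Response ~ group", train, return_type="dataframe")

# First approach: re-use the training design, so all training levels are kept
test_X_reused = dmatrix(train_X.design_info, test, return_type="dataframe")

# Second approach: rebuild from the formula, so only levels seen in 'test' get columns
test_X_fresh = dmatrix("~ group", test, return_type="dataframe")

print(list(train_X.columns))
print(list(test_X_reused.columns))
print(list(test_X_fresh.columns))
```

With the re-used design, the test matrix should have the same columns as the training matrix, with the columns for the unseen levels all zero; the rebuilt matrix has fewer columns and would not line up with the fitted coefficients.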

Another case where this matters is if you use patsy's center function to center a variable: with the first approach it will automatically remember what value it subtracted off from the training data and re-use it for the test data, which is what you want. With the second approach it will recompute the center using the test data, which can lead to you silently getting really really wrong results.
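The centering case can be sketched the same way. The numbers and the `var1` values below are invented for illustration, and the same assumption applies that `dmatrix` accepts the saved `design_info`:

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical data: the training mean of var1 is 20, the test mean is 105
train = pd.DataFrame({"var1": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"var1": [100.0, 110.0]})

train_X = dmatrix("~ center(var1)", train, return_type="dataframe")

# First approach: the training mean remembered by patsy is subtracted
test_X_reused = dmatrix(train_X.design_info, test, return_type="dataframe")

# Second approach: the mean is recomputed from the test data
test_X_fresh = dmatrix("~ center(var1)", test, return_type="dataframe")

print(test_X_reused["center(var1)"].tolist())
print(test_X_fresh["center(var1)"].tolist())
```

Neither call raises an error, but the two matrices contain different values for the same test rows, which is why the second approach can go wrong without any warning.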

Nathaniel J. Smith