
My professor wanted us to run 10-fold cross-validation on a data set, find the model with the lowest RMSE, and use its coefficients to build a function that takes in parameters and returns a predicted "Fitness Factor" score, which ranges between 25 and 75.

He encouraged us to try transforming the data, so I did. I used scale() on the entire data set to standardize it, then ran my regression and 10-fold cross-validation. I found the model I wanted and copied its coefficients over. The problem is that my function's predictions are WAY off when I put unstandardized parameters into it to predict a y.

Did I completely screw this up by standardizing the data to a mean of 0 and an sd of 1? Is there any way I can undo this mess if I did screw up?

My coefficients are extremely small numbers and I feel like I did something wrong here.

  • Maybe better off on https://stats.stackexchange.com/ – Roman Apr 23 '19 at 10:18
  • Obviously you need to inverse-transform your coefficients before applying them to the untransformed data (or you need to transform your data, apply the coefficients, and transform it back) – Konrad Rudolph Apr 23 '19 at 10:52
  • Have you transformed the test set based on the transformation coefficients computed on the train set? – Davide Visentin Apr 23 '19 at 16:14

1 Answer


Build a proper pipeline, not just a hack with some R functions.

The problem is that you treat scaling as part of loading the data, not as part of the prediction process.

The proper protocol is as follows:

  1. "Learn" the transformation parameters
  2. Transform the training data
  3. Train the model
  4. Transform the new data
  5. Predict the value
  6. Inverse-transform the predicted value

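A rough sketch of those six steps in R; the data frame and column names (dat, train, test, x1, x2, y) are made up for illustration. The key point is that the scaling parameters come from the training data only, and the final prediction is mapped back to the original scale:

    set.seed(42)
    dat <- data.frame(x1 = rnorm(100, 50, 10),
                      x2 = rnorm(100, 5, 2))
    dat$y <- 25 + 0.4 * dat$x1 + 2 * dat$x2 + rnorm(100)

    train <- dat[1:80, ]
    test  <- dat[81:100, ]

    ## 1. "Learn" the transformation parameters from the training data only
    mu  <- sapply(train, mean)
    sds <- sapply(train, sd)

    ## 2. Transform the training data
    train_s <- as.data.frame(scale(train, center = mu, scale = sds))

    ## 3. Train the model on the standardized data
    fit <- lm(y ~ x1 + x2, data = train_s)

    ## 4. Transform the new data with the parameters learned in step 1
    test_s <- as.data.frame(scale(test, center = mu, scale = sds))

    ## 5. Predict on the standardized scale
    pred_s <- predict(fit, newdata = test_s)

    ## 6. Inverse-transform the predicted values back to the original scale of y
    pred <- pred_s * sds["y"] + mu["y"]

    sqrt(mean((pred - test$y)^2))   # RMSE in the original units
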
During cross-validation, these steps need to be run separately for each fold, or you may overestimate your model's quality (i.e., overfit).
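
If you don't want to write the fold loop yourself, the caret package can (as far as I recall its interface) re-learn the centering and scaling inside each fold when you pass preProcess to train(). A sketch reusing the made-up dat from above:

    library(caret)

    ## caret re-estimates the center/scale parameters on each training fold,
    ## applies them to the held-out fold, and reports the resampled RMSE.
    ctrl   <- trainControl(method = "cv", number = 10)
    cv_fit <- train(y ~ x1 + x2, data = dat,
                    method     = "lm",
                    preProcess = c("center", "scale"),
                    trControl  = ctrl)
    cv_fit$results$RMSE   # cross-validated RMSE (y itself is not scaled here)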

Standardization is a linear transform, so the inverse is trivial to find.
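
As an illustration of that inverse (reusing fit, mu, sds, and train from the sketch above): the coefficients estimated on standardized data can be mapped back to the original units, which is what a hand-written prediction function for raw inputs needs:

    ## Coefficients on the standardized (z-score) scale
    b       <- coef(fit)
    x_names <- names(b)[-1]           # predictor names, here "x1" and "x2"

    ## Undo the linear transform: slope_j = b_j * sd(y) / sd(x_j)
    slopes_orig    <- b[x_names] * sds["y"] / sds[x_names]
    intercept_orig <- mu["y"] + sds["y"] * b["(Intercept)"] -
                      sum(slopes_orig * mu[x_names])

    ## Sanity check: should match a model fitted directly on the raw data
    coef(lm(y ~ x1 + x2, data = train))
    c(intercept_orig, slopes_orig)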

Has QUIT--Anony-Mousse