
My professor wanted us to run 10-fold cross-validation on a data set, find the model with the lowest RMSE, and use its coefficients to build a function that takes in parameters and returns a predicted "Fitness Factor" score, which ranges between 25 and 75.

He encouraged us to try transforming the data, so I did. I used scale() on the entire data set to standardize it, then ran my regression and 10-fold cross-validation. I found the model I wanted and copied its coefficients over. The problem is that my function's predictions are WAY off when I put unstandardized parameters into it to predict a y.

Did I completely screw this up by standardizing the data to a mean of 0 and an sd of 1? Is there any way I can undo this mess if I did screw up?

My coefficients are extremely small numbers and I feel like I did something wrong here.

  • Maybe better off on https://stats.stackexchange.com/ – Roman Apr 23 '19 at 10:18
  • Obviously you need to inverse-transform your coefficients before applying them to the untransformed data (or you need to transform your data, apply the coefficients, and transform it back) – Konrad Rudolph Apr 23 '19 at 10:52
  • Have you transformed the test set based on the transformation coefficients computed on the train set? – Davide Visentin Apr 23 '19 at 16:14

1 Answer


Build a proper pipeline, not just a hack with some R functions.

The problem is that you treat scaling as part of loading the data, not as part of the prediction process.

The proper protocol is as follows:

  1. "Learn" the transformation parameters
  2. Transform the training data
  3. Train the model
  4. Transform the new data
  5. Predict the value
  6. Inverse-transform the predicted value

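A rough sketch of those six steps in R; the data frame and column names (dat, train, test, x1, x2, y) are made up for illustration. The key point is that the scaling parameters come from the training data only, and the final prediction is mapped back to the original scale:

    set.seed(42)
    dat <- data.frame(x1 = rnorm(100, 50, 10),
                      x2 = rnorm(100, 5, 2))
    dat$y <- 25 + 0.4 * dat$x1 + 2 * dat$x2 + rnorm(100)

    train <- dat[1:80, ]
    test  <- dat[81:100, ]

    ## 1. "Learn" the transformation parameters from the training data only
    mu  <- sapply(train, mean)
    sds <- sapply(train, sd)

    ## 2. Transform the training data
    train_s <- as.data.frame(scale(train, center = mu, scale = sds))

    ## 3. Train the model on the standardized data
    fit <- lm(y ~ x1 + x2, data = train_s)

    ## 4. Transform the new data with the parameters learned in step 1
    test_s <- as.data.frame(scale(test, center = mu, scale = sds))

    ## 5. Predict on the standardized scale
    pred_s <- predict(fit, newdata = test_s)

    ## 6. Inverse-transform the predicted values back to the original scale of y
    pred <- pred_s * sds["y"] + mu["y"]

    sqrt(mean((pred - test$y)^2))   # RMSE in the original units
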
During cross-validation, these steps need to be run separately for each fold, or you may overestimate your model's quality (i.e., overfit).
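
If you don't want to write the fold loop yourself, the caret package can (as far as I recall its interface) re-learn the centering and scaling inside each fold when you pass preProcess to train(). A sketch reusing the made-up dat from above:

    library(caret)

    ## caret re-estimates the center/scale parameters on each training fold,
    ## applies them to the held-out fold, and reports the resampled RMSE.
    ctrl   <- trainControl(method = "cv", number = 10)
    cv_fit <- train(y ~ x1 + x2, data = dat,
                    method     = "lm",
                    preProcess = c("center", "scale"),
                    trControl  = ctrl)
    cv_fit$results$RMSE   # cross-validated RMSE (y itself is not scaled here)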

Standardization is a linear transform, so the inverse is trivial to find.
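
As an illustration of that inverse (reusing fit, mu, sds, and train from the sketch above): the coefficients estimated on standardized data can be mapped back to the original units, which is what a hand-written prediction function for raw inputs needs:

    ## Coefficients on the standardized (z-score) scale
    b       <- coef(fit)
    x_names <- names(b)[-1]           # predictor names, here "x1" and "x2"

    ## Undo the linear transform: slope_j = b_j * sd(y) / sd(x_j)
    slopes_orig    <- b[x_names] * sds["y"] / sds[x_names]
    intercept_orig <- mu["y"] + sds["y"] * b["(Intercept)"] -
                      sum(slopes_orig * mu[x_names])

    ## Sanity check: should match a model fitted directly on the raw data
    coef(lm(y ~ x1 + x2, data = train))
    c(intercept_orig, slopes_orig)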

Has QUIT--Anony-Mousse