
I'm a chemist, and about a year ago I decided to learn more about chemometrics.

I'm working on a problem that I don't know how to solve:

I performed an experimental design (Doehlert type with 3 factors), recording several analyte concentrations as Y. Then I performed a PCA on Y and used the scores on the first PC (87% of total variance) as the new y for a linear regression model, with my coded experimental settings as X.

Now I need to perform a leave-one-out cross validation: remove each object before performing the PCA on the new "training set", build the regression model on the scores as before, predict the score value for the left-out observation, and compute the prediction error by comparing the predicted score with the score obtained by projecting the held-out object into the space of that PCA. This is repeated n times (with n the number of points in my experimental design). I'd like to know how I can do this in R.
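In rough R code, this is the skeleton I have in mind (the data below are made-up placeholders just to fix notation; my real Y holds the analyte responses and X the coded settings):

```r
set.seed(1)
n <- 13                                      # runs of a 3-factor Doehlert design
X <- as.data.frame(matrix(runif(n * 3, -1, 1), ncol = 3))  # coded settings
Y <- matrix(rnorm(n * 5), ncol = 5)          # measured responses

for (i in seq_len(n)) {
  pca   <- prcomp(Y[-i, ])                   # PCA without object i
  train <- data.frame(score = pca$x[, 1], X[-i, ])
  fit   <- lm(score ~ ., data = train)       # regression of PC1 scores on X
  ## stuck here: how do I project Y[i, ] into this PCA space to get the
  ## "observed" score to compare with predict(fit, newdata = X[i, ])?
}
```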

Ndr
  • Hi Andrea, welcome to cross validated (and a special welcome to one more fellow chemist here!). However, your question is primarily about programming, not statistics (you already have your DoE and know that the cross validation needs to include the PCA). I'll therefore vote to migrate it over to stackoverflow, where programming questions are discussed. As you ask about R, I'll also add an R tag. – cbeleites unhappy with SX Feb 20 '13 at 20:23
  • Thanks @cbeleites but where can I find "stackoverflow"? I'm a noob! –  Feb 21 '13 at 00:09
  • @ndr: stackoverflow should have found you now... – cbeleites unhappy with SX Feb 21 '13 at 08:42

2 Answers


Do the calculations e.g. with prcomp and then lm. For that, you need to apply the PCA model returned by prcomp to new data. This takes two (or three) steps:

  1. Center the new data with the same center that was calculated by prcomp
  2. Scale the new data with the same scaling vector that was calculated by prcomp
  3. Apply the rotation calculated by prcomp

The first two steps are done by scale, using the $center and $scale elements of the prcomp object. You then matrix multiply your data by $rotation[, components.to.use].
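For example, a minimal sketch (Y.train and Y.test are placeholders for your training and held-out rows, and I assume prcomp was called with scale. = TRUE so that $scale is a vector; with the default scale. = FALSE, skip the scaling step):

```r
pca <- prcomp(Y.train, scale. = TRUE)    # PCA on the training data

## steps 1-3 applied to held-out data (keep Y.test a matrix,
## e.g. Y[i, , drop = FALSE] for a single left-out row):
new.scaled <- scale(Y.test, center = pca$center, scale = pca$scale)
new.scores <- new.scaled %*% pca$rotation[, 1, drop = FALSE]   # PC1 scores
```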

You can easily check your reconstruction of the PCA score calculation by computing scores for the data you input to prcomp and comparing the results with the $x element of the PCA model returned by prcomp.
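For instance, with the objects from the sketch above:

```r
## recomputing the training scores "by hand" should reproduce pca$x:
rec <- scale(Y.train, center = pca$center, scale = pca$scale) %*% pca$rotation
all.equal(unname(rec), unname(pca$x))    # should be TRUE
```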

Edit in the light of the comment:

If the purpose of the CV is to calculate some kind of error, then you can choose between calculating the error of the predicted scores y (which is how I understand you) and calculating the error of the Y: the PCA also lets you go backwards and predict the original variates from the scores. This is easy because the loadings ($rotation) are orthogonal, so the inverse is just the transpose.

Thus, the prediction in the original Y space is scores %*% t(pca$rotation), which can be computed faster with tcrossprod(scores, pca$rotation).
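Continuing the sketch above (if only PC1 was kept, subset the rotation matrix the same way, and remember to undo the centering/scaling that prcomp applied):

```r
## rank-1 reconstruction of Y from the PC1 scores:
Y.hat <- tcrossprod(new.scores, pca$rotation[, 1, drop = FALSE])
Y.hat <- sweep(Y.hat, 2, pca$scale, "*")   # undo scaling (scale. = TRUE case)
Y.hat <- sweep(Y.hat, 2, pca$center, "+")  # undo centering
```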

cbeleites unhappy with SX
  • Well, honestly I think that what I did is pretty different from PCR. In PCR I use (some) scores as **X** to fit and predict a **y**, while what I did was to use the PCA scores on the first PC as **y** and my points from the experimental design as **X**. –  Feb 21 '13 at 00:06
  • Sorry, I overlooked that. So what you do is somewhere between an (inverse) PCR and halfway to PLS? May I ask why you do this (pure curiosity)? In any case, the first method allows you to do whatever you like with the PCA scores. – cbeleites unhappy with SX Feb 21 '13 at 08:46
  • I performed a PCA because my original **Y** is a matrix of _n_ experiments and _m_ variables, so I had just two possibilities: fit and characterize _m_ different models, one for each single **y_i**, or try to reduce the dimension of my **Y** data. As my _m_ variables are highly correlated, I used the **Y** scores on PC1 as my new **y**, say **y_n**. Then I fitted just one model: **y_n ~ X** (**X** being, as I said, the matrix of my coded experimental conditions). Now what I need is an estimate, obtained by cross validation, of the prediction error of my model. – Ndr Feb 21 '13 at 17:02
  • @Ndr: the multiple **y** ~ **X** models can be computed directly by one least squares fit of **Y** ~ **X**. A 3rd possibility is PLS, which regularizes the **Y** ~ **X** fit by doing a PCA-like decomposition on both **X** and **Y** simultaneously. But why are your **Y** highly correlated? Wouldn't the Doehlert DoE generate low-correlation experiment matrices? – cbeleites unhappy with SX Feb 21 '13 at 18:27
  • the matrix of the experiments, **X**, has orthogonal and thus uncorrelated columns. The correlation is between the columns of my response matrix **Y**. Every column of my **Y** is the instrumental response for a different analyte, and by fitting it with my **X** I found the experimental settings that maximize the responses. I think that MLR works fine just for **y~X**, not **Y~X**. I've never heard of PLS on an orthogonal DoE, but I think it could work, although the result would be less immediate to interpret. – Ndr Feb 21 '13 at 19:10

There is also the R package pls (Partial Least Squares), which has tools for PCR (Principal Component Regression).
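A minimal sketch with made-up data (note that pcr() does ordinary PCR, i.e. the PCA is on X rather than on Y, so this only matches part of what the question asks):

```r
library(pls)   # install.packages("pls") if needed

set.seed(1)
dat <- data.frame(y = rnorm(13),
                  X = I(matrix(rnorm(13 * 5), ncol = 5)))  # made-up data

## PCR with built-in leave-one-out cross validation:
fit <- pcr(y ~ X, ncomp = 3, data = dat, validation = "LOO")
RMSEP(fit)   # cross-validated prediction error for each number of components
```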

Stepan S. Sushko