1

I have a 63*62 training set that includes the class labels. The test data is 25*62 and includes the class labels too. Given this, how would I perform least squares regression? I am using the code:

res = lm(height~age)

What do height and age correspond to? When I have 61 features + 1 class (making 62 columns in the training data), how would I pass them as parameters?

Also, how do I apply the model to the testing data?

user1403848

2 Answers

2

If you have 62 columns you may want to use the more general formula

res = lm(height ~ . , data = mydata)

Notice how the period '.' stands for all the remaining variables in mydata. But the other answer is completely right: with 61 features plus an intercept there are almost as many parameters as observations, so whatever fit lm returns is essentially useless for prediction.
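
To illustrate the '.' shorthand, here is a minimal sketch on made-up data. The data frame `train`, the response column `y`, and the three predictors are hypothetical stand-ins for the real 62-column set.

```r
# Hypothetical data: 63 rows, three predictors and one response column `y`.
set.seed(1)
train <- data.frame(x1 = rnorm(63), x2 = rnorm(63), x3 = rnorm(63))
train$y <- 1 + 2 * train$x1 - train$x3 + rnorm(63, sd = 0.1)

# `y ~ .` regresses y on every other column of `train`.
fit_dot <- lm(y ~ ., data = train)

# The same model with the predictors spelled out.
fit_explicit <- lm(y ~ x1 + x2 + x3, data = train)

# Both formulas yield the same coefficients (intercept first).
same_coefs <- isTRUE(all.equal(coef(fit_dot), coef(fit_explicit)))
print(same_coefs)
```

With 61 features the shorthand simply saves typing out 61 column names; the fitted model is the same either way.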

Wilmer E. Henao
  • Thanks for the reply!! Now I have the best fit using: abline(res). How will I use this model on the testing data? – user1403848 Apr 10 '13 at 00:41
  • I'm not sure what you mean. abline(res) should plot a line of best fit. This only works in two dimensions as far as I know. If you want to extract the coefficients and multiply them with the test data, you can use coefs <- coefficients(res), but be careful because the first one will be the intercept. However, almost everything you want for a basic regression will be displayed if you try: summary(res) – Wilmer E. Henao Apr 10 '13 at 01:00
  • Sorry, what I meant to ask is whether the model fitted on the training data can be used on the testing data. – user1403848 Apr 10 '13 at 01:08
  • Hmm. I think you may be looking for the function predict. In this case you would call something like predict(res, newdata = yourdata[64:XX,]). predict is part of the stats package, which is loaded by default, so no library() call is needed. – Wilmer E. Henao Apr 10 '13 at 01:19
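
The predict approach from the comments above could look roughly like this. It is only a sketch: the helper `make_data` and the columns `age` and `weight` are invented for illustration, and the real test set would take the place of `test`.

```r
set.seed(42)

# Hypothetical generator standing in for the real training and test sets.
make_data <- function(n) {
  d <- data.frame(age = runif(n, 5, 18), weight = runif(n, 20, 70))
  d$height <- 80 + 5 * d$age + 0.3 * d$weight + rnorm(n, sd = 2)
  d
}
train <- make_data(63)
test  <- make_data(25)

# Fit on the training rows only.
res <- lm(height ~ ., data = train)

# newdata must contain columns named like the predictors used in the fit.
pred <- predict(res, newdata = test)

# Root mean squared error on the held-out rows.
rmse <- sqrt(mean((test$height - pred)^2))

# The same predictions by hand: a column of ones for the intercept,
# then the predictors in the order the coefficients are reported.
X <- cbind(1, as.matrix(test[, c("age", "weight")]))
manual <- as.vector(X %*% coef(res))
print(all.equal(unname(pred), manual))
```

The manual version shows why the intercept matters when multiplying coefficients by hand; predict takes care of that automatically.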
1

height and age are simply the names of columns in your data frame. height is the predicted (response) variable, and age is a predictor. You can put as many predictors on the right-hand side as you wish: res = lm(height~age+weight+gender)

However, I must say that the question seems a bit strange to me: if you fit a regression with 62 variables on roughly as many training points, you will get an (almost) exact fit to the training data, which says nothing about how well the model predicts new data. The training set should always be (significantly) larger than the number of variables used.
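
A small simulation can make this concrete. The sizes below are hypothetical (20 rows rather than 63), and the response is pure noise, so there is nothing real to learn.

```r
set.seed(7)
n <- 20        # hypothetical number of training rows
p <- n - 1     # predictors; together with the intercept that is n parameters

# Response is pure noise, unrelated to the predictors X1..X19.
train <- data.frame(y = rnorm(n), matrix(rnorm(n * p), n, p))

fit <- lm(y ~ ., data = train)

# Largest absolute training residual: zero up to rounding error,
# because n parameters can interpolate n points.
max_resid <- max(abs(residuals(fit)))

# Fresh data drawn the same way: the "perfect" model predicts poorly.
test <- data.frame(y = rnorm(n), matrix(rnorm(n * p), n, p))
test_rmse <- sqrt(mean((test$y - predict(fit, newdata = test))^2))

print(c(max_resid = max_resid, test_rmse = test_rmse))
```

The training residuals vanish even though y has no relationship to the predictors, which is why a perfect training fit in this regime is not evidence of a good model.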

sashkello