Hi all,

This is my first question on this forum. I'm a beginner, as you will all immediately tell.

I'm doing a small task in which I must compare a training model with a test model. The catch is that the training model has many more rows than the test one.

I want to compare the two using a linear regression model, but when I call the predict() function I get the following message:

"newdata" had 3456 rows but variables found have 7689 rows.

This is what I did:

regression = lm(train$students~train$subjects, train)

(train is the training database)

prediction = predict(regression, test) 

(test is the testing database)

I don't know if I'm using the predict() function correctly. Could someone tell me what I did wrong?

Thank you so much in advance for your help and kindness!

neilfws
albert

1 Answer

Don't refer to variables as data$var in a formula. Never. Ever.

What is happening is that you fitted a model with variables named train$students and train$subjects. To predict from the model, R will look for a variable named train$subjects in the test set, test. Clearly no such variable exists in test; who'd create variables with such silly names!? There is no need to use the data$var format in a formula, because the whole point of the data argument is to tell R where it should look up the names of the variables mentioned in the formula.
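
The behaviour described above can be reproduced on a small scale. This is only a sketch: the data are simulated, and the names bad_fit and bad_pred are made up for illustration; only lm() and predict() come from the question.

```r
# Hypothetical toy data standing in for the asker's train and test sets.
set.seed(1)
train <- data.frame(subjects = 1:20)
train$students <- 3 * train$subjects + rnorm(20)
test <- data.frame(subjects = 21:25)

# Problematic fit: the model terms are literally `train$students`
# and `train$subjects`, not `students` and `subjects`.
bad_fit <- lm(train$students ~ train$subjects, data = train)
names(coef(bad_fit))  # the slope is labelled "train$subjects"

# predict() cannot find an object called `train` inside `test`, so the
# term is evaluated against the `train` object in the environment where
# the formula was created, and the rows of `test` are ignored.
# (suppressWarnings() hides the "'newdata' had ... rows" warning here.)
bad_pred <- suppressWarnings(predict(bad_fit, newdata = test))

length(bad_pred)  # equals nrow(train), not nrow(test)
```

In other words, the values returned are just fitted values for the training rows, which is why the row counts in the message disagree.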

To start fixing this, fit your model as:

regression <- lm(students ~ subjects, data = train)

then predict using

predict(regression, test)

where test must have a column named subjects.
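
Put together, the corrected workflow might look like the sketch below. The data are simulated stand-ins for the real train and test sets, and the column check is an optional extra that is not part of the original code.

```r
# Hypothetical toy data standing in for the asker's train and test sets.
set.seed(1)
train <- data.frame(subjects = 1:20)
train$students <- 3 * train$subjects + rnorm(20)
test <- data.frame(subjects = 21:25)

# Bare column names in the formula; `data` says where to find them.
regression <- lm(students ~ subjects, data = train)

# Optional sanity check: the predictor column must exist in `test`.
stopifnot("subjects" %in% names(test))

# Now predict() looks up `subjects` in `test`, one prediction per row.
prediction <- predict(regression, newdata = test)

length(prediction)  # equals nrow(test)
```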

The message arises because newdata has 3456 rows, but when R searched for the variable train$subjects it found 7689 rows; presumably that is the number of complete observations in train...?
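
To check that guess on your own data, you could compare the row count of train with the number of observations the fit actually used. A sketch with simulated data containing missing values (the name fit and the data are hypothetical); whichever number matches the one in the message tells you what R found.

```r
# Hypothetical training data with two missing predictor values.
set.seed(1)
train <- data.frame(subjects = c(1:18, NA, NA))
train$students <- 3 * c(1:18, 19, 20) + rnorm(20)

fit <- lm(students ~ subjects, data = train)

nrow(train)                  # all rows, including those with NA
sum(complete.cases(train))   # rows with no missing values
nobs(fit)                    # rows lm() actually used after dropping NAs
```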

Gavin Simpson