
I am trying to predict test results based on known previous scores. The test is made up of three subjects, each contributing to the final exam score. For all students I have their previous scores for mini-tests in each of the three subjects, and I know which teacher they had. For half of the students (the training set) I have their final score; for the other half (the test set) I don't. I want to predict their final score.

So the training set looks like this:

student teacher subject1score subject2score subject3score finalscore

while the test set is the same but without the final score

student teacher subject1score subject2score subject3score 

So I want to predict the final score of the test set students. Any ideas for a simple learning algorithm or statistical technique to use?

Harry Palmer
  • Any predictor based solely on past scores is not going to be accurate because it doesn't take into account whether the students studied differently for the upcoming test or not, and whether the teacher prepared them differently. There are many other variables also. But if you're just wanting to find some mathematical series/sequence to the scores that's a different question. Is that what you're asking for? – Jonathan M Apr 17 '12 at 15:18
  • I'm not so worried about accuracy, it's more about the logic - finding a good technique for this class of problem. I think the key issue is modelling what effect each teacher has on the students, for each of the three subjects. Any ideas? – Harry Palmer Apr 17 '12 at 15:30
  • @David Robinson: Your answer is more appropriate in the context. So I give you a +1 and exit :) [Deleted my answer] – Yavar Apr 17 '12 at 16:54

1 Answer


The simplest and most reasonable method to try is a linear regression, with the teacher and the three scores used as predictors. (This assumes that the teacher and the three test scores each have some predictive ability towards the final exam, but that they could contribute differently; for example, the third test might matter the most.)

You don't mention a specific language, but let's say you loaded the data into R as two data frames called `training.scores` and `test.scores`. Fitting the model would be as simple as using `lm`:

lm.fit = lm(finalscore ~ teacher + subject1score + subject2score + subject3score, training.scores)

And then the prediction would be done as:

predicted.scores = predict(lm.fit, test.scores)
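As a self-contained illustration of the two calls above, here is a minimal sketch on made-up data. The simulated scores, the teacher labels "A", "B" and "C", and the coefficients used to generate `finalscore` are all assumptions for demonstration, not values from the question.

```r
set.seed(42)

# Hypothetical data: 200 students, three teachers, three mini-test scores.
n <- 200
scores <- data.frame(
  student       = seq_len(n),
  teacher       = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  subject1score = round(runif(n, 40, 100)),
  subject2score = round(runif(n, 40, 100)),
  subject3score = round(runif(n, 40, 100))
)

# Assumed relationship, used only to simulate a final score.
teacher.effect <- unname(c(A = 0, B = 3, C = -2)[as.character(scores$teacher)])
scores$finalscore <- 5 + teacher.effect +
  0.2 * scores$subject1score +
  0.3 * scores$subject2score +
  0.4 * scores$subject3score +
  rnorm(n, sd = 4)

# Half the students form the training set; the rest lose their final score.
train.idx       <- sample(n, n / 2)
training.scores <- scores[train.idx, ]
test.scores     <- scores[-train.idx, names(scores) != "finalscore"]

# Fit on the training set, then predict the held-back students.
lm.fit <- lm(finalscore ~ teacher + subject1score + subject2score + subject3score,
             data = training.scores)
predicted.scores <- predict(lm.fit, newdata = test.scores)

head(predicted.scores)
```

Because `teacher` is stored as a factor in this sketch, `lm` estimates a separate offset for each teacher rather than a single slope.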

Googling for "R linear regression", "R linear models", or similar searches will find many resources that can help. You can also learn about slightly more sophisticated methods such as generalized linear models or generalized additive models, which are almost as easy to perform as the above.

ETA: Whole books have been written about interpreting linear regression; an example simple guide is here. In general, you'll call summary(lm.fit) to print a bunch of information about the fit. The output includes a table of coefficients that will look something like:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -14.4511     7.0938  -2.037 0.057516 .  
setting       0.2706     0.1079   2.507 0.022629 *  
effort        0.9677     0.2250   4.301 0.000484 ***

The Estimate will give you an idea how strong the effect of that variable was, while the p-values (Pr(>|t|)) give you an idea whether each variable actually helped or whether its apparent effect was due to random noise. There's a lot more to it, but I invite you to read the excellent resources available online.
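If you would rather work with those numbers than read them off the printout, the coefficient table can be pulled out as a matrix. The sketch below uses simulated data in which, by assumption, only `subject3score` truly affects the final score; the data and generating coefficients are hypothetical.

```r
set.seed(1)

# Hypothetical training data; the generating coefficients are assumptions.
n <- 100
training.scores <- data.frame(
  teacher       = factor(sample(c("A", "B"), n, replace = TRUE)),
  subject1score = runif(n, 40, 100),
  subject2score = runif(n, 40, 100),
  subject3score = runif(n, 40, 100)
)
# Only subject 3 contributes to the simulated final score; the rest is noise.
training.scores$finalscore <- 10 +
  0.5 * training.scores$subject3score +
  rnorm(n, sd = 5)

lm.fit <- lm(finalscore ~ teacher + subject1score + subject2score + subject3score,
             data = training.scores)

# coef(summary(...)) returns the coefficient table as a numeric matrix.
coef.table <- coef(summary(lm.fit))
estimates  <- coef.table[, "Estimate"]
p.values   <- coef.table[, "Pr(>|t|)"]

# Terms whose p-value falls below the conventional 0.05 cutoff.
names(p.values)[p.values < 0.05]
```

Note that the 0.05 cutoff is only a convention, and with several predictors a noise variable will occasionally fall below it by chance.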

Also, plot(lm.fit) will draw graphs of the residuals (a residual is the amount by which each fitted value is off for a student in your training set), which you can use to determine whether the assumptions of the model are fair.
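To make the notion of a residual concrete, the following sketch computes them by hand and compares with what R reports. It uses a single hypothetical predictor and simulated scores, purely for brevity.

```r
set.seed(7)

# Hypothetical training data with a single predictor, for brevity.
training.scores <- data.frame(subject1score = runif(80, 40, 100))
training.scores$finalscore <- 20 + 0.6 * training.scores$subject1score +
  rnorm(80, sd = 5)

lm.fit <- lm(finalscore ~ subject1score, data = training.scores)

# A residual is the observed value minus the fitted value, per training row.
res    <- residuals(lm.fit)
manual <- training.scores$finalscore - fitted(lm.fit)

# Both routes give the same numbers.
all.equal(unname(res), unname(manual))

# With an intercept in the model, least-squares residuals average to zero.
mean(res)
```

Calling `plot(lm.fit, which = 1)` draws these residuals against the fitted values; a visible curve or funnel shape in that plot suggests the linearity or constant-variance assumption may not hold.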

David Robinson
  • Thanks this will help a lot - I'll try linear regression and linear models. Just so I can understand the problem better, do you know of any alternative approaches I could also read up on? – Harry Palmer Apr 17 '12 at 16:30
  • You could also try nearest-neighbor or n-nearest neighbor. This would consist of (for example) finding a student who had the same teacher and relatively similar test grades, and looking at his score (n-nearest neighbor would find the n closest students and take the average). However, I think you'll have more luck with linear methods and other models in [regression analysis](http://en.wikipedia.org/wiki/Regression_analysis). Generalized additive models and nonparametric regression would be among the most complicated ones that are worth trying. – David Robinson Apr 17 '12 at 16:48
  • Just realised that (unless I've misunderstood how this works) the linear model must be treating the teacher ID column as numeric data - not as categorical data. Is that right? Does it make any sense to do it like that? If not, how could I incorporate the teacher ID into the linear model as a category? – Harry Palmer Apr 21 '12 at 12:46
  • You hadn't said that the teacher column was an ID. In that case (assuming you are using `R`), turn the teacher column into a factor. One simple way to do that is `training.data$teacher = as.factor(training.data$teacher)`. It can also be done using the `colClasses` argument of `read.table`. If you need help, show me the code you use to read in the data. – David Robinson Apr 21 '12 at 13:14
  • Thanks David for rescuing me again! I read up a bit on 'dummy coding' categorical data so have some vague idea of how this factor method works now. One final question - what's the best way of assessing (1) the accuracy of the whole linear model as a predictor? And (2) the value of each contributing variable towards this accuracy? – Harry Palmer Apr 21 '12 at 13:50
  • You can also look into other methods for linear regression. A simple search such as "linear regression using different techniques" would give you many tutorials! All the best – Muaaz salagar Apr 19 '17 at 18:29
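The numeric-versus-factor point raised in the comments can be sketched as follows. The teacher IDs 1 to 3, the simulated scores, and the per-teacher effects are assumptions chosen so that the effect is not monotone in the ID, which is the case where a numeric coding is most misleading.

```r
set.seed(3)

# Hypothetical data: teacher stored as a numeric ID (1, 2, or 3).
n <- 150
training.scores <- data.frame(
  teacher       = sample(1:3, n, replace = TRUE),
  subject1score = runif(n, 40, 100)
)
# Assumed teacher effects that are not monotone in the ID.
effect <- c(0, 8, -5)[training.scores$teacher]
training.scores$finalscore <- 30 + effect +
  0.5 * training.scores$subject1score + rnorm(n, sd = 3)

# Numeric ID: lm fits a single slope for "teacher".
fit.numeric <- lm(finalscore ~ teacher + subject1score, data = training.scores)

# Factor: lm dummy-codes the column, one coefficient per non-baseline teacher.
training.scores$teacher <- as.factor(training.scores$teacher)
fit.factor <- lm(finalscore ~ teacher + subject1score, data = training.scores)

names(coef(fit.numeric))  # "(Intercept)" "teacher" "subject1score"
names(coef(fit.factor))   # "(Intercept)" "teacher2" "teacher3" "subject1score"

# The dummy columns lm builds internally can be inspected directly.
head(model.matrix(fit.factor))
```

In the factor fit, teacher 1 is the baseline, and the `teacher2` and `teacher3` coefficients estimate how much each of those teachers shifts the final score relative to it.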