
I am researching whether or not it is possible to automate the scoring of students' code based on coding style. This covers things like duplicate code, commented-out code, poorly named variables, and more.

We are trying to learn from past semesters' composition scores (ranging from 1-3), which lends itself nicely to supervised learning. The basic idea is that we extract features from a student's submission, build a feature vector, and run it through logistic regression using scikit-learn. We have also tried various things, including running PCA on the feature vectors to reduce dimensionality.
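For concreteness, the pipeline described above can be sketched roughly like this (the feature matrix and labels here are synthetic stand-ins for the real extracted features; using scikit-learn's `Pipeline` ensures PCA is fit only on training data):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Hypothetical feature matrix: one row per submission, one column per
# style feature (duplicate lines, average function length, ...).
rng = np.random.RandomState(0)
X = rng.rand(90, 8)
y = rng.choice([1, 2, 3], size=90)   # past composition scores

clf = make_pipeline(
    StandardScaler(),                # style counts live on very different scales
    PCA(n_components=5),             # optional dimensionality reduction
    LogisticRegression(max_iter=1000),
)
clf.fit(X, y)
print(clf.predict(X[:3]))
```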

Our classifier simply guesses the most frequent class, which is a score of 2. I believe it's because our features are simply NOT predictive in any way. Is there any other possible reason for a supervised learning algorithm to only guess the dominant class? Is there any way to prevent this?

As I believe it's due to the features not being predictive, is there a way to determine what a "good" feature would be? (And by good, I mean discriminative or predictive.)
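One quick way to gauge whether a feature is predictive at all is to score each column against the labels, e.g. with mutual information; a sketch on synthetic data (the feature names are hypothetical):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.RandomState(0)
X = rng.rand(120, 4)
y = rng.choice([1, 2, 3], size=120)
X[:, 0] += y * 0.5   # make the first column genuinely informative

# Higher score = the feature carries more information about the label.
scores = mutual_info_classif(X, y, random_state=0)
for name, s in zip(["dup_lines", "avg_fn_len", "one_char_vars", "max_line_len"], scores):
    print(f"{name}: {s:.3f}")
```

In this toy example `dup_lines` should come out on top, because it is the only column that actually depends on the label.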

Note: As a side experiment, we tested how consistent the past grades were by having readers re-grade assignments that had already been graded. Only 55% of them gave the same composition score (1-3) for the projects. This might mean the dataset is simply not classifiable because humans can't even grade consistently. Any tips on other ideas? Or is that in fact the case?
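Worth noting: raw percent agreement (55%) overstates reliability, since with three classes two readers will sometimes agree by chance. Cohen's kappa corrects for that; a sketch with made-up grades:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical: original grades vs. re-grades by a second reader.
original = [2, 2, 1, 3, 2, 2, 1, 3, 2, 2]
regrade  = [2, 1, 1, 3, 2, 3, 2, 3, 2, 2]

# kappa = 1 is perfect agreement, 0 is chance level, < 0 worse than chance.
print(cohen_kappa_score(original, regrade))
```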

Features include: number of lines of duplicate code, average function length, number of one-character variable names, number of lines containing commented-out code, maximum line length, count of unused imports, unused variables, unused parameters, and a few more. We visualized all of our features and found that while the average is correlated with the score, the variance is really large (not promising).

Edit: Scope of our project: we are only trying to learn from one particular project (with skeleton code given) in one class. We don't need to generalize as of yet.

stogers
    +1. Wow, what a question! – Yavar Nov 18 '13 at 06:36
  • However the answer will be more driven by Statistics here and not Computer Sciences. – Yavar Nov 18 '13 at 06:38
  • Included "statistics" as a tag. Thanks! – stogers Nov 18 '13 at 06:40
  • Please clarify your note at the end of the Q: "tested how consistent the past grades were and determined that they weren't at all". Consistent according to what? How did you test? – dan3 Nov 18 '13 at 13:15
  • This sounds more like linear regression (numeric prediction) than logistic regression (a classification task). With linear regression you will get numbers like 1.2, 1.8, 1.5, ... instead of simply the label "2", which may give you some insights. Also note that a linear model (in both linear and logistic regression) may just be a bad way to represent the relations between variables. So you can also try other approaches, like splitting data with hyperplanes (SVM, possibly with non-linear kernels) or counting probabilities (e.g. Naive Bayes). BTW, what features do you use (some examples would be helpful)? – ffriend Nov 18 '13 at 22:36
  • I'll edit the post to include some of the features as you suggested. – stogers Nov 18 '13 at 23:49
  • As for what type of regression, I was under the impression that logistic regression, while good for classification tasks, can also be used as a numeric prediction similar to what linear regression does. Since it provides a probability of it being that particular class, it becomes a continuous output. I think it's the better algorithm for this case, since we can't assume there are linear relationships among the features. – stogers Nov 18 '13 at 23:57
  • Logistic regression differs from linear regression only in final sigmoid, but essentially is the same linear model. So, using logistic regression you assume that a score of a student's code _linearly depends_ on each of your features. I would proceed with looking at `coef_` attribute of your model (for logistic regression it shows variable coefficients for each class) and checking if it makes any sense at all. – ffriend Nov 19 '13 at 05:54
  • Treating this as a regression problem gives you a relationship between the *classes*. That is probably appropriate as 1 is closer to 2 than it is to 3. Logistic regression is a classification method (probability of class membership is the regression output), which cannot encode relationship between classes. And you can make either non-linear by adding cross-products as features – Ben Allison Nov 19 '13 at 09:33
  • Also for what it's worth, a useful diagnostic on problems like this is to throw out the middle class and see if you can separate the 1s from 3s reliably. – Ben Allison Nov 19 '13 at 09:34
  • We did in fact try to separate 1s and 3s, and it was still only guessing the more frequent class of 3 in most cases. We will definitely look into linear regression as well. – stogers Nov 19 '13 at 20:57

3 Answers


Features include: number of lines of duplicate code, average function length, number of one-character variable names, number of lines containing commented-out code, maximum line length, count of unused imports, unused variables, unused parameters. A few more...

Have you tried normalizing the features? It seems that you want to train a classifier that can place any given code sample into a category. Two submissions may have, say, different numbers of lines of duplicate code and different numbers of unused variables, yet be equally bad, simply because they differ in length. For this reason, you need to normalize your features by, say, the total lines of 'useful' code.

Failing to find good features is daunting. When stuck, follow your intuition: if a human can do a task, so can a computer. Since your features look quite reasonable for assessing code, they ought to work (provided they are used properly).

Summary: Normalization of features should solve the problem.
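A minimal sketch of the normalization suggested above, assuming a hypothetical dict of raw counts for one submission and dividing by its lines of useful code:

```python
# Hypothetical raw counts extracted from one submission.
raw = {"dup_lines": 12, "commented_out_lines": 4, "one_char_vars": 3}
useful_loc = 240   # total lines of 'useful' code in the submission

# Express each count as a rate per useful line, so long and short
# submissions become comparable.
normalized = {k: v / useful_loc for k, v in raw.items()}
print(normalized)   # e.g. dup_lines -> 0.05
```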

hrs
  • If I understand what you are saying, I don't think we are actually having the problem. At least for the scope of our project, we are only trying to learn from one particular project in one class. Since ours is so specific, we won't need to normalize by something like total number of lines in the project. (Sorry I didn't mention that in the question). – stogers Nov 19 '13 at 21:03
  • Okay. In that case, it looks like the given problem is difficult to solve with a linear model. Have you tried any non-linear model, say an SVM with a non-linear kernel, or a GMM? – hrs Nov 20 '13 at 04:46

Just a thought - Andrew Ng teaches a Machine Learning course on Coursera (https://www.coursera.org/course/ml). There are several programming assignments that students submit throughout the class. I remember reading (though unfortunately I can't find the article now) that there was some ongoing research that was attempting to cluster student submitted programming assignments from the class, with the intuition that there are common mistakes that students make on the assignments.

Not sure if this helps you, but perhaps treating this as an unsupervised learning problem might make more sense (e.g., just looking for similarities in different code samples with the intuition that the code samples that are similar should receive a similar score).
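A minimal sketch of that unsupervised variant, assuming the same kind of hypothetical style-feature vectors: cluster the submissions with k-means, then cross-tabulate the clusters against the human grades to see whether they line up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(90, 8)   # hypothetical style-feature vectors

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(StandardScaler().fit_transform(X))

# If clustering captures quality, each cluster should be dominated by
# one composition score when compared against the existing grades.
print(np.bincount(labels))
```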

mattnedrich
  • Thanks! Definitely considering this path. I believe the following is the paper you are referring to? http://people.csail.mit.edu/zp/moocshop2013/paper_16.pdf My professor also pointed us in that direction, so I think this might be our next approach! – stogers Nov 20 '13 at 21:58
  1. You want to balance your target classes (a close-to-equal number of 1, 2 and 3 scores). You can randomly down-sample over-sized classes, bootstrap-sample under-sized classes, or use an algorithm that accounts for unbalanced data (I'm not sure which ones in Python do).

  2. Make sure you are cross-validating to prevent over-fitting.

  3. There are a few ways to figure out which attributes are important:

    • try subsets greedily, starting from a single attribute and adding one at a time (forward selection)
    • or start with all attributes and remove them one at a time (backward elimination)
    • or try attribute combinations at random (or with a genetic algorithm)

Choose the attribute combo with the highest cross-validated accuracy.
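The three steps above can be combined in a short sketch (synthetic data; `class_weight='balanced'` in scikit-learn handles the skew toward score 2, and `cross_val_score` guards against over-fitting — with only a handful of features, exhaustive subset search is cheap):

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(120, 4)
y = rng.choice([1, 2, 3], size=120, p=[0.2, 0.6, 0.2])  # skewed toward 2

best_combo, best_acc = None, -1.0
for r in range(1, X.shape[1] + 1):
    for combo in combinations(range(X.shape[1]), r):
        clf = LogisticRegression(class_weight="balanced", max_iter=1000)
        acc = cross_val_score(clf, X[:, combo], y, cv=5).mean()
        if acc > best_acc:
            best_combo, best_acc = combo, acc

print(best_combo, round(best_acc, 3))
```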

You can also take the product of pairs of attribute columns (interaction terms) to see whether attributes have an effect together.
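Such pairwise products can be generated mechanically with scikit-learn's `PolynomialFeatures` (`interaction_only=True` keeps only the cross-products, not the squares):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0, 5.0]])   # one hypothetical feature vector

# degree=2, interaction_only=True -> original columns plus all
# pairwise products x_i * x_j (no squares, no bias column).
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(poly.fit_transform(X))      # [[ 2.  3.  5.  6. 10. 15.]]
```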

Neil McGuigan