
I am fairly new to Machine Learning and have recently been working on a classification problem, linked below. Since cars interest me, I decided to go with a dataset that deals with the classification of cars based on several attributes.

http://archive.ics.uci.edu/ml/datasets/Car+Evaluation

Now, I understand that there might be a number of ways to go about this particular case, but the real question is: which algorithm is likely to be most effective?

I am considering Regression, SVM, KNN, and Hidden Markov Models. Any suggestions at all would be greatly appreciated.

Karthik
  • I did a -1 because this question makes no sense. It's like asking how to be rich. – ABCD Nov 23 '15 at 00:31
  • I apologize for being vague. But, like I said, I am in the process of strengthening my fundamentals, and just sought guidance. – Karthik Nov 23 '15 at 23:07
  • First of all, you'll need to tell us what you want to classify and what the input variables are. That should be the first thing you try to do. – ABCD Nov 23 '15 at 23:25
  • Yes, you're right. I want to classify the various cars in the dataset based on the following parameters: 1. buying (v-high, high, med, low) 2. maint (v-high, high, med, low) 3. doors (2, 3, 4, 5-more) 4. persons (2, 4, more) 5. lug_boot (small, med, big) 6. safety (low, med, high) – Karthik Nov 23 '15 at 23:57

1 Answer


You have a multi-class classification problem with 1728 samples and 6 categorical features:

buying       v-high, high, med, low
maint        v-high, high, med, low
doors        2, 3, 4, 5-more
persons      2, 4, more
lug_boot     small, med, big
safety       low, med, high

For the features, you need to one-hot encode each category, creating binary features like this (see the sketch after the class table below):

buying_v-high, buying_high, buying_med, buying_low, maint_v-high, ...

At the end you'll have

4+4+4+3+3+3 = 21

features. The output classes are:

class      N          N[%]
-----------------------------
unacc     1210     (70.023 %) 
acc        384     (22.222 %) 
good        69     ( 3.993 %) 
v-good      65     ( 3.762 %)  
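
Here is a minimal sketch of that encoding with pandas, assuming you have downloaded the `car.data` file from the UCI page linked in the question (the column names below follow the attribute description on that page):

    import pandas as pd

    # Column names follow the UCI attribute description; "class" is the target.
    cols = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

    # Assumes car.data was downloaded from the UCI page linked above.
    df = pd.read_csv("car.data", header=None, names=cols)

    X = pd.get_dummies(df.drop(columns="class"))   # one-hot encode the 6 features
    y = df["class"]

    print(X.shape)            # (1728, 21) -> the 21 binary features
    print(y.value_counts())   # class distribution: unacc, acc, good, vgood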

You need to try several classification algorithms to see which one works better. For evaluation you can use cross-validation, or you can put away say 728 of the samples and evaluate on those.

For model selection, iterate over several different classifiers available in machine-learning libraries and check which one performs better. I suggest using scikit-learn for simplicity.

You can find a simple iterator over several classifiers in this script.
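
As an illustration, a loop like the following compares a few scikit-learn classifiers with 5-fold cross-validation; the particular models and hyperparameters are just placeholders, assuming `X` and `y` were built as in the encoding sketch above:

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    # X, y as built above (one-hot features and class labels).
    models = {
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "Naive Bayes": BernoulliNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "SVM": SVC(),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
        print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")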

Remember that you need to tune some parameters for each model, and you shouldn't tune them on the test set. So it is better to divide your samples into 1000 (training set), 350 (development set), and 378 (test set). Use the development set to tune your parameters and to choose the best-performing model, and then use the test set to evaluate that model on unseen data.
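
A rough sketch of that split and tuning loop is below; it uses SVC's `C` purely as an illustrative parameter, and the exact model and grid of values are up to you:

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # First carve off the test set, then split the rest into train and dev.
    # Sizes roughly match the 1000 / 350 / 378 suggestion above.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=378, stratify=y, random_state=0)
    X_train, X_dev, y_train, y_dev = train_test_split(
        X_rest, y_rest, test_size=350, stratify=y_rest, random_state=0)

    # Tune a parameter on the development set only.
    best_C, best_acc = None, 0.0
    for C in [0.1, 1, 10, 100]:
        acc = SVC(C=C).fit(X_train, y_train).score(X_dev, y_dev)
        if acc > best_acc:
            best_C, best_acc = C, acc

    # Only the final, chosen model touches the test set.
    final = SVC(C=best_C).fit(X_train, y_train)
    print("test accuracy:", final.score(X_test, y_test))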

Ash
  • That certainly does help. However, I am not very familiar with libraries and their usage either. I thought I'd start off by learning them the hard way in the beginning so as to understand the nuances more clearly. The idea is to work with one algorithm at a time and observe the differences between different algorithms. – Karthik Nov 23 '15 at 23:43
  • I think the main groups of models you need to look at are: KNN (nonparametric), Naive Bayes (generative), and SGDClassifier (discriminative, e.g. LogisticRegression or SVM). I would start with LogisticRegression (SGDClassifier with log loss and regularization). – Ash Nov 24 '15 at 04:35
  • Yes, since this is a multi-class problem, I think those would be our options. Although my bet is on SVM, I want to test out the others as well. – Karthik Nov 25 '15 at 05:44