
I have referred to these two links to run the Mahout NB classifier:

[1] http://tharindu-rusira.blogspot.com/2014/01/naive-bayes-classification-apache-mahout.html
[2] http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/

I would like to use my own test set instead of having Mahout split my data into training and test sets (80:20). How can I achieve that?

Eyal
mfmz

1 Answer


Take two datasets: one for training and one for testing.

Run the commands below on both datasets:
1. seqdirectory
2. seq2sparse
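
The two steps above can be sketched as follows, assuming Mahout is on the PATH and the input text sits in per-category subdirectories; all directory names here are placeholders, not paths from the question:

```shell
# Convert each directory of raw text files into SequenceFiles
mahout seqdirectory -i train-data -o train-seq -ow
mahout seqdirectory -i test-data  -o test-seq  -ow

# Turn the SequenceFiles into TF-IDF vectors
mahout seq2sparse -i train-seq -o train-vectors -lnorm -nv -wt tfidf
mahout seq2sparse -i test-seq  -o test-vectors  -lnorm -nv -wt tfidf
```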

Now you will have vectors generated for both datasets.
- Run the trainnb command on the first dataset's vector output. So instead of training a model on 80% of the data, we train on the whole first dataset.
- Run the testnb command on the second dataset's vector output. This is not 20% of the original data; it is a completely separate dataset, used solely for testing.

So instead of using mahout split, we have specified our own dataset for testing the model.
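
A sketch of those two commands, continuing from the vectors generated earlier; paths are again placeholders. Note that testnb must reuse the label index written during training, which matters for the label-mismatch issue discussed in the comments:

```shell
# Train on the full first dataset, extracting labels into a label index
mahout trainnb -i train-vectors/tfidf-vectors -el -li labelindex -o model -ow -c

# Test on the separate second dataset, reusing the same label index
mahout testnb -i test-vectors/tfidf-vectors -m model -l labelindex -o test-results -ow -c
```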

Rajkumar
  • 249
  • 1
  • 8
  • This sounds sensible, and it's what I did. But I got results that were totally different from what I got when Mahout split the data itself at a similar percentage - I have four categories, and it classified everything as one of them instead of dividing them up correctly (as it more or less did when it split the input itself) – Eyal Nov 18 '14 at 09:49
  • My guess is that this is connected to the labelindex - that there's a mismatch between the labels of the test and training set. Does that sound plausible? – Eyal Nov 18 '14 at 13:52
  • Yes, the labels need to be the same. We should test the model with the same set of labels that we used for training. – Rajkumar Nov 18 '14 at 14:55
  • Unfortunately, when I look at the labels (using seqdumper) they are identical. I don't know why, but the model is classifying everything as "other" (one of my four categories) - whereas when I used split it identified them with 80% accuracy. – Eyal Nov 18 '14 at 15:22
  • What % accuracy do you get when you run the test with the separate dataset? Are you sure your test data has the same labels? – Rajkumar Nov 18 '14 at 15:38
  • Yes, there are three identical labels and "other". When I use split and divide it 80/20, I get between 64-75% accuracy (depending on the category). When I use my own test set, everything gets classified as "other". Since 75% of my test set is "other", technically I get 75% accuracy, but in an unacceptable way. – Eyal Nov 18 '14 at 15:43
  • Why is it that every time I use the split command and test the classifier, my results always leave some classes out? For instance, if my training data contains 3 classes and I split it 80/20, the test output only covers the first class. It's as if split did not evenly divide the training and testing sets. Is there a 'randomize' option I need to pass for Mahout to split my data evenly? – mfmz Nov 19 '14 at 13:12
  • I checked the test set generated by the split command. It chose only one class. How can I make it choose from all the classes? – mfmz Nov 19 '14 at 13:33