
I have referred to these two links to run the Mahout NB classifier:

[1] http://tharindu-rusira.blogspot.com/2014/01/naive-bayes-classification-apache-mahout.html
[2] http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/

I would like to use my own test set instead of having Mahout split my data into training and test sets (80:20). How can I achieve that?

Eyal
mfmz

1 Answer


Take two datasets: one for training and one for testing.

Run the commands below on both datasets:
1. seqdirectory
2. seq2sparse
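
The two steps above can be sketched as follows, assuming Mahout is on the PATH and the input text sits in per-category subdirectories; all directory names here are placeholders, not paths from the question:

```shell
# Convert each directory of raw text files into SequenceFiles
mahout seqdirectory -i train-data -o train-seq -ow
mahout seqdirectory -i test-data  -o test-seq  -ow

# Turn the SequenceFiles into TF-IDF vectors
mahout seq2sparse -i train-seq -o train-vectors -lnorm -nv -wt tfidf
mahout seq2sparse -i test-seq  -o test-vectors  -lnorm -nv -wt tfidf
```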

Now you will have vectors generated for both datasets.
- Run the trainnb command on the first dataset's vector output. So instead of training a model on 80% of the data, we train on the whole first dataset.
- Run the testnb command on the second dataset's vector output. This is not 20% of the original data; it is a completely separate dataset, used solely for testing.

So instead of using mahout split, we have specified our own dataset for testing the model.
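
A sketch of those two commands, continuing from the vectors generated earlier; paths are again placeholders. Note that testnb must reuse the label index written during training, which matters for the label-mismatch issue discussed in the comments:

```shell
# Train on the full first dataset, extracting labels into a label index
mahout trainnb -i train-vectors/tfidf-vectors -el -li labelindex -o model -ow -c

# Test on the separate second dataset, reusing the same label index
mahout testnb -i test-vectors/tfidf-vectors -m model -l labelindex -o test-results -ow -c
```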

Rajkumar
  • 249
  • 1
  • 8
  • This sounds sensible, and it's what I did. But I got results that were totally different from what I got when Mahout split the data itself at a similar percentage - I have four categories, and it classified everything as one of them instead of dividing them up correctly (as it more or less did when it split the input itself) – Eyal Nov 18 '14 at 09:49
  • My guess is that this is connected to the labelindex - that there's a mismatch between the labels of the test and training set. Does that sound plausible? – Eyal Nov 18 '14 at 13:52
  • Yes, the labels need to be the same. We should test the model with the same set of labels that we used for training. – Rajkumar Nov 18 '14 at 14:55
  • Unfortunately, when I look at the labels (using seqdumper) they are identical. I don't know why, but the model is classifying everything as "other" (one of my four categories) - whereas when I used split it identified them with 80% accuracy. – Eyal Nov 18 '14 at 15:22
  • What % accuracy do you get when you run the test with the separate dataset? Are you sure your test data has the same labels? – Rajkumar Nov 18 '14 at 15:38
  • Yes, there are three identical labels and "other". When I use split and divide it 80/20, I get between 64-75% accuracy (depending on the category). When I use my own test set, everything gets classified as "other". Since 75% of my test set is "other", technically I get 75% accuracy, but in an unacceptable way. – Eyal Nov 18 '14 at 15:43
  • Why is it that every time I use the split command and test the classifier, my results always leave some classes out? For instance, if my training data contains 3 classes and I split it 80/20, the test output only covers the first class. It's as if split did not evenly divide the training and testing sets. Is there a 'randomize' option I need to pass for Mahout to split my data evenly? – mfmz Nov 19 '14 at 13:12
  • I checked the test set generated by the split command. It chose only one class. How can I make it choose from all the classes? – mfmz Nov 19 '14 at 13:33