I have built a classification model using Weka. I have two classes, {spam, non-spam}. After applying the StringToWordVector filter, I get 10,000 attributes for 19,000 records. I then use the LibLINEAR library to build the model, which gives the following F-scores: spam 94%, non-spam 98%.

When I use the same model to predict new instances, it predicts all of them as spam. Also, when I use a test set identical to the training set, it predicts all of them as spam too. I am mentally exhausted from trying to find the problem. Any help will be appreciated.

user2335004

1 Answer

I also get this wrong every so often. Then I watch this video to remind myself how it's done: https://www.youtube.com/watch?v=Tggs3Bd3ojQ where Prof. Witten, one of the Weka developers/architects, shows how to use the FilteredClassifier (which in turn is configured to load the StringToWordVector filter) correctly on the training dataset and the test set.

This is shown for Weka 3.6; Weka 3.7 might be slightly different.
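
To illustrate why the FilteredClassifier matters, here is a toy sketch in plain Java. It is not Weka code: the class `ToyWordVectorizer` and its methods are hypothetical names made up for this example. The idea it shows is that the word dictionary is learned from the training documents only, and that exactly the same dictionary is then used to turn any later document (test set or new instance) into a vector. If the filter is instead run separately on the test data, the attributes can end up in a different order or number than the ones the model was trained on.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Toy bag-of-words vectorizer (hypothetical helper, NOT part of Weka).
final class ToyWordVectorizer {
    // Word -> attribute index, in order of first appearance in the training data.
    private final Map<String, Integer> index = new LinkedHashMap<>();

    // Learn the dictionary from the training documents only.
    void fit(String[] trainingDocs) {
        for (String doc : trainingDocs) {
            for (String token : tokenize(doc)) {
                index.putIfAbsent(token, index.size());
            }
        }
    }

    // Map any document onto the learned dictionary; unseen words are ignored,
    // so every vector has the same length and attribute order.
    int[] transform(String doc) {
        int[] counts = new int[index.size()];
        for (String token : tokenize(doc)) {
            Integer i = index.get(token);
            if (i != null) {
                counts[i]++;
            }
        }
        return counts;
    }

    int size() {
        return index.size();
    }

    private static String[] tokenize(String doc) {
        String trimmed = doc.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```

Calling `fit` once on the training documents and then `transform` on every training, test, and new document keeps the attribute layout identical everywhere, which is roughly the bookkeeping a FilteredClassifier takes care of for you.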

What accuracy does ZeroR (which always predicts the majority class) give you? If it's close to 100%, you know that any classification algorithm should not be too far off either.
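
As a rough sketch of what such a baseline measures, the plain-Java snippet below computes the accuracy you would get by always predicting the most frequent label. `MajorityBaseline` is a hypothetical helper written for this answer, not Weka's ZeroR implementation, and the labels in the usage note are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Majority-class baseline (hypothetical helper, NOT Weka's ZeroR class).
final class MajorityBaseline {
    // Returns the most frequent label; on a tie, the one that occurs first wins.
    static String majorityClass(String[] labels) {
        if (labels.length == 0) {
            throw new IllegalArgumentException("no labels given");
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (String label : labels) {
            int c = counts.get(label);
            if (c > bestCount) {
                best = label;
                bestCount = c;
            }
        }
        return best;
    }

    // Accuracy obtained by predicting the majority class for every instance.
    static double accuracy(String[] labels) {
        String majority = majorityClass(labels);
        int hits = 0;
        for (String label : labels) {
            if (label.equals(majority)) {
                hits++;
            }
        }
        return (double) hits / labels.length;
    }
}
```

For example, with three "spam" labels and one "non-spam" label, `majorityClass` returns "spam" and `accuracy` returns 0.75. A real classifier needs to beat that number to be doing anything useful.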

Why do you optimize for F-measure? Just asking; I have never used it and don't know much about it. (I would optimize for precision, assuming you have much more spam than non-spam.)
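
For reference, here is a small plain-Java sketch of how precision, recall, and the F-measure (F1, the harmonic mean of the two) relate for one class. `BinaryMetrics` is a hypothetical helper, and the counts in the usage note are invented for illustration, not taken from the question.

```java
// Precision, recall, and F1 for a single class (hypothetical helper).
final class BinaryMetrics {
    // tp = true positives, fp = false positives, fn = false negatives.
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // F1 is the harmonic mean of precision and recall.
    static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }
}
```

With made-up counts tp = 8, fp = 2, fn = 8, precision is 0.8, recall is 0.5, and F1 is about 0.615. Note that a model that labels everything as spam reaches a spam recall of 1.0, so looking at precision and recall separately can make that failure easier to spot.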

knb