
I have a dataset from one particular domain (say sports: 1 class). When I feed a web page to the classifier/clusterer, I want a result telling me whether that instance (web page) is related to sports or not.

Most of the classifiers in Weka cannot deal with unary-class datasets, except LibSVM (via its wrapper). I did some tests with LibSVM, but the problem is that when I test on an unrelated dataset, all of the instances are classified as belonging to the class, even if the instances are empty! Any suggestions?
What if I use a cosine similarity measure here?

samsamara

1 Answer


Have you seen the thread "unary class text classification in weka", and this post: https://list.scms.waikato.ac.nz/mailman/htdig/wekalist/2007-October/011631.html ?

I'm assuming you meant that when you run the classifier against another dataset that is not "sports", the instances are incorrectly classified as "this is sports" (i.e. false positives).

Are you certain your dataset only contains one class? Did you make sure the dataset does not contain any empty instances? (don't mock, this has happened to me before).

In the comments of the previously mentioned thread there is a link to a PDF on tuning SVMs: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf - I would say SVMs are a bit harder to tune than other common classifiers.

As an alternative, can't you turn the problem into binary classification? It's much easier to get good results that way, and for most problems there are plenty of examples of things that are not in the class, e.g. sports websites vs. funny image websites, programming websites, etc.

PS: you can use other algorithms for outlier detection: http://en.wikipedia.org/wiki/Outlier_detection

rei
  • Yes, I have seen that thread, because I started it :) and I have seen all the other resources you mentioned as well. Yes, your assumption is right. My dataset contains instances of a single class only (I'm using Weka, and the path contains only one folder -> 1 class). I rechecked the training dataset and there are no empty instances (no need for mocking :D). I have tuned the SVM gamma and nu parameters but couldn't get a reliable model. I can't go with binary classification here, as I'm doing this for web crawling research and the web pages you get are not known in advance. Contd.. – samsamara May 13 '12 at 15:35
  • What about using cosine similarity here? I can build the centroid from the most frequent words in the training data; then, when a new instance is fed in, I can use its similarity score to determine how relevant it is, right? – samsamara May 13 '12 at 15:38
  • Yes. That sounds like a good solution. It should work regardless of measure (cosine, euclidean, etc). – rei May 13 '12 at 22:49
  • If you're going to do that, you'll get better results if you weight the word frequencies (effectively removing common words from the equation) and take stop words into account. Also give a lot more weight to words in the page title. One thing though: for web data it's easy to get a lot of example data. I'm sort of doing that using RSS feed links: from news sites that have specific feeds for different categories (business, sports, politics), directories, etc., you can build huge example datasets. – rei May 13 '12 at 22:58
  • I did a test in Weka with Euclidean distance as the distance measure, but all my test instances were clustered into the same cluster. I don't really understand why. What could be the reason? – samsamara May 14 '12 at 03:01
  • That is what we were trying to do: 1 category (e.g. sports), 1 cluster (assuming the cluster was calculated correctly). Now calculate the distance between the centroid of the cluster and examples from different categories (e.g. programming, cooking). If it's working, those distances should be significantly larger than the distances between the centroid and the examples from which the cluster was calculated (e.g. sports). – rei May 14 '12 at 04:50
  • So did you use cosine similarity for that? Did you get satisfactory results? – samsamara May 14 '12 at 05:04
  • Sorry, in the previous comment I meant "we" as in you :) Now that you have the centroid of the cluster, try to use it that way. In theory it should work as a single-class classifier. I've never used it that way myself. – rei May 14 '12 at 07:47
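
For anyone who wants to try the centroid idea from the comments outside Weka, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not what either of us actually ran: the tiny stop-word list, the three training sentences, and the helper names (`to_unit_vector`, `build_centroid`, `relevance`, `is_relevant`) are all made up for the example, and the threshold passed to `is_relevant` is a free parameter you would have to tune on held-out pages.

```python
import math
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; use a proper list in practice).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with", "this"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def to_unit_vector(text):
    """Term-frequency vector scaled to unit length; empty dict for empty text."""
    counts = defaultdict(float)
    for word in tokenize(text):
        counts[word] += 1.0
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {w: v / norm for w, v in counts.items()} if norm else {}


def build_centroid(texts):
    """Average of the unit vectors of all training pages (the single class)."""
    total = defaultdict(float)
    for text in texts:
        for word, value in to_unit_vector(text).items():
            total[word] += value
    return {w: v / len(texts) for w, v in total.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either one is all zeros."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance(text, centroid):
    """Similarity between a new page and the class centroid."""
    return cosine(to_unit_vector(text), centroid)


def is_relevant(text, centroid, threshold):
    """Accept the page only if it is close enough to the centroid."""
    return relevance(text, centroid) >= threshold


# Hypothetical one-class training data (sports only).
training_pages = [
    "The football team won the match with a late goal",
    "The tennis player won the match in straight sets",
    "The basketball team scored in the final match of the season",
]
centroid = build_centroid(training_pages)

for page in ["The team scored a goal and won the match",
             "This recipe needs flour, butter and sugar for the cake",
             ""]:
    print(repr(page), round(relevance(page, centroid), 3))
```

Note that an empty page gets a similarity of 0.0 here, so it is rejected for any positive threshold, which avoids the "empty instances are accepted" symptom from the question. The sketch uses raw term frequencies; the weighting and title boosting suggested above would replace the counts in `to_unit_vector`.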