So I am trying (just for fun) to classify movies based on their descriptions. The idea is to "tag" movies, so a given movie might be both "action" and "humor" at the same time, for example.

Normally, a text classifier gives you the single class a given text belongs to, but in my case I want to assign 1 to N tags to each text.

Currently my training set looks like this:

+--------------------------+---------+
|        TEXT              |  TAG    |
+--------------------------+---------+
| Some text from a movie   |  action |
+--------------------------+---------+
| Some text from a movie   |  humor  |
+--------------------------+---------+
| Another text here        | romance |
+--------------------------+---------+
| Another text here        | cartoons|
+--------------------------+---------+
| And some text more       | humor   |
+--------------------------+---------+

What I do next is train one classifier per tag to tell me whether or not that tag applies to a given text. For example, to figure out whether or not a text should be classified as "humor", I would end up with the following training set:

+--------------------------+---------+
|        TEXT              |  TAG    |
+--------------------------+---------+
| Some text from a movie   |  humor  |
+--------------------------+---------+
| Another text here        |not humor|
+--------------------------+---------+
| And some text more       | humor   |
+--------------------------+---------+

Then I train a classifier that learns whether or not a text is humor (and I take the same approach with the rest of the tags). After that I end up with a total of 4 classifiers:

  • action / no action
  • humor / no humor
  • romance / no romance
  • cartoons / no cartoons

Finally, when I get a new text, I run it through each of the 4 classifiers. For each classifier that gives me a positive classification (that is, X instead of no-X) with a probability over a certain threshold (say 0.9), I assume that the new text belongs to tag X. I repeat this for each of the classifiers.

In particular I am using Naive Bayes as the algorithm, but the same approach could be applied with any algorithm that outputs a probability.
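Sketched in scikit-learn (the texts and the 0/1 tag columns below are just the toy rows from the table above, not real data), my setup looks roughly like this:

```python
# One binary Naive Bayes classifier per tag ("binary relevance").
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "Some text from a movie",
    "Another text here",
    "And some text more",
]
# One binary column per tag: 1 = tag applies, 0 = it does not.
labels = {
    "action":   [1, 0, 0],
    "humor":    [1, 0, 1],
    "romance":  [0, 1, 0],
    "cartoons": [0, 1, 0],
}

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Train one independent classifier per tag.
classifiers = {tag: MultinomialNB().fit(X, y) for tag, y in labels.items()}

def predict_tags(text, threshold=0.9):
    """Return every tag whose 'tag applies' probability clears the threshold."""
    x = vectorizer.transform([text])
    return [tag for tag, clf in classifiers.items()
            if clf.predict_proba(x)[0, list(clf.classes_).index(1)] >= threshold]
```

A new text can then be tagged with `predict_tags("some new description")`, possibly with a lower threshold while experimenting.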

Now the question is: is this approach correct? Am I doing something terribly wrong here? The results I get seem to make sense, but I would like a second opinion.

Juan Antonio Gomez Moriano

3 Answers


Yes, this makes sense. It is a well-known, basic technique for multilabel/multiclass classification, known as a "one vs all" (or "one vs rest") classifier. It is very old and widely used. On the other hand, it is also quite naive, since you do not consider any relations between your classes/tags. You might be interested in reading about structured learning, which covers settings where there is some structure over the label space that can be exploited (and usually there is).
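If you happen to be using scikit-learn, this strategy is already packaged for you; a minimal sketch (the texts and tag lists are made-up toy data):

```python
# One-vs-rest over binary Naive Bayes classifiers, with multilabel targets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["Some text from a movie", "Another text here", "And some text more"]
tags = [["action", "humor"], ["romance", "cartoons"], ["humor"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)      # one binary indicator column per tag

X = CountVectorizer().fit_transform(texts)
clf = OneVsRestClassifier(MultinomialNB()).fit(X, Y)

probs = clf.predict_proba(X)     # shape: (n_texts, n_tags), one column per tag
```

Internally this trains exactly one binary classifier per tag, as you described, so you can threshold the columns of `probs` yourself.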

lejlot

The problem you've described can be addressed by Latent Dirichlet Allocation, a statistical topic-modeling method for finding the underlying ("latent") topics in a collection of documents. This approach is based on a model where each document is a mixture of these topics.

In general, you initially decide upon the topics (in your case, the tags are the topics) and then run a trainer. The LDA software will then output a probability distribution over the topics for each document.

Here is a good introduction: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
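A minimal sketch with scikit-learn's implementation (toy documents; note that the topics LDA finds are unnamed and you would have to inspect their top words to map them onto tags):

```python
# LDA: each document gets a probability distribution over latent topics.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Some text from a movie", "Another text here", "And some text more"]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Rows of doc_topics sum to 1: the per-document mixture over the 2 topics.
doc_topics = lda.fit_transform(counts)
```

You could then tag a document with every topic whose probability in its row exceeds some threshold.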

stackoverflowuser2010
  • LDA is an unsupervised technique for finding latent variables, and has nothing to do with the OP's question, which is a fully supervised multiclass problem – lejlot Apr 14 '16 at 18:39
  • @lejlot: At no point in OP's question does he mention supervised learning. He says that he has attempted to use Naive Bayes, but using supervised learning is not the only approach. To quote: `I want to assign a text to 1 to N tags`. He also says `so for example, if I want to figure out whether or not a text is classified as "humor" `. LDA can solve these problems. – stackoverflowuser2010 Apr 14 '16 at 18:56
  • OP created a training set and uses a multiclass generalization of Naive Bayes. It is a very nicely described, basic supervised-setting approach. Unsupervised learning techniques are good when you are trying to find out something about your data. When your task is well specified, they are not a good approach, as their internal aim is not to solve your problem but rather to find a solution to their own. Just like with clustering — you will always find some clustering of the data, but assuming that it will relate in any way to your task is simply wrong. – lejlot Apr 14 '16 at 18:59
  • @lejlot: He tried to use Naive Bayes as a first attempt because that was his guess for how to solve the problem. That does not mean the solution must use supervised learning. – stackoverflowuser2010 Apr 14 '16 at 20:10
  • Just for the record: I AM using supervised learning here. Regardless of that, my question is whether or not my approach is correct. I appreciate any suggestions, of course... – Juan Antonio Gomez Moriano Apr 14 '16 at 22:01

Yes, your approach is correct, and it is a well-known strategy for enabling classifiers that are designed for binary classification to handle multi-class classification tasks.

Andrew Ng (of Stanford University) explains this approach here. Even though it is explained for logistic regression, as you mentioned, the idea can be applied to any algorithm that outputs a probability.

TrnKh
  • Yes, in fact I have done the course. However, in that video the approach is explained so that you can use logistic regression to classify between N classes rather than 2. What I want is similar, except that not only do I want N classes, I also want to ASSIGN up to N classes to a given text. That said, you are right, and by the way, my numbers seem to show the same :) – Juan Antonio Gomez Moriano Apr 17 '16 at 22:26
  • I think for that you can simply specify a probability threshold (possibly chosen by observing the current numbers) and assign any of the N classes for which the probability of the movie belonging to that class exceeds the threshold. I have seen academic articles use the same approach. – TrnKh Apr 18 '16 at 17:28
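The thresholding step described in the comment above can be as simple as the following (the per-tag probabilities are made-up numbers standing in for classifier output):

```python
# Made-up per-tag probabilities for one movie description.
probs = {"action": 0.95, "humor": 0.40, "romance": 0.92, "cartoons": 0.10}

threshold = 0.9
# Keep every tag whose probability clears the threshold.
assigned = [tag for tag, p in probs.items() if p >= threshold]
# assigned == ["action", "romance"]
```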