I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc). My problem is handling text that isn't one of these labels.

I tried ML libraries (maxent, naive bayes), but they match "other" text incorrectly with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad and it's not possible to pick a representative sample.

Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But given the counts, how do I determine a relevancy score, i.e. whether the text actually belongs to that label? I don't have a corpus and can't use tf-idf, etc.
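To make the term-list idea concrete, here is a minimal sketch of what I have in mind (the term lists and the threshold are invented examples, not real curated lists):

```python
from collections import Counter
import re

# Hypothetical keyword lists -- in practice these would be curated per label.
LABEL_TERMS = {
    "finance": {"stock", "market", "investment", "bank", "asset"},
    "tech": {"software", "cloud", "startup", "developer", "hardware"},
}

def relevancy_scores(text, min_score=0.02):
    """Score each label as the fraction of tokens that hit its term list.

    Labels scoring below `min_score` are dropped; an empty result
    means the text is "other".
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    scores = {}
    for label, terms in LABEL_TERMS.items():
        hits = sum(counts[t] for t in terms)
        score = hits / total
        if score >= min_score:
            scores[label] = score
    return scores

print(relevancy_scores("The stock market fell as bank assets shrank."))
# -> {'finance': 0.375}
print(relevancy_scores("Recipes for sourdough bread and pastries."))
# -> {} (no label clears the threshold, so "other")
```

The open question is how to choose `min_score` so that off-topic text reliably scores below it.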

cherpa123
  • 1
  • 2
  • Why can't you use TF-IDF? That's the canonical tool for getting text comparison metrics. – Prune Apr 21 '17 at 23:30
  • 1
    BTW, you keep saying that you have no background in this stuff. This is not an excuse: you *need* to develop skills in this area to solve your problem. StackOverflow is *not* a coding service. As far as I know, there are no pre-packaged solutions to your paradigm -- and SO is not the place to look for them. – Prune Apr 21 '17 at 23:33
  • TF-IDF is probably most feasible -- thanks for pointing it out. My team wanted to avoid a solution that involves collecting training data, so that's why I was asking if there was a simpler solution. – cherpa123 Apr 24 '17 at 15:00

5 Answers

1

Another idea is to use a neural network with a softmax output function. Softmax gives you a probability for every class: when the network is very confident about a class, it assigns that class a high probability and lower probabilities to the other classes, but when it is unsure, the differences between the probabilities are small and none of them is very high. So you could define a threshold, e.g.: if the probability for every class is less than 70%, predict "other".
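A rough sketch of the softmax-plus-threshold idea (the labels and the raw scores fed in are made up for illustration):

```python
import math

LABELS = ["finance", "tech", "sports", "health", "travel"]

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits, threshold=0.70):
    """Return the label whose softmax probability clears the threshold,
    otherwise fall back to "other"."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best] if probs[best] >= threshold else "other"

# Confident network output: one score dominates.
print(predict([4.0, 0.5, 0.2, 0.1, 0.3]))   # -> finance
# Unsure network output: probabilities are spread out.
print(predict([1.1, 1.0, 0.9, 1.0, 0.8]))   # -> other
```

In a real setup the logits would come from a trained network rather than being hand-written, and the 70% threshold would be tuned on held-out data.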

Luis Leal
  • 3,388
  • 5
  • 26
  • 49
  • Thank you for your suggestion. Using neural networks would be great, but it seems a lot more difficult to use than ML, and I have no background in this stuff. – cherpa123 Apr 21 '17 at 23:12
0

Whew! Classic ML algorithms don't combine multi-class classification and "in/out" detection at the same time. Perhaps what you could do is train five models, one for each class, with one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claims it, it's "other".

Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
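A toy sketch of the two-stage wiring, with nearest-centroid stand-ins in place of the real one-class and multi-class SVMs (the 2-D features, centroids, and radius are all invented; only the control flow matters here):

```python
# Stage 1 decides in-domain vs. "other"; stage 2 picks one of the known
# labels. Nearest-centroid stand-ins replace the real SVMs.

def in_domain(features, centroid, radius):
    """Stage 1: crude one-class boundary -- inside the radius means 'known'."""
    dist = sum((f - c) ** 2 for f, c in zip(features, centroid)) ** 0.5
    return dist <= radius

def classify(features, centroids):
    """Stage 2: assign the nearest class centroid."""
    return min(centroids, key=lambda lbl: sum(
        (f - c) ** 2 for f, c in zip(features, centroids[lbl])))

CENTROIDS = {"finance": [1.0, 0.0], "tech": [0.0, 1.0]}  # toy 2-D features
DOMAIN_CENTER, RADIUS = [0.5, 0.5], 1.0

def predict(features):
    if not in_domain(features, DOMAIN_CENTER, RADIUS):
        return "other"
    return classify(features, CENTROIDS)

print(predict([0.9, 0.1]))   # -> finance
print(predict([5.0, 5.0]))   # -> other
```

With real models, stage 1 would be a one-class SVM fitted on all in-domain documents and stage 2 a 5-class SVM, but the hand-off is exactly this: anything stage 1 rejects never reaches stage 2.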

Prune
  • 76,765
  • 14
  • 60
  • 81
  • Thanks, but how do I pick training data that's not in one of my classes? It seems like a huge amount of work to collect the training data and I may overlook a topic that has words in common with one of my labels. Is there anything simpler than ML that would work involving term frequency counts? – cherpa123 Apr 21 '17 at 21:13
  • For the second possibility, you don't. Use the "one-class" SVM, which finds a reasonable boundary for the present data. I think this is likely to be the more accurate approach. – Prune Apr 21 '17 at 21:19
  • Do you mean use 1-class SVM instead of a binary classifier you originally suggested for the second possibility? – cherpa123 Apr 21 '17 at 21:28
  • 1-class SVM *is* a binary classifier: there are simply no observations in the **0** class, but it still classifies as "good" and "bad". – Prune Apr 21 '17 at 21:32
  • Thank you, this was very helpful. – cherpa123 Apr 21 '17 at 21:39
  • Glad to hear it! When you get to a resolution, please remember to up-vote useful things and accept your favourite answer (even if you have to write it yourself), so Stack Overflow can properly archive the question. – Prune Apr 21 '17 at 21:40
  • I looked into SVM, but unfortunately it's over my head. LIBSVM seemed most useable (I need a Java library), but I don't understand how to easily build the training data or how a model trained on numbers knows how to test text. – cherpa123 Apr 21 '17 at 23:10
0

What about creating histograms? You could use a bag-of-words approach using significant indicator words for, e.g., Tech and Finance. You could try to identify such indicators by analyzing a given website's tags and articles, or just browse the web for such indicators:

http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html

Let's say your input vector X has n dimensions, where n represents the number of indicators. For example, X_i would then hold the count of occurrences of the word "asset", and X_(i+k) the count of the phrase "big data" in the current article.

Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.

If you must support matching zero or more categories, train a model that returns a probability score (such as a neural net, as Luis Leal suggested) per label/class. You could then rank your output by that score and say that every class with a score higher than some threshold t is a matching category.
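The per-label thresholding can be sketched in a few lines (the scores and the threshold are invented; in practice each score would come from an independent per-class output such as a sigmoid):

```python
def matching_labels(probabilities, threshold=0.5):
    """Return every label whose independent score clears the threshold.

    With one score per label, zero, one, or several labels can match;
    an empty list means the text falls into the catch-all category.
    """
    return [label for label, p in probabilities.items() if p >= threshold]

print(matching_labels({"finance": 0.84, "tech": 0.61, "sports": 0.07}))
# -> ['finance', 'tech']
print(matching_labels({"finance": 0.12, "tech": 0.30, "sports": 0.05}))
# -> []
```

Note this differs from the softmax approach: softmax scores compete (they sum to 1), whereas independent per-label scores allow genuine multi-label output.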

Bastian
  • 1,553
  • 13
  • 33
0

Try this NBayes implementation.
For identifying the "Other" category, don't bother much. Just train on your required categories with data that clearly identifies them, and introduce a threshold in the classifier.
If the value for a label does not cross the threshold, the classifier assigns the "Other" label.

It's all in the training data.
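A toy sketch of the thresholded Naive Bayes idea (the vocabulary, training words, and threshold are all invented, and a real implementation would train on far more data):

```python
import math
from collections import Counter

class ThresholdedNB:
    """Toy multinomial Naive Bayes that emits "Other" when no label's
    posterior probability clears a confidence threshold."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.word_counts = {}
        self.totals = {}

    def train(self, label, words):
        self.word_counts.setdefault(label, Counter()).update(words)
        self.totals[label] = self.totals.get(label, 0) + len(words)

    def predict(self, words):
        vocab = {w for c in self.word_counts.values() for w in c}
        log_probs = {}
        for label, counts in self.word_counts.items():
            lp = 0.0
            for w in words:
                # Laplace smoothing keeps unseen words from zeroing out a label.
                lp += math.log((counts[w] + 1) / (self.totals[label] + len(vocab)))
            log_probs[label] = lp
        # Convert log-probabilities to posteriors (uniform priors assumed).
        m = max(log_probs.values())
        exps = {l: math.exp(lp - m) for l, lp in log_probs.items()}
        z = sum(exps.values())
        best = max(exps, key=exps.get)
        return best if exps[best] / z >= self.threshold else "Other"

nb = ThresholdedNB(threshold=0.8)
nb.train("finance", ["stock", "market", "bank", "asset", "stock"])
nb.train("tech", ["software", "cloud", "code", "startup", "server"])
print(nb.predict(["stock", "market"]))   # -> finance
print(nb.predict(["bread", "recipe"]))   # -> Other (no label is confident)
```

The threshold here acts on the normalized posterior, so text whose words are equally (un)likely under every label falls through to "Other" without any training data for that category.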

arjun
  • 1,594
  • 16
  • 33
0

AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.

Classifier4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training on non-matches. http://classifier4j.sourceforge.net/usage.html

cherpa123
  • 1
  • 2