I'm building a text classifier that should output, for each document, the probability that it belongs to each of several categories (e.g. 80% fiction, 30% marketing).

I believe LibSVM does this via the "predict" method, but I have approximately 20 categories to test for and several hundred documents available for training.

The problem is that the training file grows to 1-2 GB, which makes LibSVM extremely slow.

How can this be solved? Should I switch to Liblinear instead, or are there better options?

David Niki
  • Please do some basic research first. This includes the weakness of SVMs at predicting probabilities (keywords: Platt scaling; CV-based probability calibration) and their (usually) binary nature (although some multinomial configuration might exist). Then you can try to work out the difference between libsvm and liblinear. Anything else would be very broad here, or guessing. – sascha Jun 26 '18 at 22:50
  • SVMs can be used for predicting probabilities. Both LibSVM and Liblinear have methods for this. The only difference I've found so far is that LibSVM is a little bit more accurate and a LOT slower, making it useless for large training files – David Niki Jun 27 '18 at 17:18
  • For reference, LibSVM uses Platt scaling (https://stats.stackexchange.com/a/211973/212921) – David Niki Jun 27 '18 at 17:28
  • I am using LIBSVM for text-classification with document-collection > 40k documents (with or without enabled Platt-scaling) in a reasonable amount of time. How big is your feature space? – rzo1 Jun 28 '18 at 15:06
  • 500 documents for approx 20 categories. Stored on disk, it becomes 2 GB! – David Niki Jun 28 '18 at 18:16

1 Answer

For this specific question, I had to use Liblinear, because LibSVM kept running forever.

But if anyone wants to know how it eventually turned out:

  1. I switched from PHP / C++ to Python, which was tremendously easier, and I did not run into any memory issues.
  2. My case was actually "multi-label" classification. This article pointed me in the right direction, and the magpie project helped me accomplish the task.
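
To illustrate what "multi-label" means here, below is a minimal sketch in plain Python. It is not how magpie works and not what I ran in production; it is a toy one-vs-rest setup under my own assumptions: one independent logistic-regression model per category, trained with simple SGD on sparse bag-of-words counts. The function names (`featurize`, `train_one_vs_rest`, `predict_proba`), the example documents, and the hyperparameters are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def featurize(text):
    """Sparse bag-of-words: only words that actually occur are stored."""
    return Counter(text.lower().split())


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_one_vs_rest(docs, label_sets, labels, epochs=300, lr=0.1):
    """Train one independent binary logistic model per label with plain SGD."""
    feats = [featurize(d) for d in docs]
    models = {}
    for label in labels:
        weights = defaultdict(float)
        bias = 0.0
        for _ in range(epochs):
            for x, doc_labels in zip(feats, label_sets):
                y = 1.0 if label in doc_labels else 0.0
                p = sigmoid(bias + sum(weights[w] * c for w, c in x.items()))
                err = y - p  # gradient of the log-likelihood w.r.t. the score
                bias += lr * err
                for w, c in x.items():
                    weights[w] += lr * err * c
        models[label] = (dict(weights), bias)
    return models


def predict_proba(models, text):
    """Return one probability per label; the values need not sum to 1."""
    x = featurize(text)
    return {
        label: sigmoid(b + sum(wts.get(w, 0.0) * c for w, c in x.items()))
        for label, (wts, b) in models.items()
    }


# Hypothetical toy data: the last document carries both labels.
docs = [
    "the dragon flew over the castle",
    "buy now and save on our new product",
    "a novel about a wizard and a dragon",
    "limited offer buy our product today",
    "the story of a wizard who sells a product",
]
label_sets = [
    {"fiction"},
    {"marketing"},
    {"fiction"},
    {"marketing"},
    {"fiction", "marketing"},
]

models = train_one_vs_rest(docs, label_sets, labels=["fiction", "marketing"])
print(predict_proba(models, "a wizard buys a product"))
```

Because every category gets its own yes/no model, a document can score high on several categories at once, which is why figures like "80% fiction, 30% marketing" do not have to add up to 100%. Storing only the non-zero word counts per document also keeps the training data small, since most words never appear in most documents.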
David Niki