I can use SciPy to classify text on my machine, but I need to categorize string objects from HTTP POST requests at or near real time. What algorithms should I research if my goals are high concurrency, near real-time output, and a small memory footprint? I figured I could get by with a Support Vector Machine (SVM) implementation in Go, but is that the best algorithm for my use case?
1 Answer
Yes, SVM (with a linear kernel) should be a good starting point. You can use scikit-learn (it wraps liblinear, I believe) to train your model. Once the model is trained, it is simply a list of feature:weight pairs for each category you want to classify into. Something like this (suppose you have only 3 classes):
class1[feature1] = weight11
class1[feature2] = weight12
...
class1[featurek] = weight1k            ------- for class 1
... a different <feature, weight> list ------- for class 2
... a different <feature, weight> list ------- for class 3, etc.
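In Go (the backend the question mentions), such a model maps naturally onto nested maps. A minimal sketch, with made-up feature names and weights, just to fix ideas:

// model maps each class name to its learned feature weights.
// Only nonzero weights need to be stored, so the maps stay sparse.
var model = map[string]map[string]float64{
    "class1": {"feature1": 0.83, "feature2": -0.41},
    "class2": {"feature1": 0.12, "feature4": 0.95},
    "class3": {"feature2": 0.37, "feature5": -0.58},
}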
At prediction time, you don't need scikit-learn at all; you can use whatever language your server backend runs to do the linear computation. Suppose a specific POST request contains the features (feature3, feature5); then what you do looks like this:
linear_score[class1]  = 0
linear_score[class1] += lookup weight of feature3 in class1
linear_score[class1] += lookup weight of feature5 in class1
linear_score[class2]  = 0
linear_score[class2] += lookup weight of feature3 in class2
linear_score[class2] += lookup weight of feature5 in class2
... same thing for class3 ...
pick whichever of class1, class2, class3 has the highest linear_score
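As a concrete Go sketch over the nested-map model above (note: a real liblinear model also carries a per-class bias term, omitted here for brevity):

import "math"

// predict sums the weight of each active feature per class and returns
// the class with the highest linear score. Absent features contribute 0.
func predict(model map[string]map[string]float64, features []string) string {
    best, bestScore := "", math.Inf(-1)
    for class, weights := range model {
        score := 0.0
        for _, f := range features {
            score += weights[f] // a missing map key yields 0 in Go
        }
        if score > bestScore {
            best, bestScore = class, score
        }
    }
    return best
}

For example, predict(model, []string{"feature3", "feature5"}) performs exactly the lookups written out in the pseudocode above.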
One step further: if you have a way to assign each feature a per-request weight (e.g., the tf-idf score of each token), then your prediction becomes:
linear_score[class1] += class1[feature3] x feature_weight[feature3]
and so on for the other features and classes. Note that feature_weight[feature_k] usually differs from request to request, since it depends on the request's own token counts.
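In the same Go sketch, the tf-idf variant only changes the inner loop: each request now carries its own feature-to-weight map (again, all names are illustrative):

import "math"

// predictWeighted scores each class by the dot product of its learned
// weights with the request's per-request feature weights (e.g., tf-idf).
func predictWeighted(model map[string]map[string]float64, featWeight map[string]float64) string {
    best, bestScore := "", math.Inf(-1)
    for class, weights := range model {
        score := 0.0
        for f, w := range featWeight {
            score += weights[f] * w
        }
        if score > bestScore {
            best, bestScore = class, score
        }
    }
    return best
}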
Since, for each request, the number of active features is much smaller than the total number of features the model knows about (think 50 tokens in a request vs. a vocabulary of 1 million tokens), prediction should be very fast. Once your model is trained, the prediction step could even be implemented on top of a key-value store (e.g., Redis), as sketched below.
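To make that last idea concrete, here is a hedged sketch using the go-redis client, assuming the weights are stored as one Redis hash per class; the "weights:<class>" key layout is made up for illustration:

import (
    "context"

    "github.com/redis/go-redis/v9"
)

// scoreClass sums the stored weights of the request's active features
// for one class, reading them from the hash "weights:<class>".
func scoreClass(ctx context.Context, rdb *redis.Client, class string, features []string) (float64, error) {
    score := 0.0
    for _, f := range features {
        w, err := rdb.HGet(ctx, "weights:"+class, f).Float64()
        if err == redis.Nil {
            continue // this class has no weight for the feature
        }
        if err != nil {
            return 0, err
        }
        score += w
    }
    return score, nil
}

If round-trip latency matters, a single HMGet per class would batch all the field lookups into one Redis call.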
