
I already have a trained classifier that I load through pickle. My main question is whether anything can speed up the classification task. It is taking almost one minute per text (feature extraction plus classification); is that normal? Should I move to multi-threading?
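One quick check before reaching for multi-threading is to time the two stages separately. A sketch with stand-in functions (hypothetical; the real code would wrap getReviewFeatures and classifier.classify the same way):

```python
import time

# Hypothetical stand-ins for the two stages in the question;
# substitute the real getReviewFeatures / classifier.classify calls.
def extract_features(review):
    return {"polarity": 0.5, "subjectivity": 0.6}

def classify(features):
    return "5 star"

review = "great phone, battery lasts long"

start = time.time()
features = extract_features(review)   # stage 1: feature extraction
mid = time.time()
rating = classify(features)           # stage 2: classification
end = time.time()

print("feature extraction: %.3fs" % (mid - start))
print("classification:     %.3fs" % (end - mid))
```

If nearly all of the minute lands in one stage, that is the place to optimize before adding threads.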

Here are some code fragments showing the overall flow:

for item in items:
    review = ''.join(item['review_body'])
    review_features = getReviewFeatures(review)
    normalized_predicted_rating = getPredictedRating(review_features)
    item_processed['rating'] = str(round(float(normalized_predicted_rating),1))

def getReviewFeatures(review, verbose=True):

    text_tokens = tokenize(review)

    polarity = getTextPolarity(review)

    subjectivity = getTextSubjectivity(review)

    taggs = getTaggs(text_tokens)

    bigrams = processBigram(taggs)
    freqBigram = countBigramFreq(bigrams)
    sort_bi = sortMostCommun(freqBigram)

    adjectives = getAdjectives(taggs)
    freqAdjectives = countFreqAdjectives(adjectives)
    sort_adjectives = sortMostCommun(freqAdjectives)

    word_features_adj = list(sort_adjectives)
    word_features = list(sort_bi)

    features={}
    for bigram,freq in word_features:
        features['contains(%s)' % unicode(bigram).encode('utf-8')] = True
        features["count({})".format(unicode(bigram).encode('utf-8'))] = freq

    for word,freq in word_features_adj:
        features['contains(%s)' % unicode(word).encode('utf-8')] = True
        features["count({})".format(unicode(word).encode('utf-8'))] = freq

    features["polarity"] = polarity
    features["subjectivity"] = subjectivity

    if verbose:
        print "Get review features..."    

    return features


def getPredictedRating(review_features, verbose=True):
    start_time = time.time()
    classifier = pickle.load(open("LinearSVC5.pickle", "rb" ))

    p_rating = classifier.classify(review_features) # in the form of "# star"
    predicted_rating = re.findall(r'\d+', p_rating)[0]
    predicted_rating = int(predicted_rating)

    best_rating = 5
    worst_rating = 1
    normalized_predicted_rating = 0
    normalized_predicted_rating = round(float(predicted_rating)*float(10.0)/((float(best_rating)-float(worst_rating))+float(worst_rating)))

    if verbose:
        print "Get predicted rating..."
        print "ML_RATING: ", normalized_predicted_rating
        print("---Took %s seconds to predict rating for the review---" % (time.time() - start_time)) 

    return normalized_predicted_rating
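As a side note, the normalization in getPredictedRating reduces to rating * 2, because (best_rating - worst_rating) + worst_rating is 5, so a 1-star review maps to 2, not 0. A minimal check:

```python
def normalize(rating, best_rating=5, worst_rating=1):
    # same formula as in getPredictedRating above
    return round(float(rating) * 10.0 / ((best_rating - worst_rating) + worst_rating))

# map every star rating from 1 to 5 onto the 10-point scale
print([normalize(r) for r in range(1, 6)])  # [2, 4, 6, 8, 10]
```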
Inês Martins
  • Have you ever profiled your code to inspect where exactly it takes so long? With the given information it's hard to say whether the time is normal. – colidyre Oct 06 '15 at 14:15
  • @colidyre Yes, it is the classifier.classify(review_features) that is taking 50 seconds. – Inês Martins Oct 06 '15 at 14:30
  • If the code works, you should ask it on [codereview](http://codereview.stackexchange.com/) – Leb Oct 06 '15 at 15:09
  • 1
    I think the `pickle` mentioned in the question has nothing to do with your problem iff [sic!] the classifier is the main reason for slowness. If it's a good idea to pickle the trained model is another question, imo. – colidyre Oct 06 '15 at 15:44
  • 1
    What is the dimension of `review_features`? – Flavio Ferrara Oct 06 '15 at 16:00
  • 1
    It's related to the topic but not to your specific problem. So see [this question](http://stackoverflow.com/q/22443041/2648551) as a side information. – colidyre Oct 06 '15 at 16:22
  • @FlavioFerrara the length of review_features varies a lot, ranging from 8 to 30 – Inês Martins Oct 06 '15 at 16:40
  • 1
    8-30 is the length (number of examples) or the dimension (number of features) of the array? Anyway, it seems indeed very slow. There is some problem there. – Flavio Ferrara Oct 06 '15 at 16:57
  • I am only extracting 4 types of features: adjectives (their presence and their counts), bigrams (their presence and their counts), polarity and subjectivity. – Inês Martins Oct 06 '15 at 17:00
  • Guys, I found that my real problem was indeed the pickle... I was loading it every time I needed to classify a review... so I set it as a global variable and opened it at the beginning of the script... thanks anyway for your suggestions! – Inês Martins Oct 08 '15 at 11:02
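The fix described in that last comment, loading the pickle once instead of on every call, can be sketched like this (DummyClassifier and the file path are stand-ins for the trained LinearSVC and "LinearSVC5.pickle" from the question):

```python
import os
import pickle
import tempfile

# Stand-in for the trained classifier from the question.
class DummyClassifier(object):
    def classify(self, features):
        return "5 star"

# Save a model once, as the training script would.
MODEL_PATH = os.path.join(tempfile.gettempdir(), "demo_classifier.pickle")
with open(MODEL_PATH, "wb") as f:
    pickle.dump(DummyClassifier(), f)

# Slow pattern: reloading the pickle inside every prediction call.
def predict_reloading(features):
    with open(MODEL_PATH, "rb") as f:
        clf = pickle.load(f)
    return clf.classify(features)

# Fast pattern: load once at startup and reuse the object.
with open(MODEL_PATH, "rb") as f:
    CLASSIFIER = pickle.load(f)

def predict_cached(features):
    return CLASSIFIER.classify(features)

print(predict_cached({"polarity": 0.5}))
```

For a large model, the per-call deserialization in the slow pattern can easily dominate the prediction time.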

2 Answers


NLTK is a great tool and a good starting point for natural language processing, but it is not always the best choice when speed matters, as the authors themselves imply:

NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”

So if your problem lies only in the speed of the toolkit's classifier, you either have to use another resource or write the classifier yourself.

scikit-learn might be helpful for you if you want a classifier that is probably faster.
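As a sketch of that route (the feature dicts and labels below are made-up stand-ins for the question's data):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data shaped like the question's feature dicts.
train_features = [
    {"contains(good)": True, "polarity": 0.8, "subjectivity": 0.6},
    {"contains(bad)": True, "polarity": -0.7, "subjectivity": 0.9},
]
train_labels = ["5 star", "1 star"]

# DictVectorizer turns the dicts into the numeric matrix LinearSVC expects.
model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit(train_features, train_labels)

prediction = model.predict(
    [{"contains(good)": True, "polarity": 0.9, "subjectivity": 0.5}]
)
print(prediction[0])
```

The fitted pipeline can then be pickled once and reused, so the vectorizer and the classifier always agree on the feature columns.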

colidyre
  • my classifier is a LinearSVC from sklearn.svm. Training the classifier took 16 h and I saved it to a pickle. Now I only need it for the classifier.classify() task. I am new to ML; I just wanted to know whether 1 sec to predict the category of a text is normal or whether it could be reduced – Inês Martins Oct 06 '15 at 15:53
  • 1
    I strongly recommend using Scikit (now linked in answer) if you're interested in a faster way to classify. It's also used by some big companies as you can see [here](http://scikit-learn.org/stable/testimonials/testimonials.html) – colidyre Oct 06 '15 at 16:17

It seems that you use a dictionary to build the feature vector. I strongly suspect that the problem is there.

The proper way would be using a numpy ndarray, with examples on rows and features on columns. So, something like

import numpy as np
# let's suppose 6 different features = 6-dimensional vector
feats = np.zeros((1, 6))  # note: np.array((1, 6)) would create the 1-D array [1, 6]
# column 0 contains polarity, column 1 subjectivity, and so on..
feats[:, 0] = polarity
feats[:, 1] = subjectivity
# ....
classifier.classify(feats)

Of course, you must use the same data structure and respect the same convention during training.
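One way to respect that convention, as a sketch (the feature names are hypothetical examples of keys the question's getReviewFeatures() might emit):

```python
import numpy as np

# Fix a feature order once and reuse it for both training and prediction.
FEATURE_ORDER = ["polarity", "subjectivity",
                 "contains(great_phone)", "count(great_phone)"]

def to_vector(features):
    # Missing keys become 0.0, so every review maps to the same columns.
    return np.array([[float(features.get(name, 0.0)) for name in FEATURE_ORDER]])

vec = to_vector({"polarity": 0.5, "subjectivity": 0.9})
print(vec.shape)  # (1, 4)
```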

Flavio Ferrara
  • 1
    If you're using `np.array` in comparison to a `dictionary`, it would be great if you can measure both times and present the measured values here. – colidyre Oct 06 '15 at 19:04
  • 1
    I can't reproduce the exact problem configuration of the OP. Yet, my experience with LinearSVC teaches me it can classify hundreds of examples with thousands of features in a couple of seconds. – Flavio Ferrara Oct 06 '15 at 19:35
  • 1
    Oh sorry. This was more addressed to the OP that she can measure both times and present it here. – colidyre Oct 06 '15 at 19:42