
I currently have a utilities.py file that contains this machine learning function:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import models
import random

words = [w.strip() for w in open('words.txt') if w == w.lower()]
def scramble(s):
    return "".join(random.sample(s, len(s)))

@models.db_session
def check_pronounceability(word):

    scrambled = [scramble(w) for w in words]

    X = words+scrambled
    y = ['word']*len(words) + ['unpronounceable']*len(scrambled)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    text_clf = Pipeline([
        ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
        ('clf', MultinomialNB())
        ])
    text_clf = text_clf.fit(X_train, y_train)
    stuff = text_clf.predict_proba([word])
    pronounceability = round(100*stuff[0][1], 2)
    models.Word(word=word, pronounceability=pronounceability)
    models.commit()
    return pronounceability

I then call this in my app.py:

from flask import Flask, render_template, jsonify, request
from rq import Queue
from rq.job import Job
from worker import conn
from flask_cors import CORS
from utilities import check_pronounceability

app = Flask(__name__)

q = Queue(connection=conn)

import models
@app.route('/api/word', methods=['POST', 'GET'])
@models.db_session
def check():
    if request.method == "POST":
        word = request.form['word']
        if not word:
            return render_template('index.html')
        db_word = models.Word.get(word=word)
        if not db_word:
            job = q.enqueue_call(check_pronounceability, args=(word,))
            return jsonify(job=job.id)
        return jsonify(pronounceability=db_word.pronounceability)

After reading the python-rq performance notes, which state:

A pattern you can use to improve the throughput performance for these kind of jobs can be to import the necessary modules before the fork.

I made my worker.py file look like this:

import os

import redis
from rq import Worker, Queue, Connection

listen = ['default']

redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')

conn = redis.from_url(redis_url)
import utilities

if __name__ == '__main__':
    with Connection(conn):
        worker = Worker(list(map(Queue, listen)))
        worker.work()

The problem is that this still runs slowly. Is there something I am doing wrong? Is there any way I can make this run faster by keeping everything in memory when I'm checking a word? According to a cross-post I made in the python-rq project, it seems I am importing it correctly.

nadermx

1 Answer


I have a few suggestions:

  1. Before you start optimising the throughput of python-rq, check where the bottleneck is. I'd be surprised if the queue were the bottleneck rather than the check_pronounceability function.

  2. Make sure check_pronounceability runs as fast as it can per call; forget the queue, it's irrelevant at this stage.
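To see whether the queue or the function is the slow part, you can time a call directly. This is a minimal sketch using only the standard library; `timed` is a hypothetical helper, not part of python-rq:

```python
import time

def timed(fn, *args, **kwargs):
    # hypothetical helper: run fn once and report wall-clock time
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{fn.__name__} took {elapsed:.3f}s")
    return result

# e.g. run timed(check_pronounceability, 'hello') in a shell;
# if the bare call is already slow, the queue is not your bottleneck
```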

To optimise check_pronounceability I would suggest you:

  1. Create the training data once for all API calls.

  2. Forget train_test_split: you're not using the test split, so why waste CPU cycles creating it?

  3. Train NaiveBayes once for all API calls. The input to check_pronounceability is a single word that needs to be classified as either pronounceable or not; there's no need to create a new model for every single new word. Just create one model and reuse it for all words. This will have the benefit of producing stable results as well, and it makes it easier to change the model.

Suggested edits below

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelBinarizer
import models
import random

# use a context manager so the file handle is closed after reading
with open('words.txt') as fh:
    words = [w.strip() for w in fh if w == w.lower()]

def scramble(s):
    return "".join(random.sample(s, len(s)))

scrambled = [scramble(w) for w in words]
X = words+scrambled
# explicitly create binary labels; ravel() because LabelBinarizer returns
# a column vector and the classifier expects a 1-D array of labels
label_binarizer = LabelBinarizer()
y = label_binarizer.fit_transform(['word']*len(words) + ['unpronounceable']*len(scrambled)).ravel()

text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', MultinomialNB())
])
text_clf = text_clf.fit(X, y)
# you might want to persist the Pipeline to disk at this point to ensure it's not lost in case there is a crash    

@models.db_session
def check_pronounceability(word):
    proba = text_clf.predict_proba([word])
    # index 1 is the probability of the 'word' (pronounceable) class
    pronounceability = round(100*proba[0][1], 2)
    models.Word(word=word, pronounceability=pronounceability)
    models.commit()
    return pronounceability
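On the point about persisting the Pipeline to disk: one way to sketch it is with joblib, which ships as a scikit-learn dependency. The toy training data and the `text_clf.joblib` filename below are assumptions for illustration, standing in for the real words.txt data:

```python
from joblib import dump, load
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy data standing in for the real word lists (assumption)
X = ['hello', 'world', 'olhel', 'dowrl']
y = ['word', 'word', 'unpronounceable', 'unpronounceable']

text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', MultinomialNB()),
]).fit(X, y)

# persist the fitted pipeline so a worker can reload it after a crash
dump(text_clf, 'text_clf.joblib')
restored = load('text_clf.joblib')
```

The restored pipeline produces the same predictions as the original, so the worker can skip retraining entirely on restart.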

Final notes:

  • I assume you've done some cross-validation of the model elsewhere to verify that it actually does a good job of predicting the label probabilities; if you haven't, you should.

  • NaiveBayes in general isn't the best at producing reliable class probability predictions; it tends to be either overly confident or overly timid (probabilities close to 1 or 0). You should check for that in the DB. A LogisticRegression classifier should give you much more reliable probability estimates. Now that the model training isn't part of the API call, it doesn't really matter how long the model takes to train.
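Both final notes can be checked with cross_val_score, comparing MultinomialNB against LogisticRegression on the same pipeline. This is a sketch; the toy word list is an assumption standing in for words.txt:

```python
import random
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

random.seed(0)
# toy data standing in for words.txt (assumption)
words = ['hello', 'world', 'python', 'queue', 'worker',
         'model', 'train', 'sample', 'letter', 'sound']
scrambled = [''.join(random.sample(w, len(w))) for w in words]
X = words + scrambled
y = ['word'] * len(words) + ['unpronounceable'] * len(scrambled)

def make_clf(estimator):
    # same pipeline as the answer, with a swappable classifier
    return Pipeline([
        ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
        ('clf', estimator),
    ])

for name, est in [('MultinomialNB', MultinomialNB()),
                  ('LogisticRegression', LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(make_clf(est), X, y, cv=3)
    print(name, scores.mean())
```

With a real word list the scores (and the spread of predicted probabilities) tell you whether the model is worth trusting before you wire it into the API.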

Matti Lyra