
I'm trying to do some text processing on around 200,000 entries in an SQLite database, which I'm accessing using SQLAlchemy. I'd like to parallelize it (I'm looking at Parallel Python), but I'm not sure exactly how to do it.

I want to commit the session each time an entry is processed, so that if I need to stop the script I won't lose the work it's already done. However, when I try to pass the session.commit() command to the callback function, it does not seem to work.

from assignDB import *
from sqlalchemy.orm import sessionmaker
import pp, sys, fuzzy_substring

def matchIng(rawIng, ingreds):
    maxScore = 0
    choice = ""
    refIng = None  # avoid an unbound name if nothing scores above zero
    for (ingred, parentIng) in ingreds.iteritems():
        score = len(ingred)/(fuzzy_substring(ingred,rawIng)+1)
        if score > maxScore:
            maxScore = score
            choice = ingred
            refIng = parentIng
    return (refIng, choice, maxScore)

def callbackFunc(match, session, inputTuple):
    print inputTuple
    match.refIng_id = inputTuple[0]
    match.refIng_name = inputTuple[1]
    match.matchScore = inputTuple[2]
    session.commit()

# tuple of all parallel python servers to connect with
ppservers = ()
#ppservers = ("10.0.0.1",)

if len(sys.argv) > 1:
    ncpus = int(sys.argv[1])
    # Creates jobserver with ncpus workers
    job_server = pp.Server(ncpus, ppservers=ppservers)
else:
    # Creates jobserver with automatically detected number of workers
    job_server = pp.Server(ppservers=ppservers)

print "Starting pp with", job_server.get_ncpus(), "workers"

ingreds = {}
for synonym, parentIng in session.query(IngSyn.synonym, IngSyn.parentIng): 
    ingreds[synonym] = parentIng

jobs = []
for match in session.query(Ingredient).filter(Ingredient.refIng_id == None):
    rawIng = match.ingredient
    jobs.append((match, job_server.submit(matchIng, (rawIng, ingreds), (fuzzy_substring,), callback=callbackFunc, callbackargs=(match, session))))

The session is imported from assignDB. I'm not getting any error; it's just not updating the database.

Thanks for your help.

UPDATE: Here is the code for fuzzy_substring:

def fuzzy_substring(needle, haystack):
    """Calculates the fuzzy match of needle in haystack,
    using a modified version of the Levenshtein distance
    algorithm.
    The function is modified from the levenshtein function
    in the bktree module by Adam Hupp"""
    m, n = len(needle), len(haystack)

    # base cases
    if m == 1:
        return not needle in haystack
    if not n:
        return m

    row1 = [0] * (n+1)
    for i in range(0,m):
        row2 = [i+1]
        for j in range(0,n):
            cost = ( needle[i] != haystack[j] )

            row2.append( min(row1[j+1]+1, # deletion
                               row2[j]+1, #insertion
                               row1[j]+cost) #substitution
                           )
        row1 = row2
    return min(row1)

which I got from here: Fuzzy Substring. In my case, the "needle" is one of ~8000 possible choices, while the "haystack" is the raw string I'm trying to match. I loop over all possible "needles" and choose the one with the best score.
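
For a concrete feel for the scoring (a quick illustration using the function above): when the needle occurs exactly in the haystack, the distance is 0, so the score formula in matchIng reduces to the needle's length, which favors longer exact matches:

>>> fuzzy_substring("flour", "all-purpose flour")  # exact occurrence
0
>>> len("flour") / (0 + 1)  # the score formula from matchIng
5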

abroekhof

2 Answers


Without looking at your specific code, it can fairly be said that:

  1. using serverless SQLite, and
  2. seeking increased write performance through parallelism

are mutually incompatible desires. Quoth the SQLite FAQ:

… However, client/server database engines (such as PostgreSQL, MySQL, or Oracle) usually support a higher level of concurrency and allow multiple processes to be writing to the same database at the same time. This is possible in a client/server database because there is always a single well-controlled server process available to coordinate access. If your application has a need for a lot of concurrency, then you should consider using a client/server database. But experience suggests that most applications need much less concurrency than their designers imagine. …

And that's even without whatever gating and ordering SQLAlchemy uses. It is also not clear at all when — if at all — the Parallel Python jobs are completing.

My suggestion: get it working correctly first, and then look for optimizations, especially since the pp secret sauce might not be buying you much even if it were working perfectly.

added in response to comment:

If fuzzy_substring matching is the bottleneck, note that it is completely decoupled from the database access. Without seeing what fuzzy_substring is doing, a good starting assumption is that you can make algorithmic improvements which may make the single-threaded program computationally feasible. Approximate string matching is a very well studied problem, and choosing the right algorithm is often far better than "throw more processors at it".

"Far better" in this sense means you end up with cleaner code, don't waste the overhead of segmenting and reassembling the problem, and have a more extensible and debuggable program at the end.
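
To illustrate the kind of algorithmic improvement meant here, consider the score formula in the question: since fuzzy_substring returns a non-negative distance, score = len(ingred)/(distance+1) can never exceed len(ingred). Visiting the needles longest-first therefore allows an early exit as soon as no remaining needle could beat the current best. A minimal sketch, assuming the matchIng inputs from the question:

def matchIngPruned(rawIng, ingreds):
    maxScore, choice, refIng = 0, "", None
    # Sort needles longest-first; in practice do this once, outside
    # the per-row loop, since ingreds doesn't change between rows.
    byLength = sorted(ingreds.iteritems(), key=lambda kv: len(kv[0]), reverse=True)
    for ingred, parentIng in byLength:
        if len(ingred) <= maxScore:
            break  # even a perfect match (distance 0) couldn't win now
        score = len(ingred)/(fuzzy_substring(ingred, rawIng)+1)
        if score > maxScore:
            maxScore, choice, refIng = score, ingred, parentIng
    return (refIng, choice, maxScore)

How much this prunes depends on the score distribution, but it costs only one sort while returning the same best score.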

msw
  • Hi, thanks for your answer--what you say makes complete sense. I have the code working and the bottleneck is the fuzzy_substring function, as it must compare (fuzzily) a raw input with ~8000 possible results. What I'm thinking now is that it might make sense to compute a number of results (maybe 30 or so) in parallel, and then do a single session.commit() and restart with another 30? – abroekhof Jul 15 '12 at 14:26
  • @abroekhof see "added" in the body of the answer. – msw Jul 15 '12 at 15:28
  • I've updated the question with the fuzzy-substring code. I looked at a couple of fuzzy substring methods, but this one seems to perform the best. A lot of the other algorithms have trouble matching a substring. I've implemented the problem subdivision I mentioned in my first comment and it seems to have a bit of a speedup. – abroekhof Jul 15 '12 at 18:27

@msw has provided an excellent overview of the problem, giving a general way to think about parallelization.

Notwithstanding these comments, here is what I got to work in the end:

from assignDB import *
from sqlalchemy.orm import sessionmaker
import pp, sys, fuzzy_substring  

def matchIng(rawIng, ingreds):
    maxScore = 0
    choice = ""
    refIng = None  # avoid an unbound name if nothing scores above zero
    for (ingred, parentIng) in ingreds.iteritems():
        score = len(ingred)/(fuzzy_substring(ingred,rawIng)+1)
        if score > maxScore:
            maxScore = score
            choice = ingred
            refIng = parentIng
    return (refIng, choice, maxScore)

# tuple of all parallel python servers to connect with
ppservers = ()
#ppservers = ("10.0.0.1",)

if len(sys.argv) > 1:
    ncpus = int(sys.argv[1])
    # Creates jobserver with ncpus workers
    job_server = pp.Server(ncpus, ppservers=ppservers)
else:
    # Creates jobserver with automatically detected number of workers
    job_server = pp.Server(ppservers=ppservers)

print "Starting pp with", job_server.get_ncpus(), "workers"

ingreds = {}
for synonym, parentIng in session.query(IngSyn.synonym, IngSyn.parentIng): 
    ingreds[synonym] = parentIng

rawIngredients = session.query(Ingredient).filter(Ingredient.refIng_id == None)
numIngredients = rawIngredients.count()
stepSize = 30

for i in range(0, numIngredients, stepSize):
    print i, "of", numIngredients   # progress indicator
    stop = min(i + stepSize, numIngredients)

    jobs = []
    for match in rawIngredients[i:stop]:
        rawIng = match.ingredient
        jobs.append((match, job_server.submit(matchIng, (rawIng, ingreds), (fuzzy_substring,))))

    job_server.wait()

    for match, job in jobs:
        inputTuple = job()
        print match.ingredient
        print inputTuple
        match.refIng_id = inputTuple[0]
        match.refIng_name = inputTuple[1]
        match.matchScore = inputTuple[2]
    session.commit()

Essentially, I've chopped the problem into chunks. After matching 30 substrings in parallel, the results are returned and committed to the database. I chose 30 somewhat arbitrarily, so there might be gains to be had in optimizing that number. It seems to have sped up a fair bit, as I'm using all 3(!) of the cores in my processor now.
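
The same chunk-then-commit pattern also maps onto the standard library's multiprocessing module, which avoids the pp dependency. A rough sketch, assuming matchIng, fuzzy_substring, and ingreds are defined at module level so that forked workers inherit them (Unix fork semantics), and where matchWorker is a hypothetical wrapper introduced here:

from multiprocessing import Pool

def matchWorker(rawIng):
    # Workers receive and return plain Python objects only; the
    # SQLAlchemy session and mapped instances never leave the main process.
    return matchIng(rawIng, ingreds)

pool = Pool()  # defaults to the number of detected cores

for i in range(0, numIngredients, stepSize):
    chunk = list(rawIngredients[i:i + stepSize])
    results = pool.map(matchWorker, [m.ingredient for m in chunk])
    for match, (refIng_id, refIng_name, matchScore) in zip(chunk, results):
        match.refIng_id = refIng_id
        match.refIng_name = refIng_name
        match.matchScore = matchScore
    session.commit()   # one commit per chunk, as above

As before, stepSize trades the amount of work lost on an interruption against commit overhead.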

abroekhof