I am trying to calculate the phonetic distance between every pair of words in a document. My typical document is on the order of 30,000 unique words, which pairwise is roughly 450,000,000 (n*(n-1)/2) combinations to compute. I am using pyphonetics and RefinedSoundex to compute the distance, which is rather quick and computes a single distance in a few microseconds; a full document would then take around five hours. I've been trying to parallelize this, but I can't figure out why it isn't working. As the default, no-multiprocessing baseline I'm using a list comprehension. For multiprocessing I've tried futures.ProcessPoolExecutor and ray, and both seem to do much worse than the list comprehension; I don't understand why they don't do better. My code is below, which uses just the text of this post as the benchmark.
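For scale, here is the arithmetic behind those numbers as a quick back-of-the-envelope check (the 40 µs per-pair figure below is just what a five-hour total over ~450 million pairs implies, not a measurement):

import math

n = 30_000
pairs = math.comb(n, 2)             # n*(n-1)/2 = 449,985,000 unordered pairs
per_pair = 40e-6                    # assumed seconds per distance call, implied by a ~5 hour total
print(f"{pairs:,} pairs, roughly {pairs * per_pair / 3600:.1f} hours if computed serially")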
from multiprocessing import Pool
import time
import string
import numpy as np
from concurrent import futures
import pandas as pd
from itertools import combinations
from pyphonetics import RefinedSoundex

import ray
ray.init(num_cpus=6)

rs = RefinedSoundex()

# plain function used by the serial baseline and by ProcessPoolExecutor
def distance(a):
    # time.sleep(1)
    return rs.distance(a[0], a[1])

# same computation wrapped as a ray remote task
@ray.remote
def distance2(a):
    # time.sleep(1)
    return rs.distance(a[0], a[1])

def bench(f):
    s = "I am trying to calculate the phonetic distance between every word in a document. \
    My typical document is on the order of thirty thousand words, pairwise that is on \
    the order of five hundred million combinations to compute. I am using pyphonetics \
    and RefinedSoundex to compute the distance, which is rather quick and computes a \
    single distance in a few microseconds. A full document would then take around \
    five hours. I've been trying to parallelize this, but I can't figure out why it \
    isn't working. For default, no multiprocessing, I'm using list comprehension. \
    For multiprocessing, I've tried futures.ProcessPoolExecutor and ray, both seem \
    to do much worse than list comprehension. I don't understand why they don't do \
    better. My code is below, which uses just the text of this post to benchmark."
    s = s.translate(str.maketrans('', '', string.punctuation)).split()
    c = combinations(s, 2)

    start = time.time()
    f(c)
    elapsed = time.time() - start
    print(f"{f.__name__} completed in {elapsed} seconds")

# baseline: plain list comprehension, no multiprocessing
def listcomp(c):
    [distance(a) for a in c]

# ray: one remote task per pair
def rayray(c):
    # ray.init(num_cpus=6)
    ray.get([distance2.remote(a) for a in c])

# concurrent.futures: one ProcessPoolExecutor task per pair
def concurrent(c):
    with futures.ProcessPoolExecutor() as executor:
        list(executor.map(distance, c))

bench(listcomp)
bench(concurrent)
bench(rayray)
Which outputs:
listcomp completed in 16.1796932220459 seconds
concurrent completed in 102.94504928588867 seconds
rayray completed in 163.66783547401428 seconds
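For reference on the benchmark's scale, the post text above is only about 150 words once punctuation is stripped, so each bench() call computes roughly 11,000 distances (the word count here is approximate):

import math

words = 150                         # approximate word count of the benchmark string
print(math.comb(words, 2))          # about 11,175 pairs per bench() call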