I am trying to calculate the phonetic distance between every pair of words in a document. My typical document is on the order of 30,000 unique words, which pairwise is roughly 450,000,000 (n*(n-1)/2) combinations to compute. I am using pyphonetics and RefinedSoundex to compute the distance, which is rather quick and computes a single distance in a few microseconds; a full document would then take around five hours. I've been trying to parallelize this, but I can't figure out why it isn't working. As the default, no-multiprocessing baseline I'm using a list comprehension. For multiprocessing I've tried futures.ProcessPoolExecutor and ray, and both seem to do much worse than the list comprehension; I don't understand why they don't do better. My code is below, which uses just the text of this post as the benchmark.
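For scale, here is the arithmetic behind those numbers as a quick back-of-the-envelope check (the 40 µs per-pair figure below is just what a five-hour total over ~450 million pairs implies, not a measurement):

import math

n = 30_000
pairs = math.comb(n, 2)             # n*(n-1)/2 = 449,985,000 unordered pairs
per_pair = 40e-6                    # assumed seconds per distance call, implied by a ~5 hour total
print(f"{pairs:,} pairs, roughly {pairs * per_pair / 3600:.1f} hours if computed serially")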
from multiprocessing import Pool
import time
import string
import numpy as np
from concurrent import futures
import pandas as pd
from itertools import combinations
from pyphonetics import RefinedSoundex

import ray
ray.init(num_cpus=6)

rs = RefinedSoundex()

# plain function used by the serial baseline and by ProcessPoolExecutor
def distance(a):
    # time.sleep(1)
    return rs.distance(a[0], a[1])

# same computation wrapped as a ray remote task
@ray.remote
def distance2(a):
    # time.sleep(1)
    return rs.distance(a[0], a[1])

def bench(f):
    s = "I am trying to calculate the phonetic distance between every word in a document. \
    My typical document is on the order of thirty thousand words, pairwise that is on \
    the order of five hundred million combinations to compute. I am using pyphonetics \
    and RefinedSoundex to compute the distance, which is rather quick and computes a \
    single distance in a few microseconds. A full document would then take around \
    five hours. I've been trying to parallelize this, but I can't figure out why it \
    isn't working. For default, no multiprocessing, I'm using list comprehension. \
    For multiprocessing, I've tried futures.ProcessPoolExecutor and ray, both seem \
    to do much worse than list comprehension. I don't understand why they don't do \
    better. My code is below, which uses just the text of this post to benchmark."
    s = s.translate(str.maketrans('', '', string.punctuation)).split()
    c = combinations(s, 2)

    start = time.time()
    f(c)
    elapsed = time.time() - start
    print(f"{f.__name__} completed in {elapsed} seconds")

# baseline: plain list comprehension, no multiprocessing
def listcomp(c):
    [distance(a) for a in c]

# ray: one remote task per pair
def rayray(c):
    # ray.init(num_cpus=6)
    ray.get([distance2.remote(a) for a in c])

# concurrent.futures: one ProcessPoolExecutor task per pair
def concurrent(c):
    with futures.ProcessPoolExecutor() as executor:
        list(executor.map(distance, c))

bench(listcomp)
bench(concurrent)
bench(rayray)
Which outputs:
listcomp completed in 16.1796932220459 seconds
concurrent completed in 102.94504928588867 seconds
rayray completed in 163.66783547401428 seconds
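For reference on the benchmark's scale, the post text above is only about 150 words once punctuation is stripped, so each bench() call computes roughly 11,000 distances (the word count here is approximate):

import math

words = 150                         # approximate word count of the benchmark string
print(math.comb(words, 2))          # about 11,175 pairs per bench() call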