
Here is my problem:

Let's say my two arrays are:

import numpy as np
first = np.array(["hello", "hello", "hellllo"])
second = np.array(["hlo", "halo", "alle"])

Now I want to get the matrix of distances between each element of the two arrays.

For example, my distance function is:

def diff_len(string1, string2):
    return abs(len(string1) - len(string2))

So I would like to get the matrix:

         hello      hello      hellllo
hlo      result1    result2    result3
halo     result4    result5    result6
alle     result7    result8    result9

What I did was compute it row by row using NumPy's vectorize function:

vectorize_dist = np.vectorize(diff_len)

first = np.array(["hello", "hello", "hellllo"])
second = np.array(["hlo", "halo", "alle"])

# One call per row of the desired matrix:
vectorize_dist(first, "hlo")
vectorize_dist(first, "halo")
vectorize_dist(first, "alle")

# Stack the rows into the full matrix:
matrix = np.array([vectorize_dist(first, "hlo"),
                   vectorize_dist(first, "halo"),
                   vectorize_dist(first, "alle")])
matrix

array([[2, 2, 4],
       [1, 1, 3],
       [1, 1, 3]])

But to build the matrix I have to run a loop that computes it row after row, and I would like to get the whole matrix at once. My two arrays could be very large, so looping could take too much time. I also have multiple distances to compute, so I would have to run the whole procedure several times, which would be even more time consuming.
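A minimal sketch of two ways the same matrix can be obtained in one call (not from the original post): np.vectorize broadcasts its inputs, so giving the arrays orthogonal shapes produces the full matrix directly, and for this particular length-based metric the string lengths can be precomputed so plain broadcasting does all the work:

# Orthogonal shapes make np.vectorize produce the whole (3, 3) matrix at once;
# rows correspond to `second` and columns to `first`, as in the layout above.
matrix = vectorize_dist(second[:, None], first[None, :])

# For this specific metric, the Python-level loop can be avoided entirely:
lengths = np.abs(np.char.str_len(second)[:, None] - np.char.str_len(first)[None, :])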

1 Answer


You can use SciPy's cdist for that:

import numpy as np
from scipy.spatial.distance import cdist

def diff_len(string1, string2):
    return abs(len(string1) - len(string2))

first = np.array(["hello", "hello", "hellllo"])
second = np.array(["hlo", "halo", "alle"])
# cdist expects 2-D inputs, so each 1-D string array is reshaped into a
# column of 1-element "points"; a[0] and b[0] unwrap the strings again.
d = cdist(first[:, np.newaxis], second[:, np.newaxis], lambda a, b: diff_len(a[0], b[0]))
print(d.T)  # transposed so that rows correspond to `second`, as in the question
# [[2. 2. 4.]
#  [1. 1. 3.]
#  [1. 1. 3.]]

Note that cdist always returns a float matrix, so you would need to cast the output if you want integers.
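For instance, as a trivial follow-up on the matrix from above:

d = d.astype(int)  # cdist computes in float64, so cast back explicitly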

jdehesa
  • I have multiple distance matrices to compute; can I use multiprocessing to make use of all my cores and speed up the full computation? – Enzo Ramirez C. Nov 04 '20 at 15:36
  • @EnzoRamirezC. Yes, it should be possible to do that, e.g. with a multiprocessing pool, although that has an overhead, so whether you get a significant performance gain will depend on the size of the problem. I'm also not sure whether something like `cdist` already uses multiple cores. You could also explore parallel Numba (in that case without `cdist`, just using loops), although to be effective your code would need to be natively jit-compilable. – jdehesa Nov 04 '20 at 17:45
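A minimal sketch of the pool idea from the comment above; the helper `one_matrix` and the second metric `first_char_differs` are made-up names for illustration, and whether this pays off depends on the matrix sizes and the pickling overhead:

import numpy as np
from multiprocessing import Pool
from scipy.spatial.distance import cdist

def diff_len(string1, string2):
    return abs(len(string1) - len(string2))

def first_char_differs(string1, string2):  # hypothetical second metric
    return float(string1[0] != string2[0])

first = np.array(["hello", "hello", "hellllo"])
second = np.array(["hlo", "halo", "alle"])

def one_matrix(metric):
    # Each worker computes one full distance matrix with cdist.
    return cdist(first[:, np.newaxis], second[:, np.newaxis],
                 lambda a, b: metric(a[0], b[0])).T

if __name__ == "__main__":
    with Pool() as pool:  # one task per distance function
        matrices = pool.map(one_matrix, [diff_len, first_char_differs])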