
I'm a numpy baby and am looking at using numpy.vectorize() to compute a distance matrix. I think that a key part of this is the signature param, but when I run the code below I get an error:

import numpy as np
from scipy.spatial.distance import jaccard

# Find Jaccard dissimilarities for a constant 1-row * m-column array vs each array in an
# n-row * m-column nested array, outputting a 1-row * n-column array of dissimilarities
vectorised_compute_jac = np.vectorize(jaccard, signature = '(m),(n,m)->(n)')

array_list = [[1, 2, 3], #arrA
              [2, 3, 4], #arrB
              [4, 5, 6]] #arrC

distance_matrix = np.array([])
for target_array in array_list:
    print (target_array)
    print (array_list)
    #row should be an array of jac distances between target_array and each array in array_list
    row = vectorised_compute_jac(target_array , array_list)
    print (row, '\n\n') 
    # np.vectorize() functions return an array of objects of the type specified by the otypes param, based on docs
    np.append(distance_matrix, row)

Output + Error:

[1, 2, 3]
[[1, 2, 3], [2, 3, 4], [4, 5, 6]]
Traceback (most recent call last):

  File "C:\Users\u03132tk\.spyder-py3\ModuleMapper\untitled1.py", line 21, in <module>
    row = vectorised_compute_jac(array, array_list)

  File "C:\ANACONDA3\lib\site-packages\numpy\lib\function_base.py", line 2163, in __call__
    return self._vectorize_call(func=func, args=vargs)

  File "C:\ANACONDA3\lib\site-packages\numpy\lib\function_base.py", line 2237, in _vectorize_call
    res = self._vectorize_call_with_signature(func, args)

  File "C:\ANACONDA3\lib\site-packages\numpy\lib\function_base.py", line 2277, in _vectorize_call_with_signature
    results = func(*(arg[index] for arg in args))

  File "C:\ANACONDA3\lib\site-packages\scipy\spatial\distance.py", line 893, in jaccard
    v = _validate_vector(v)

  File "C:\ANACONDA3\lib\site-packages\scipy\spatial\distance.py", line 340, in _validate_vector
    raise ValueError("Input vector should be 1-D.")

ValueError: Input vector should be 1-D.

What I would like (square brackets indicating numpy arrays, not lists), based on the array output types discussed in the comments above:

  #arrA    #arrB   #arrC
[[JD(AA), JD(AB), JD(AC)],   #arrA
 [JD(BA), JD(BB), JD(BC)],   #arrB
 [JD(CA), JD(CB), JD(CC)]]   #arrC

Can someone advise how the signature param works and whether that's causing my woes? I suspect it's due to the (n,m) in my signature, as it's the only multi-dimensional thing, hence the question :(

Cheers! Tim

Tim Kirkwood
  • Hi, you're aware that `np.vectorize` is not really vectorized, right? Then if you're fine with explicit loops, there is `[jaccard(arr_1, arr_2) for arr_1 in array_list for arr_2 in array_list]` which you can reshape later. – Mustafa Aydın Dec 28 '21 at 12:57
  • As I understand, this is a symmetric metric; so you can save time by doing only the ~half of it. You can also skip the trivial `d(x, x) = 0` ones, i.e., when the arrays are the same. The number of `jaccard` calls then reduces to *n(n-1)/2* from *n^2*. – Mustafa Aydın Dec 28 '21 at 12:59
  • Hi thanks for your reply! Yea I saw it was meant to be basically a for loop but I was hoping it was doing some behind the scenes magic to be a bit quicker :( I'd quite like to learn to vectorise more and i swear a vectorised solution should be fairly doable haha, ill keep thinking! – Tim Kirkwood Dec 28 '21 at 13:05
  • The `signature` mode of `np.vectorize` is even slower than the basic scalar mode. `np.vectorize` is NOT the numpy vectorization that gives it so much speed and power. – hpaulj Dec 28 '21 at 16:04

1 Answer


I was going to run your code as is, but then saw that you were misusing np.append (it returns a new array rather than modifying distance_matrix in place, so the result of each call was being discarded). So I'll skip your iteration and try to recreate the calculation with straightforward list comprehensions.
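
As an aside, here is a minimal sketch of how that accumulation could look if you do want to keep an explicit loop (this is not the route taken below): collect the rows in a plain Python list and convert once at the end, since np.append never grows the array in place.

import numpy as np
from scipy.spatial.distance import jaccard

array_list = [[1, 2, 3],   # arrA
              [2, 3, 4],   # arrB
              [4, 5, 6]]   # arrC

rows = []
for target_array in array_list:
    # one row of Jaccard dissimilarities: target_array vs every array in the list
    rows.append([jaccard(target_array, other) for other in array_list])

distance_matrix = np.array(rows)   # shape (3, 3)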

It looks like jaccard takes two 1-D arrays and returns a scalar, and you apparently want to calculate it for all pairs from your list of arrays.

In [5]: arr = np.array(array_list)
In [6]: [jaccard(arr[0],b) for b in arr]
Out[6]: [0.0, 1.0, 1.0]
In [7]: [[jaccard(a,b) for b in arr] for a in arr]
Out[7]: [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
In [9]: np.array(_)
Out[9]: 
array([[0., 1., 1.],
       [1., 0., 1.],
       [1., 1., 0.]])

With symmetry and 0s it should be possible to cut down on the jaccard calls with a more selective iteration. But I'll leave that for others.
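
For illustration, since that is left as an exercise here, one way such a selective iteration might look (a sketch, not a tuned implementation): fill only the upper triangle and mirror it, leaving the diagonal at 0.

import numpy as np
from scipy.spatial.distance import jaccard

arr = np.array([[1, 2, 3],
                [2, 3, 4],
                [4, 5, 6]])

n = len(arr)
dm = np.zeros((n, n))              # diagonal stays 0 because jaccard(x, x) == 0
for i in range(n):
    for j in range(i + 1, n):      # n*(n-1)/2 calls instead of n**2
        dm[i, j] = dm[j, i] = jaccard(arr[i], arr[j])

For what it's worth, scipy.spatial.distance.pdist(arr, metric='jaccard') computes this condensed half-matrix in one call, and squareform turns it back into the full square matrix.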

With your signature, you are telling vectorize to pass a 1-D array and a 2-D array to jaccard, and to expect a 1-D array back. That's not right: jaccard itself only accepts 1-D vectors, which is exactly what the "Input vector should be 1-D." error is complaining about.

This is, I think, the correct use of vectorize:

In [12]: vectorised_compute_jac = np.vectorize(jaccard, signature='(m),(m)->()')
In [13]: vectorised_compute_jac(arr[None,:,:],arr[:,None,:])
Out[13]: 
array([[0., 1., 1.],
       [1., 0., 1.],
       [1., 1., 0.]])
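
In case the None indexing is opaque, here is my reading of why that call produces the full pairwise matrix (the broadcasting detail is not spelled out above):

import numpy as np
from scipy.spatial.distance import jaccard

arr = np.array([[1, 2, 3],
                [2, 3, 4],
                [4, 5, 6]])

# core signature '(m),(m)->()': each individual call to jaccard gets two 1-D vectors
vectorised_compute_jac = np.vectorize(jaccard, signature='(m),(m)->()')

# arr[None, :, :] has shape (1, 3, 3) and arr[:, None, :] has shape (3, 1, 3).
# After stripping the core dimension (m = 3), the loop dimensions (1, 3) and
# (3, 1) broadcast to (3, 3), so jaccard is called once per (i, j) pair.
out = vectorised_compute_jac(arr[None, :, :], arr[:, None, :])
print(out.shape)   # (3, 3)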

Compare its time against the nested comprehension:

In [14]: timeit vectorised_compute_jac(arr[None,:,:],arr[:,None,:])
384 µs ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [15]: timeit np.array([[jaccard(a,b) for b in arr] for a in arr])
203 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [15] it's the jaccard calls that dominate the time, not the iteration mechanism: nine calls at roughly 21 µs each (see In [17] below) account for most of the 203 µs. So taking advantage of the symmetry will be worth it.

In [17]: timeit jaccard(arr[0],arr[1])
21.2 µs ± 79.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
hpaulj