
I have a dataset stored as a sparse weighted matrix, and I want to calculate the weighted Jaccard index between its rows for downstream grouping/clustering, inspired by this article: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36928.pdf
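
For reference, the weighted Jaccard similarity of two non-negative vectors x and y is

J(x, y) = sum_i min(x_i, y_i) / sum_i max(x_i, y_i)

which is what the function below computes element by element.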

I'm having trouble finding an efficient way to do this calculation in Python. My current function to test my hypothesis is the following:

import numpy as np

def weighted_jaccard_index(x, y):
    # accumulate the element-wise minima and maxima, then return the ratio of the sums
    mins, maxs = 0, 0
    for i in np.arange(len(x)):
        mins += np.amin([x[i], y[i]])
        maxs += np.amax([x[i], y[i]])
    return mins / maxs
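
For example, a quick sanity check on a small pair (values chosen arbitrarily for illustration):

x = np.array([1, 2, 0])
y = np.array([0, 2, 3])
weighted_jaccard_index(x, y)  # (0 + 2 + 0) / (1 + 2 + 3) = 0.33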

I then pass this function to pdist from scipy.spatial.distance as the metric:

from scipy.spatial.distance import pdist

w_j = pdist(X, weighted_jaccard_index)
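
If the full square similarity matrix is needed for the downstream grouping, the condensed vector that pdist returns can be expanded with squareform:

from scipy.spatial.distance import squareform

# pdist returns only the upper triangle as a flat vector;
# squareform turns it into the symmetric N x N matrix.
w_j_square = squareform(w_j)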

Unsurprisingly, this has serious performance problems. I am currently investigating MinHash with the datasketch package, but I am happy to take input on how this would best be achieved. I think something like this should be possible:

import numpy as np
import datasketch as ds

df_np = df_gameinfo.to_numpy()
mg = ds.WeightedMinHashGenerator(df_np.shape[1], 20000)  # dim must match the number of columns; 20000 hash samples
lsh = ds.MinHashLSH(threshold=0.2, num_perm=20000)       # num_perm must match the generator's sample size

# Create the index in LSH: one weighted MinHash per row
for i in np.arange(len(df_np)):
    vc = df_np[i]
    m_hash = mg.minhash(vc)
    lsh.insert(i, m_hash)

test_ex = df_gameinfo.iloc[9522].to_numpy()  # random observation to calculate distances for
test_ex_m = mg.minhash(test_ex)

lsh.query(test_ex_m)  # keys of rows whose estimated similarity exceeds the threshold

Potentially this could be done with vectorization in NumPy, but I am a bit unsure how to phrase it (one attempt is sketched below).

The data has 20k observations and 30 dimensions (N x M = 20,000 x 30).
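
For the per-pair computation itself, one possible vectorized phrasing (an untested sketch; the name weighted_jaccard_index_vec is just for illustration) would be:

import numpy as np

def weighted_jaccard_index_vec(x, y):
    # element-wise minima/maxima of the two vectors, then the ratio of their sums
    return np.minimum(x, y).sum() / np.maximum(x, y).sum()

This could then be dropped into pdist in place of the loop version above.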

Charmander_

1 Answer


You can use concatenate to pair up the two vectors column-wise and take the element-wise min/max along each row:

q = np.concatenate([x[:, None], y[:, None]], axis=1)  # shape (n, 2): x and y side by side
np.sum(np.amin(q, axis=1)) / np.sum(np.amax(q, axis=1))

With %%timeit -r 10 -n 10 this gives 131 µs ± 61.7 µs per loop (mean ± std. dev. of 10 runs, 10 loops each) for 100-by-10 arrays.

Your original function gives 4.46 ms ± 95.9 µs per loop (mean ± std. dev. of 10 runs, 10 loops each) for the same data.

Edit: wrapped as a function that can be passed straight to pdist:

def jacc_index(x, y):
    q = np.concatenate([x[:, None], y[:, None]], axis=1)  # pair corresponding entries of x and y
    return np.sum(np.amin(q, axis=1)) / np.sum(np.amax(q, axis=1))
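
Applied to the full data set, this can be passed to pdist exactly like the original metric (a minimal sketch, assuming X is the 20,000 x 30 NumPy array from the question):

from scipy.spatial.distance import pdist

# pdist still calls jacc_index once per pair of rows, i.e. N*(N-1)/2 ≈ 2*10^8
# calls for 20,000 rows, which is why the full run remains slow in pure Python.
w_j = pdist(X, jacc_index)
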
warped
  • Thanks pal, I will test this and get back to you! – Charmander_ Feb 26 '22 at 11:27
  • How would you run this over the dataframe in the most efficient way? Is it by using pdist()? Or something like weighted_jaccard_index(df, df.T)? – Charmander_ Feb 27 '22 at 14:41
  • @Chooan see my edit – warped Feb 27 '22 at 15:04
  • Yes, I updated the function like so, but to actually apply it to the total data, would you use pdist? – Charmander_ Feb 27 '22 at 15:08
  • Well, if you look at the documentation of pdist, you see that the function takes `w` as an argument. `w` is assumed to be a vector with the weights for each value in your arguments `x` and `y`. The code in your question computes a scalar, i.e. only one value. You could feed that in as a weight argument, but this would likely (I have not tested the function) just scale the result rather than (de)emphasise aspects of the dataset before computing the result. – warped Feb 27 '22 at 15:19
  • Yes, the approach itself and the result make sense. I am just curious whether it is possible to run this more efficiently. Currently the code for the full set takes about 68 minutes, and the final square matrix becomes quite large. I am thinking that perhaps it would be possible to do an upper-triangle construction, and also make sure that distances below a threshold are not considered. I am trying to visualise this in networkx for the full set, which currently becomes too large. – Charmander_ Feb 27 '22 at 15:34
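
A possible direction for that last comment (an untested sketch; the function name, chunk size, and 0.2 threshold are illustrative choices, not something from the answer above): compute the similarities block-wise with NumPy broadcasting, keep only the upper triangle and the pairs at or above the threshold, and hand the surviving pairs to networkx as a weighted edge list.

import numpy as np
import networkx as nx

def weighted_jaccard_edges(X, threshold=0.2, chunk=100):
    # Compare each block of rows against all rows from the block onward, so only
    # the upper triangle is ever materialised; tune `chunk` to the available memory.
    n = X.shape[0]
    edges = []
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        block = X[start:stop]                                                # (c, m)
        rest = X[start:]                                                     # (r, m)
        mins = np.minimum(block[:, None, :], rest[None, :, :]).sum(axis=2)   # (c, r)
        maxs = np.maximum(block[:, None, :], rest[None, :, :]).sum(axis=2)
        with np.errstate(invalid="ignore"):
            sim = mins / maxs                                                # NaN if both rows are all zeros
        rows, cols = np.nonzero(sim >= threshold)
        for i, j in zip(rows, cols):
            gi, gj = start + i, start + j                                    # map back to global row indices
            if gi < gj:                                                      # strict upper triangle only
                edges.append((gi, gj, sim[i, j]))
    return edges

G = nx.Graph()
G.add_weighted_edges_from(weighted_jaccard_edges(df_gameinfo.to_numpy()))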