
I have 20,000 documents that I want to compute the true Jaccard similarity for, so that I can later check how accurately MinWise hashing approximates it.

Each document is represented as a column in a numpy matrix, where each row is a word that either appears in the document (entry = 1) or does not (entry = 0). There are ~600 words (rows).

So for example column 1 would be [1 0 0 0 0 0 1 0 0 0 1 0], which means words 1,7,11 appeared in it and no others.

Is there a more efficient way to compute the similarity besides my element-wise comparison approach? I don't see how I could use sets to improve the speed since the sets just become (0,1), but as it stands the code is impossibly slow.

import numpy as np

#load file into python
rawdata = np.loadtxt("myfile.csv",delimiter="\t")
#Convert the documents from rows to columns
rawdata = np.transpose(rawdata)
#compute true jaccard similarity
ndocs = rawdata.shape[1]
nwords = rawdata.shape[0]
nfeats = nwords   #number of features = number of words; used in jaccard() below
tru_sim = np.zeros((ndocs,ndocs))

#computes jaccard similarity of 2 documents
def jaccard(c1, c2):
    n11 = sum((c1==1)&(c2==1))
    n00 = sum((c1==0)&(c2==0))
    jac = n11 / (nfeats-n00)
    return (jac)

for i in range(0,ndocs):
    tru_sim[i,i]=1
    for j in range(i+1,ndocs):
        tru_sim[i,j] = jaccard(rawdata[:,i],rawdata[:,j])
    Have you seen [scipy.spatial.distance.jaccard](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html)? Use [`scipy.spatial.distance.pdist`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html) with `metric='jaccard'`. Subtract that from 1 to get the similarity. – Warren Weckesser Nov 13 '16 at 23:27
  • Another good suggestion, especially since you can use scipy.spatial.distance.squareform to get the matrix back easily (a short sketch follows below). https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html#scipy.spatial.distance.squareform – Magic8ball Nov 14 '16 at 23:57
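
For reference, here is a minimal sketch of the pdist route suggested in the comments above. The variable names mirror the question, but the small rawdata array is just a stand-in for the real (words x docs) 0/1 matrix; note that the full ndocs x ndocs similarity matrix is the same size as tru_sim in the question.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Small stand-in for the (words x docs) 0/1 matrix from the question.
rawdata = np.array([[1, 0, 1],
                    [1, 1, 0],
                    [0, 1, 0],
                    [0, 1, 1]])

# pdist expects one observation per row, so transpose and cast to bool
# so the 'jaccard' metric treats entries as set membership.
docs = rawdata.T.astype(bool)

# Condensed vector of Jaccard *distances*, one entry per document pair.
jac_dist = pdist(docs, metric='jaccard')

# Similarity = 1 - distance; squareform expands the condensed vector into
# a full ndocs x ndocs matrix (its zero diagonal becomes 1 here).
tru_sim = 1.0 - squareform(jac_dist)
print(tru_sim)   # 3x3 matrix with 1.0 on the diagonal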

2 Answers


Here's a vectorized approach -

# Get the row, col indices that are to be set in output array        
r,c = np.tril_indices(ndocs,-1)

# Use those indices to slice out respective columns
p1 = rawdata[:,c]
p2 = rawdata[:,r]

# Perform n11 and n00 vectorized computations across all indexed columns
n11v = ((p1==1) & (p2==1)).sum(0)
n00v = ((p1==0) & (p2==0)).sum(0)

# Finally, setup output array and set final division computations
out = np.eye(ndocs)
out[c,r] = n11v / (nfeats-n00v)

Alternative way to compute n11v and n00v with np.einsum -

n11v = np.einsum('ij,ij->j',(p1==1),(p2==1).astype(int))
n00v = np.einsum('ij,ij->j',(p1==0),(p2==0).astype(int))

If rawdata consists of 0s and 1s only, a simpler way to get them would be -

n11v = np.einsum('ij,ij->j',p1,p2)
n00v = np.einsum('ij,ij->j',1-p1,1-p2)

Benchmarking

Function definitions -

def original_app(rawdata, ndocs, nfeats):
    tru_sim = np.zeros((ndocs,ndocs))
    for i in range(0,ndocs):
        tru_sim[i,i]=1
        for j in range(i+1,ndocs):
            tru_sim[i,j] = jaccard(rawdata[:,i],rawdata[:,j])
    return tru_sim

def vectorized_app(rawdata, ndocs, nfeats):
    r,c = np.tril_indices(ndocs,-1)
    p1 = rawdata[:,c]
    p2 = rawdata[:,r]
    n11v = ((p1==1) & (p2==1)).sum(0)
    n00v = ((p1==0) & (p2==0)).sum(0)
    out = np.eye(ndocs)
    out[c,r] = n11v / (nfeats-n00v)
    return out

Verification and timings -

In [6]: # Setup inputs
   ...: rawdata = (np.random.rand(20,10000)>0.2).astype(int)
   ...: rawdata = np.transpose(rawdata)
   ...: ndocs = rawdata.shape[1]
   ...: nwords = rawdata.shape[0]
   ...: nfeats = 5
   ...: 

In [7]: # Verify results
   ...: out1 = original_app(rawdata, ndocs, nfeats)
   ...: out2 = vectorized_app(rawdata, ndocs, nfeats)
   ...: print np.allclose(out1,out2)
   ...: 
True

In [8]: %timeit original_app(rawdata, ndocs, nfeats)
1 loops, best of 3: 8.72 s per loop

In [9]: %timeit vectorized_app(rawdata, ndocs, nfeats)
10 loops, best of 3: 27.6 ms per loop

Some magical 300x+ speedup there!

So, why is it so fast? There are lots of factors involved, the most important being that NumPy arrays are built for performance and optimized for vectorized computations. The proposed approach harnesses that quite well, hence the speedup.

Here's one related Q&A that talks about these performance criteria in detail.

Divakar
  • My data does consist of only 1s and 0s. Can you explain why this is more computationally efficient than the approach I used? – Magic8ball Nov 13 '16 at 22:59
  • @Magic8ball Added runtime test and some comments on why it's efficient. Check it out! – Divakar Nov 13 '16 at 23:22
  • 1
    Thanks for your feedback. Right now I'm getting a MemoryError on the step p1 = rawdata[:,c] , since it is an array with about 232 million entries, so I'm not sure if this particular code is scalable to my project but the ideas are helpful. – Magic8ball Nov 13 '16 at 23:48
  • @Magic8ball Ah wow! So, without falling back to your loopy approach, you can work in the middle ground by processing the data in chunks, if you can get through the first step of `r,c = np.tril_indices(ndocs,-1)`. Then, take chunks of indices from `r` and `c` and use a loop to process them (a sketch of this appears after these comments). – Divakar Nov 13 '16 at 23:55
  • Mind including scipy.spatial.distance.pdist in your benchmark as well? – Has QUIT--Anony-Mousse Nov 14 '16 at 06:48
  • @Anony-Mousse Well with `pdist` version I am not sure how to incorporate `nfeats` as a param. But, I tried without the param and `pdist` had `11.2 ms per loop` against `26.8 ms per loop` for the vectorized one. So, a definite improvement there, if only we could figure out how to use `pdist` with that param. – Divakar Nov 14 '16 at 07:07
  • What's `n_feats`, and why do you need it? https://en.m.wikipedia.org/wiki/Jaccard_index Also, in your benchmark, doesn't this say "vectorized" is 3x *slower*? – Has QUIT--Anony-Mousse Nov 14 '16 at 07:17
  • @Anony-Mousse I am not familiar with the Jaccard index, but all I know is that the OP has it as a param :) Where do you see that "vectorized" is 3x slower? – Divakar Nov 14 '16 at 07:38
  • Oh, sorry, I missed the unit. 8 ms rather than 8 s would have been faster. But nevertheless, `nfeats` should be `nwords`. Also beware that your benchmark may be misleading when `nfeats` is really large and the data is really sparse. – Has QUIT--Anony-Mousse Nov 14 '16 at 13:11
  • @Anony-Mousse is correct, I used "nfeats" (number of features) but nwords is clearer. – Magic8ball Nov 14 '16 at 22:05
  • @Magic8ball So, `nfeats = nwords`? – Divakar Nov 14 '16 at 22:06
  • 1
    @Divakar Yes nfeats = nwords. – Magic8ball Nov 14 '16 at 22:48
  • 2
    @Anony-Mousse is correct. This doesn't scale with features due to huge memory consumption. – beginner_ Apr 03 '18 at 16:19
  • Memory issues? For datasets in the 100k range or more, try MinHash LSH for an approximation of Jaccard: https://github.com/mattilyra/LSH/blob/master/examples/Introduction.ipynb – fjsj Dec 07 '18 at 04:05
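
Following up on the chunking suggestion in the comments, here is a minimal sketch (not from the original answer) of vectorized_app processed in blocks of index pairs. chunk_size is an arbitrary tuning knob, nfeats is written as nwords per the discussion above, and numpy is assumed to be imported as np as in the answer.

def vectorized_chunked(rawdata, ndocs, nwords, chunk_size=50_000):
    # Same computation as vectorized_app, but only chunk_size column
    # pairs are materialized at a time, which bounds peak memory.
    data = rawdata.astype(np.uint8)    # compact copy; np.loadtxt returns float64
    r, c = np.tril_indices(ndocs, -1)  # all pair indices (this step itself must fit in memory)
    out = np.eye(ndocs)
    for start in range(0, r.size, chunk_size):
        rc = r[start:start + chunk_size]
        cc = c[start:start + chunk_size]
        p1 = data[:, cc]               # (nwords, <=chunk_size) slices
        p2 = data[:, rc]
        n11v = ((p1 == 1) & (p2 == 1)).sum(0)
        n00v = ((p1 == 0) & (p2 == 0)).sum(0)
        out[cc, rc] = n11v / (nwords - n00v)
    return out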

To compute Jaccard, use:

import numpy as np

def jaccard(x, y):
    # Casting is not necessary if you keep your data in boolean arrays already.
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    # |intersection| / |union|
    return np.bitwise_and(x, y).sum() / np.bitwise_or(x, y).sum()

print(jaccard([1, 1, 0, 0, 0], [0, 1, 0, 0, 1]))
# 0.3333333333333333
Has QUIT--Anony-Mousse