0

I am working with 200k sentences and I want to find Jaccard similarity using minhash algorithm. but it becomes really slow because of two for loops. could someone suggest me some good implementation?

Below is my current code

from datasketch.minhash import MinHash

def eg1(data1, data2):
    m1 = MinHash()
    m2 = MinHash(enter code here)
    for d in data1:
        m1.update(d.encode('utf8'))
    for d in data2:
        m2.update(d.encode('utf8'))
    return m1.jaccard(m2)

jac_sim = []
for i_doc in range(len(shingles)-1):
    for j_doc in range(i_doc + 1, len(shingles)):
        jaccard_similarity = eg1(shingles[i_doc], shingles[j_doc])
        jac_sim.append(jaccard_similarity)
Sanket Badhe
  • 53
  • 1
  • 6

1 Answers1

0

The problem is that MinHash is calculated many times for the same input. By calculating MinHash signatures only once you should by able to save a lot of time:

signatures = []
for i_doc in range(len(shingles)):
    m = MinHash()
    for d in shingles[i_doc]:
        m.update(d.encode('utf8'))
    signatures.append(m)

jac_sim = []    
for i_doc in range(len(shingles)-1):
    for j_doc in range(i_doc + 1, len(shingles)):
        jaccard_similarity = signatures[i_doc].jaccard(signatures[j_doc])
        jac_sim.append(jaccard_similarity)
otmar
  • 386
  • 1
  • 9