I am working with 200k sentences and I want to find Jaccard similarity using minhash algorithm. but it becomes really slow because of two for loops. could someone suggest me some good implementation?
Below is my current code
from datasketch.minhash import MinHash
def eg1(data1, data2):
m1 = MinHash()
m2 = MinHash(enter code here)
for d in data1:
m1.update(d.encode('utf8'))
for d in data2:
m2.update(d.encode('utf8'))
return m1.jaccard(m2)
jac_sim = []
for i_doc in range(len(shingles)-1):
for j_doc in range(i_doc + 1, len(shingles)):
jaccard_similarity = eg1(shingles[i_doc], shingles[j_doc])
jac_sim.append(jaccard_similarity)