
For NLTK it would be something like:

from nltk.tokenize import word_tokenize

def symm_similarity(textA, textB):
    # Tokenize each text and reduce it to its set of unique word types
    textA = set(word_tokenize(textA))
    textB = set(word_tokenize(textB))
    intersection = len(textA.intersection(textB))
    difference = len(textA.symmetric_difference(textB))
    return intersection / float(intersection + difference)
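
For instance (a toy illustration of what it returns):

print(symm_similarity("the cat sat", "the cat ran"))  # 2 shared types, 2 differing -> 0.5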

Since spaCy is faster, I'm trying to do it in spaCy, but the token objects don't seem to offer a quick solution to this. Any ideas?

Thanks all.

negfrequency

1 Answer


Your function gets the percentage of word types shared, not tokens. You're taking the set of words, without sensitivity to their counts.
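
For example, repeated words don't change the result, because the sets collapse them (a toy illustration using your function above):

print(symm_similarity("the the cat", "the cat cat"))  # both sets are {'the', 'cat'} -> 1.0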

If you want counts of tokens, I expect the following to be very fast, so long as you have the vocabulary file loaded (which it will be by default, if you have the data installed):

from collections import Counter
from spacy.attrs import ORTH

def symm_similarity_types(nlp, textA, textB):
    docA = nlp.make_doc(textA)
    docB = nlp.make_doc(textB)
    # count_by(ORTH) maps each token's integer string ID to its frequency in the doc
    countsA = Counter(docA.count_by(ORTH))
    countsB = Counter(docB.count_by(ORTH))
    # note: Counter subtraction keeps only positive count differences
    diff = sum(abs(val) for val in (countsA - countsB).values())
    return diff / (len(docA) + len(docB))
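
Calling it would look roughly like this (a minimal sketch; the 'en' shortcut is an assumption, substitute whichever English data you have installed):

import spacy

nlp = spacy.load('en')  # assumed model name
print(symm_similarity_types(nlp, u'the cat sat on the mat', u'the cat sat'))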

If you want to compute exactly the same thing as your code above, here's the spaCy equivalent. The Doc object lets you iterate over Token objects. You should then base your counts on the token.orth attribute, which is the integer ID of the string. I expect working with integers will be a bit faster than sets of strings:

def symm_similarity_types(nlp, textA, textB):
    # Collect the integer IDs (token.orth) of the unique tokens in each text
    docA = set(w.orth for w in nlp(textA))
    docB = set(w.orth for w in nlp(textB))
    intersection = len(docA.intersection(docB))
    difference = len(docA.symmetric_difference(docB))
    return intersection / float(intersection + difference)

This should be a bit more efficient than the NLTK version, because you're working with sets of integers, not strings.
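
If it helps to see the two attributes side by side (illustrative; assumes nlp is the pipeline loaded above):

doc = nlp(u'apple')
print(doc[0].orth)   # the integer ID of the token's text
print(doc[0].orth_)  # the string itself, u'apple'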

If you're really concerned about efficiency, it's often more convenient to just work in Cython, instead of trying to guess what Python is doing. Here's the basic loop:

# cython: infer_types=True
for token in doc.c[:doc.length]:
    orth = token.lex.orth

doc.c is a TokenC*, so you're iterating over contiguous memory and dereferencing a single pointer (token.lex is a const LexemeC*).
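
As a rough sketch of how that loop could sit inside a complete Cython function (the cimport path comes from spaCy's .pxd files, so check it against the version you're building against):

# cython: infer_types=True
from spacy.tokens.doc cimport Doc

def unique_orths(Doc doc):
    # Walk the contiguous TokenC array and collect each token's integer orth ID
    orths = set()
    for token in doc.c[:doc.length]:
        orths.add(token.lex.orth)
    return orths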

syllogism_
  • Thanks syllogism! You are correct, I was doing words, but for my application either words or tokens work fine. This will definitely be useful to a lot of people other than myself as well. Appreciate your help! – negfrequency Jan 04 '17 at 19:10