Background
I am trying to judge whether a phrase is semantically related to other words found in a corpus using Gensim. For example, here is the corpus before tokenization, one document per line:
**Corpus**
Car Insurance
Car Insurance Coverage
Auto Insurance
Best Insurance
How much is car insurance
Best auto coverage
Auto policy
Car Policy Insurance
My code (based on this Gensim tutorial) judges the semantic relatedness of a phrase by computing its cosine similarity against every string in the corpus.
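To make the metric concrete, here is a minimal sketch of cosine similarity on raw word counts, in plain Python. It is only an illustration: my actual code compares vectors in LSI space rather than raw counts, and the function name `cosine_similarity` is my own, not part of Gensim.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (dict-like)."""
    dot = sum(count * b.get(token, 0) for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


query = Counter("car insurance".split())
same_words = Counter("car insurance".split())
no_shared_words = Counter("auto policy".split())

# Identical word counts give a score of (approximately) 1,
# while documents that share no words give exactly 0.
print(cosine_similarity(query, same_words))
print(cosine_similarity(query, no_shared_words))
```

Because the score depends only on the direction of the vectors, not their length, a short query and a long document can still score close to 1.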
Problem
It seems that if a query contains ANY of the terms found in my dictionary, the phrase is judged to be semantically similar to the corpus (e.g. **Giraffe Poop Car Murderer** has a cosine similarity of 1 but SHOULD be semantically unrelated). I am not sure how to solve this issue.
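As far as I can tell, part of what is happening is that `doc2bow` silently ignores any token that is not in the dictionary, so the query collapses to the single known word. The sketch below reproduces that step in plain Python. `to_bow` is a hypothetical stand-in for `dictionary.doc2bow`, not Gensim itself, and the vocabulary is rebuilt from the corpus listed above by lowercasing and keeping words that occur more than once.

```python
from collections import Counter

corpus_lines = [
    "Car Insurance",
    "Car Insurance Coverage",
    "Auto Insurance",
    "Best Insurance",
    "How much is car insurance",
    "Best auto coverage",
    "Auto policy",
    "Car Policy Insurance",
]

# Lowercase, split on whitespace, and keep only words that occur more than once
texts = [line.lower().split() for line in corpus_lines]
frequency = Counter(token for text in texts for token in text)
vocabulary = {token for token, n in frequency.items() if n > 1}


def to_bow(tokens, vocabulary):
    """Stand-in for dictionary.doc2bow: count known tokens, drop the rest."""
    return Counter(t for t in tokens if t in vocabulary)


query = "giraffe poop car murderer".lower().split()
bow = to_bow(query, vocabulary)
dropped = [t for t in query if t not in vocabulary]

print(sorted(vocabulary))  # ['auto', 'best', 'car', 'coverage', 'insurance', 'policy']
print(dict(bow))           # {'car': 1}
print(dropped)             # ['giraffe', 'poop', 'murderer']
```

So by the time the query reaches the model, it is indistinguishable from the one-word query "car".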
Code
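from collections import defaultdict
from gensim import corpora, models, similarities

# Tokenize the corpus: drop stop words, then drop words that appear only once.
# `documents` is assumed to hold one list of lowercase tokens per corpus line;
# `stoplist` is a set of stop words. Both are defined elsewhere.
texts = [[word for word in document if word not in stoplist]
         for document in documents]

frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]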
dictionary = corpora.Dictionary(texts)
# doc2bow counts the number of occurrences of each distinct word, converts the word
# to its integer word id and returns the result as a sparse vector
corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())
# convert the query to LSI space
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])
# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
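
One workaround I can imagine, sketched here as a heuristic rather than anything Gensim provides, is to scale each similarity by the fraction of query tokens that the dictionary actually knows. The helper names `query_coverage` and `scale_by_coverage` are hypothetical, the vocabulary set is hard-coded for illustration (with Gensim it could presumably come from `dictionary.token2id`), and the scores in `example_sims` are made-up placeholders in the same `(index, score)` shape as `sims` above.

```python
def query_coverage(tokens, vocabulary):
    """Fraction of query tokens that appear in the vocabulary."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in vocabulary)
    return known / len(tokens)


def scale_by_coverage(sims, coverage):
    """Down-weight (doc_index, score) pairs by the query's coverage."""
    return [(doc_id, score * coverage) for doc_id, score in sims]


# Words that survive the frequency filter for the corpus above
vocabulary = {"auto", "best", "car", "coverage", "insurance", "policy"}
query = "giraffe poop car murderer".lower().split()

coverage = query_coverage(query, vocabulary)  # only 1 of the 4 tokens is known
example_sims = [(0, 1.0), (6, 0.5)]           # placeholder scores, not real output

print(coverage)                                   # 0.25
print(scale_by_coverage(example_sims, coverage))  # [(0, 0.25), (6, 0.125)]
```

This would not make the model understand unknown words. It only penalizes queries that are mostly out of vocabulary, so a threshold on the scaled score could reject phrases like the one above.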