
Background

I am trying to judge whether a phrase is semantically related to other words found in a corpus using Gensim. For example, here are the documents in my corpus, pre-tokenized:

 **Corpus**
 Car Insurance
 Car Insurance Coverage
 Auto Insurance
 Best Insurance
 How much is car insurance
 Best auto coverage
 Auto policy
 Car Policy Insurance

My code (based on this gensim tutorial) judges the semantic relatedness of a phrase by computing cosine similarity against all strings in the corpus.

Problem

It seems that if a query contains ANY of the terms found within my dictionary, that phrase is judged as being semantically similar to the corpus (e.g. **Giraffe Poop Car Murderer** has a cosine similarity of 1 but SHOULD be semantically unrelated). I am not sure how to solve this issue.

Code

# Tokenize corpus and filter out anything that is a stop word or has a frequency of 1
texts = [[word for word in document if word not in stoplist]
        for document in documents]
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
        for text in texts]
dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word, converts
# the word to its integer word id and returns the result as a sparse vector

corpus = [dictionary.doc2bow(text) for text in texts]  
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

#convert the query to LSI space
vec_lsi = lsi[vec_bow]              
index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
user3682157

1 Answer


First of all, you are not directly comparing the cosine similarity of bag-of-words vectors: you are first reducing the dimensionality of your document vectors by applying latent semantic analysis (https://en.wikipedia.org/wiki/Latent_semantic_analysis). This is fine, but I just wanted to emphasise it. It is often assumed that the underlying semantic space of a corpus is of a lower dimensionality than the number of unique tokens. Therefore, LSA applies principal component analysis to your vector space and keeps only the directions that contain the most variance (i.e. those directions in the space along which the data changes most rapidly, and which are thus assumed to contain the most information). The number of retained dimensions is controlled by the num_topics parameter you pass to the LsiModel constructor.
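
For reference, the cosine similarity that the similarity index computes between two (dense) vectors can be sketched in plain Python; gensim's MatrixSimilarity does this in bulk with NumPy, and the function name below is just for illustration:

```python
import math

def cosine_similarity(u, v):
    # dot(u, v) / (|u| * |v|); defined as 0.0 if either vector is all zeros
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# vectors pointing the same way score close to 1.0,
# orthogonal vectors score 0.0
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # close to 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Note that cosine similarity only measures the angle between vectors, not their length, which is why a one-word query can still score near 1 against a longer document.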

Secondly, I cleaned up your code a little bit and embedded the corpus:

# Tokenize corpus and filter out anything that is a
# stop word or has a frequency of 1

from gensim import corpora, models, similarities
from collections import defaultdict

documents = [
    'Car Insurance',  # doc_id 0
    'Car Insurance Coverage',  # doc_id 1
    'Auto Insurance',  # doc_id 2
    'Best Insurance',  # doc_id 3
    'How much is car insurance',  # doc_id 4
    'Best auto coverage',  # doc_id 5
    'Auto policy',  # doc_id 6
    'Car Policy Insurance',  # doc_id 7
]

stoplist = set(['is', 'how'])

texts = [[word.lower() for word in document.split()
          if word.lower() not in stoplist]
         for document in documents]

print(texts)
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word,
# converts the word to its integer word id and returns the result
# as a sparse vector

corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

# convert the query to LSI space
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

print(sims)

If I run the above I get the following output:

[(0, 0.97798139), (4, 0.97798139), (7, 0.94720691), (1, 0.89220524), (3, 0.61052465), (2, 0.42138112), (6, -0.1468758), (5, -0.22077486)]

where every entry in that list corresponds to (doc_id, cosine_similarity) ordered by cosine similarity in descending order.

Since the only word in your query document that is actually part of your vocabulary (constructed from your corpus) is car, all other tokens are dropped. Therefore, your query to the model consists of the singleton document car. Consequently, you can see that all documents which contain car are supposedly very similar to your input query.
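
The effect of that vocabulary filtering can be illustrated without gensim; this sketch mimics what dictionary.doc2bow does with unknown tokens, using a plain dict in place of the real Dictionary (the ids below are made up for illustration):

```python
# vocabulary surviving the frequency filter, as token -> integer id
vocabulary = {'car': 0, 'insurance': 1, 'coverage': 2,
              'auto': 3, 'best': 4, 'policy': 5}

query = "giraffe poop car murderer".lower().split()

# like doc2bow, keep only the tokens the dictionary knows about
known = [token for token in query if token in vocabulary]
print(known)  # ['car'] -- the query collapses to a single word
```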

The reason why document #3 (Best Insurance) is ranked highly as well is that the token insurance often co-occurs with car (your query). This is exactly the reasoning behind distributional semantics, i.e. "a word is characterized by the company it keeps" (Firth, J. R. 1957).
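
You can check that co-occurrence directly by counting, over the corpus documents above, how many of the documents containing car also contain insurance (plain Python, no gensim needed):

```python
documents = [
    'car insurance', 'car insurance coverage', 'auto insurance',
    'best insurance', 'how much is car insurance', 'best auto coverage',
    'auto policy', 'car policy insurance',
]

# documents mentioning 'car', as token sets
docs_with_car = [set(d.split()) for d in documents if 'car' in d.split()]

# how many of those also mention 'insurance'
cooccurs = sum(1 for d in docs_with_car if 'insurance' in d)
print(cooccurs, '/', len(docs_with_car))  # 4 / 4
```

Every document containing car also contains insurance, so LSI places the two tokens close together in the latent space.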

cvangysel
  • First off, thank you very much for your answer -- the problem I have is that I don't necessarily want the other tokens dropped when comparing the query! Yes, "Giraffe Car Murderer" has car in it, but the phrase itself is nonsensical -- is there a way to get around this issue? In other words, can I keep the dropped tokens and use their lack of similarity to the corpus to weight down the cosine similarity? – user3682157 Aug 14 '15 at 18:36
  • Well, you have a data sparsity issue here. In the current case your corpus does not know anything about those garbage terms. You can add them to your vocabulary, and that will help your cause if you compute cosine similarities between raw document BoW vectors. However, you are creating an LSI model, and therefore you are removing the dimensions in your data that do not contain much information. If you add those terms to your vocabulary but there are no documents that actually contain them, then LSI will lose that information first. – cvangysel Aug 15 '15 at 15:30
  • Ideally what you want is a larger example corpus where there are documents in your corpus that contain those garbage terms. In that way the algorithm will be able to pick up the difference between documents that contain them and documents that do not contain them, and that should give a noticeable effect in the cosine similarity in latent semantic space. By the way, comparing simple BoW vectors is also a form of semantic matching. It has just been proven that it isn't as effective as LSI once your vocabulary becomes larger. – cvangysel Aug 15 '15 at 15:34
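
To make the first suggestion in these comments concrete, here is a sketch of comparing raw BoW count vectors (no LSI) with the garbage terms included in the vocabulary; the query then shares only one of its four terms with any corpus document, which caps its cosine similarity well below 1. The helper names are just for illustration:

```python
import math

def bow(tokens, vocab):
    # raw term-count vector over a fixed vocabulary
    return [tokens.count(term) for term in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# vocabulary extended with the query's garbage terms
vocab = ['car', 'insurance', 'coverage', 'auto', 'best', 'policy',
         'giraffe', 'poop', 'murderer']

doc = 'car insurance'.split()
query = 'giraffe poop car murderer'.split()

sim = cosine(bow(doc, vocab), bow(query, vocab))
print(round(sim, 3))  # ~0.354 -- only 'car' overlaps, so the score drops
```

In raw BoW space the unmatched query terms contribute to the query's norm and drag the score down, which is exactly the weighting-down effect asked about above; the catch, per the second comment, is that LSI discards those dimensions unless some corpus documents actually use the terms.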