
Cosine similarity between two identical sentences is only 0.7. Is my doc2vec model correct? I am using the Quora question pairs dataset available on Kaggle. In the code below, train1 is the list of first questions and train2 is the list of second questions.

from collections import namedtuple
from gensim.models import doc2vec

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, (words1, words2) in enumerate(zip(train1, train2)):
    docs.append(analyzedDocument(words1, [2 * i]))
    docs.append(analyzedDocument(words2, [2 * i + 1]))

emb = "glove.6B.100d.txt"
model = doc2vec.Doc2Vec(docs, vector_size=300, window=10, min_count=1,
                        pretrained_emb=emb)
Gautam Kumar
1 Answer


It appears you might be using gensim Doc2Vec, but out-of-the-box it doesn't support a pretrained_emb argument, and using pretrained word embeddings isn't necessarily a benefit for most applications, especially if you have adequate training documents. Also, you wouldn't normally be able to use 100-dimensional word-vectors, from somewhere else, to help seed a 300-dimensional Doc2Vec model. (I'm surprised whatever code you're using doesn't error from this mismatch.)

Typical published work using this algorithm ('Paragraph Vector') uses 10, 20 or more training passes, but (again assuming you're using gensim) you've left it at the default value of just 5.

Lowering the min_count to a non-default value of 1 usually makes results worse, as words with such few occurrences just serve as noise making the learned vectors for other documents/words less consistent.

Which two sentences are you comparing, and how?

Since the algorithm uses randomized initialization, several forms of random sampling during training, and multi-threaded training that adds further randomization of text-processing order, running Doc2Vec repeatedly on the exact same corpus won't usually produce identical results.

Having the same text appear twice in the training set, with different tags, won't necessarily produce the same vector, though the two should be similar. They generally become more similar with more training passes, but smaller documents may show more variance from text to text, or run to run, because with fewer target words they get adjusted by the model-in-progress fewer times. (The same applies when repeatedly inferring vectors for the same text after training, though adjusting the `epochs` or `alpha` parameters of infer_vector() can make results more stable from run to run.)

gojomo
  • Thank you for your help. Yes, I am using gensim. ** Can you tell me how to change the number of training passes? ** And I have set min_count to one because there are some relevant words which only appear once but define the whole sentence. – Gautam Kumar Apr 01 '18 at 19:23
  • Initialization parameter `iter` (or in later versions `epochs`) controls the number of training passes. Words that only appear once aren't going to get good word-vectors or help create good doc-vectors – the algorithm needs multiple varied examples to position the vectors in a balanced way. So if you have texts that have a single word, and that word is important, `min_count=1` will get you a vector – but it'll be pretty bad, and training all those singletons will make your other vectors worse too. – gojomo Apr 02 '18 at 18:24