
When I train Doc2Vec (using Gensim's Doc2Vec in Python) on a corpus of about 10k documents (each a few hundred words long) and then infer document vectors for those same documents, the inferred vectors are not at all similar to the trained document vectors. I would expect them to be at least somewhat similar.

That is, I compare model.docvecs['some_doc_id'] with model.infer_vector(documents['some_doc_id']).

Cosine distances between the trained and inferred vectors for the first few documents:

0.38277733326
0.284007549286
0.286488652229
0.173178792
0.370117008686
0.275438070297
0.377647638321
0.171194493771
0.350615143776
0.311795353889
0.342757165432

As you can see, they are not really similar. If the similarity is this poor even for the documents used in training, there is no point in trying to infer vectors for unseen documents.

Training configuration:

model = Doc2Vec(documents=documents, dm=1, size=100, window=6, alpha=0.1, workers=4, 
seed=44, sample=1e-5, iter=15, hs=0, negative=8, dm_mean=1, min_alpha=0.01, min_count=2)

Inferring:

model.infer_vector(tokens, steps=20, alpha=0.025)

Side note: the documents are always preprocessed in the same way (I checked that the same list of tokens goes into training and into inference).

I have also played around with the parameters a bit, and the results were similar. So if your suggestion is something like "try increasing or decreasing this or that training parameter", I've most likely tried it already. Maybe I just haven't come across the 'correct' parameters, though.

Thanks for any suggestions as to what I can do to make this work better.

EDIT: I am willing and able to use any other available Python implementation of paragraph vectors (doc2vec); it doesn't have to be this one, if you know of another that achieves better results.

EDIT: Minimal working example

import fnmatch
import os
from scipy.spatial.distance import cosine
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from keras.preprocessing.text import text_to_word_sequence

files = {}
folder = 'some path'  # each file contains few regular sentences
for f in fnmatch.filter(os.listdir(folder), '*.sent'):
    files[f] = open(folder + '/' + f, 'r', encoding="UTF-8").read()

documents = []
for k, v in files.items():
    words = text_to_word_sequence(v, lower=True)  # converts string to list of words, removes commas etc.
    documents.append(TaggedDocument(tags=[k], words=words))

d2 = Doc2Vec(size=200, documents=documents)

for doc in documents:
    trained = d2.docvecs[doc.tags[0]]
    inferred = d2.infer_vector(doc.words, steps=50)
    print(cosine(trained, inferred))  # cosine distance from scipy (1 - cosine similarity)

1 Answer


What is the type of your documents object, and are you sure it is a multiply-iterable object, so that the model can do all of its 16 passes over the set of TaggedDocument-shaped text examples? That is, does iter(documents) always return a fresh iterator, with every item a TaggedDocument-shaped object holding the right list-of-words in words and list-of-tags in tags? (A common error is to supply a corpus that can only be iterated over once, and then to ignore any logged hints/warnings that no real training has happened. The inference/similarity results from such a model will be essentially random.)
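
A minimal sketch of such a restartable corpus, assuming gensim 3.x as used in the question (the class name and the id_to_tokens dict are placeholders, not anything from the question):

from gensim.models.doc2vec import TaggedDocument

class RestartableCorpus:
    """Yields fresh TaggedDocument objects on every iteration, so the model
    can make as many passes over the data as it needs."""
    def __init__(self, id_to_tokens):          # hypothetical {doc_id: list-of-tokens} dict
        self.id_to_tokens = id_to_tokens

    def __iter__(self):
        for doc_id, tokens in self.id_to_tokens.items():
            yield TaggedDocument(words=tokens, tags=[doc_id])

A plain in-memory list of TaggedDocument objects (as in the MWE above) also satisfies this requirement, since a list can be iterated over any number of times; a one-shot generator does not.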

Then, for infer_vector(), does documents[tag] really return just the list-of-words it expects (not a TaggedDocument or a string)? (Users often supply strings, rather than lists-of-tokens, for training or inference words and get results that are just noise.)
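
A quick sketch of the difference, reusing the model and the steps/alpha values from the question (the variable names are illustrative):

tokens = ['some', 'example', 'document', 'text']            # what infer_vector() expects
good = model.infer_vector(tokens, steps=20, alpha=0.025)

# A raw string is iterated character by character, so every "word" becomes a
# single letter and the resulting vector is noise:
bad = model.infer_vector('some example document text', steps=20, alpha=0.025)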

Was there an evaluation-guided reason for changing various defaults, either a little (window=6, negative=8) or a lot (alpha=0.1, min_count=2)? Such tweaks may not be a major factor in your problem, and there's nothing magical about the class defaults. But until you have the basics working, it's best to stick close to a common configuration. (And even once the basics are working, limit changes to those that can be shown to be better via a repeatable scoring process.)
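
As a baseline to compare against (a sketch using the gensim 3.x parameter names from the question, not a recommendation of specific values):

from gensim.models import Doc2Vec

# Class defaults otherwise (size=100 and iter=5, per the discussion below);
# only the worker count is set explicitly here.
baseline = Doc2Vec(documents=documents, workers=4)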

Some users report needing much higher steps values – 100 or more – to get better inference results, though that matters most for very short documents (a handful to a couple dozen words) rather than the few-hundred-word documents you describe.
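
For instance (a sketch; the token list stands in for any document that was in the training set):

inferred = model.infer_vector(tokens, steps=100)        # many more inference passes
# Rank the trained document vectors against the inferred one; the document's
# own tag should appear near the top if training and inference are consistent.
print(model.docvecs.most_similar([inferred], topn=3))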

A corpus of 10k documents is on the small side for Paragraph Vectors (Doc2Vec), but with your smallish vector-size (100) and larger number of iterations (15), it might be workable.

If you're still having problems, you should expand your question with more code showing how documents is constructed, some representative example documents, and your cosine-similarity evaluation process – to check for oversights at each of those steps.

  • Ad 1/ `documents` object is a python list. Each document is just a list of words. Ad 2/ `Documents[tag]` returns what you would expect - same list of words that were used for training - I am 100% sure. Ad 3/ No guide. Like I said, I played around with it. Don't worry, I tried the defaults as well and it's not better with them. Ad 4/ I tried it with 100, too. Didn't help. Ad 5/ I have another corpus at hand, that has way more documents, but I haven't gotten around to it yet. I also tried with bigger/smaller vectors and more/less iterations. Same bad results. Ad 6/ cos. similarity is from scipy – awa993 Mar 07 '18 at 21:55
  • Ad 6/ I can prepare some MWE (or rather not working) I guess, but that will take some time. – awa993 Mar 07 '18 at 22:01
  • If `documents` is in fact a python list of lists-of-strings, that wouldn't work as the 'documents' argument to `Doc2Vec` – which would need a list-of-`TaggedDocument`-objects. And also if `documents` is a plain list, you can't []-access a list-of-words by `documents['some_doc_id']`. So your answers & code are not yet fully consistent. If you're still having problems, you should add code & observed output to your question showing how `documents` is set up, how the necessary `TaggedDocument` objects are created, and showing example `words` lists, as retrieved & passed to `infer_vector()`. – gojomo Mar 08 '18 at 00:58
  • I am simplifying for the sake of this question. It's not the same collection. For training I pass a list of TaggedDocument objects. For inferring I have a dictionary of {'doc_id': list-of-strings} from which the TaggedDocument list is created. I double-checked and debugged all of this; there's nothing wrong there. I will prepare an MWE anyway once I get home. If I were making those kinds of mistakes, it wouldn't even run: Doc2Vec throws an error if you pass it anything other than TaggedDocuments, and Python wouldn't allow documents['id'] on a list. – awa993 Mar 08 '18 at 10:19
  • Ok, I added MWE. – awa993 Mar 08 '18 at 16:01
  • The code is helpful, thanks, & looks generally correct. The default I'd change (moreso than `size`) would be train iteration-count - 10-20 common in Paragraph Vectors published work, though gensim `Doc2Vec` class inherits a measly 5 from its superclass. Further, with few training examples – and 10k still small for PV – you may need to choose *smaller* vectors (`size` even less than default 100) & benefit from even more than 20 training-passes. (Otherwise, model very prone to overfitting, & any text gets many equally-good-places to wind up post-inference – so less consistency from run-to-run.) – gojomo Mar 08 '18 at 19:42
  • But for full context, could you add to the example, & show: (1) the output of: `len(documents)`; (2) for at least the 1st 10 inferred docs, `len(doc)` & `len([w for w in doc if w in d2.wv.vocab])` & the measured cosine-distance? – gojomo Mar 08 '18 at 19:42
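
A sketch of those requested diagnostics, reusing the MWE's names (d2, documents, cosine) and reading len(doc) as the length of the document's word list:

print(len(documents))                                      # (1) corpus size
for doc in documents[:10]:                                 # (2) first 10 documents
    in_vocab = [w for w in doc.words if w in d2.wv.vocab]  # tokens the model actually knows
    inferred = d2.infer_vector(doc.words, steps=50)
    print(len(doc.words), len(in_vocab),
          cosine(d2.docvecs[doc.tags[0]], inferred))       # cosine distance, as in the MWE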