
I'm trying to understand doc2vec and whether I can use it to solve my scenario. I want to label sentences with 1 or more tags using TaggedDocument(words, tags), but I'm unsure if my understanding is correct.

So basically, I need this to happen (or am I totally off the mark):

I create 2 TaggedDocuments

TaggedDocument(words=["the", "bird", "flew", "over", "the", "coocoos", "nest", labels=["animal","tree"])
TaggedDocument(words=["this", "car", "is", "over", "one", "million", "dollars", labels=["motor","money"])

I build my model

model = gensim.models.Doc2Vec(documents, dm=0, alpha=0.025, size=20, min_alpha=0.025, min_count=0)

Then I train my model

model.train(documents, total_examples=len(documents), epochs=1)

So when all that is done, what I expect when I execute

model.most_similar(positive=["bird", "flew", "over", "nest"])

is ["animal", "tree"], but instead I get

[('the', 0.4732949137687683), 
('million', 0.34103643894195557),
('dollars', 0.26223617792129517),
('one', 0.16558100283145905),
('this', 0.07230066508054733),
('is', 0.012532509863376617),
('cocos', -0.1093338280916214),
('car', -0.13764989376068115)]

UPDATE: when I infer

vec_model = gensim.models.Doc2Vec.load(os.path.join("save", "vec.w2v"))
infer = vec_model.infer_vector(["bird", "flew", "over", "nest"])
print(vec_model.most_similar(positive=[infer], topn=10))

I get

[('bird', 0.5196993350982666),
('car', 0.3320297598838806), 
('the',  0.1573483943939209), 
('one', 0.1546170711517334), 
('million',  0.05099521577358246),
('over', -0.0021460093557834625), 
('is',  -0.02949431538581848),
('dollars', -0.03168443590402603), 
('flew', -0.08121247589588165),
('nest', -0.30139490962028503)]

So, the elephant in the room: is doc2vec what I need to accomplish the above scenario, or should I go back to bed and have a proper think about what I'm trying to achieve in life? :)

Any help greatly appreciated

rogger2016

1 Answer


It's not clear what your goal is.

Your code examples are a bit muddled; there's no way the `TaggedDocument` constructions, as currently shown, will result in good text examples. (`words` needs to be a list of words, not a string with a bunch of comma-separated tokens.)
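
For illustration, a minimal sketch of that difference (the example sentence here is just a stand-in for demonstration):

from gensim.models.doc2vec import TaggedDocument

# Wrong: words as a single string – gensim would iterate it character by character
bad = TaggedDocument(words="the bird flew over the nest", tags=["animal"])

# Right: words as a pre-tokenized list of word strings
good = TaggedDocument(words=["the", "bird", "flew", "over", "the", "nest"], tags=["animal"])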

If you ask the model for similarities, you'll get words – if you want doc-tags, you'll have to ask the model's `docvecs` sub-property. (That is, `model.docvecs.most_similar()`.)
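
For example, a quick sketch of the distinction (assuming the gensim 3.x API, where doc-tag vectors live under model.docvecs):

# Nearest words – what the question's call returns:
model.most_similar(positive=["bird", "flew", "over", "nest"])

# Nearest doc-tags to a known tag:
model.docvecs.most_similar("animal")

# Nearest doc-tags to a freshly inferred vector:
vec = model.infer_vector(["bird", "flew", "over", "nest"])
model.docvecs.most_similar(positive=[vec])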

Regarding your training parameters: there's no good reason to change the default `min_alpha` to be equal to the starting `alpha`. A `min_count=0`, retaining all words, usually makes word2vec/doc2vec vectors worse. And the algorithm typically needs many passes over the data – usually 10 or more – rather than one.
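
As a rough sketch of more conventional settings (the exact values here are illustrative, not prescriptive):

# Keep the default decaying min_alpha, drop the rarest words, and make many passes
model = gensim.models.Doc2Vec(dm=0, size=20, min_count=2)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=20)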

But also, word2vec/doc2vec really needs bulk data to achieve its results – toy-sized tests rarely show the same beneficial properties that are possible with larger datasets.

gojomo
  • Big thanks for the reply gojomo :) So I've fixed the typos in the above code snippets (list of strings), and I've also tried to understand min_count and min_alpha a bit better. :) I re-ran the code using docvecs.most_similar() and I am indeed getting back the correct ranked labels that I expect. I'm pretty new to ML and really appreciate the feedback. Now I have to get a bigger data set with some good data to play with. My journey continues :) – rogger2016 Oct 11 '17 at 09:14
  • "It's not clear what your goal is." > I'm trying to label a sentence with labels from a similar sentence. – rogger2016 Oct 11 '17 at 10:05
  • I'm also a bit confused: if I have 100 docs with unique sentences and labels, and I run a query that exactly matches a sentence, I'd expect to get a specific label... but each time it gives me a different label... should this happen? – rogger2016 Oct 11 '17 at 21:58
  • There's inherent randomness used in Doc2Vec training/inference, so you won't get identical vectors from run to run (without extra effort), but they should be *similar*, and more so if the model/inference is being run with sufficient data & good parameters. When re-inference of a training text *doesn't* bring back as `most_similar()` that text's training tags, common reasons are: (1) needs more inference effort (esp. on small texts) – default `steps` often too few (see the sketch after these comments); (2) not preprocessing the inference text the same as in training; (3) too little data (or too large/'overfit' a model). – gojomo Oct 12 '17 at 18:25
  • (Essentially, with a 'largish' model compared to 'smallish' data, the model can get good at its training objective without generalizable forced 'densification' of the input space... & thus the same/similar texts later might, with slightly different starting randomizations, wind up with quite different vectors, especially for short texts or few iterations. More iterations, more data, or smaller vector sizes *might* help... but really you want many tens of thousands or millions of examples, even to train up ~100+ dimensional vectors. 100 examples into 20 dimensions, especially if the examples are short, is stretching it.) – gojomo Oct 12 '17 at 18:30
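
A minimal sketch of the "more inference effort" suggestion from the comments above (the values are only illustrative; `steps` was the gensim 3.x name, later renamed `epochs` in gensim 4.x):

# More inference passes tend to stabilize vectors for short texts
tokens = ["the", "bird", "flew", "over", "the", "coocoos", "nest"]
vec = model.infer_vector(tokens, steps=50)
print(model.docvecs.most_similar(positive=[vec], topn=2))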