I want to use phrases in Doc2Vec, and I am using gensim's Phrases for that. Doc2Vec needs tagged documents to train the model, but I cannot tag the phrases. How can I do this?

Here is my code:

text = phrases.Phrases(text)
for i in range(len(text)):
    string1 = "SENT_" + str(i)

    sentence = doc2vec.LabeledSentence(tags=string1, words=text[i])
    text[i]=sentence

print "Training model..."
model = Doc2Vec(text, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)
Majid

1 Answer

The invocation of Phrases() trains a phrase-creating-model. You later use that model on text to get back phrase-combined text.

Don't replace your original text with the trained model, as on your code's first line. Also, don't try to assign into the Phrases model, as happens in your current loop, nor access the Phrases model by integers.

The gensim docs for the Phrases class have examples of its proper use; if you follow that pattern you'll do well.

Further, note that LabeledSentence has been replaced by TaggedDocument, and its tags argument should be a list-of-tags. If you provide a string, it will see that as a list-of-one-character tags (instead of the one tag you intend).
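
To see why a bare string misbehaves, here is a small plain-Python illustration. The helper `walk_tags` is hypothetical and only mimics code that expects a list of tags and iterates over whatever it receives; it is not gensim's implementation.

```python
def walk_tags(tags):
    """Hypothetical stand-in: collect each item found by iterating over `tags`."""
    return [tag for tag in tags]

# A bare string is iterated character by character.
string_tags = walk_tags("SENT_0")

# A one-element list yields the single intended tag.
list_tags = walk_tags(["SENT_0"])

print(string_tags)
print(list_tags)
```

So wrapping the string in a list, `tags=[string1]`, is what gives each document exactly one tag.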

gojomo
  • Thanks for your answer. In this case, using `LabeledSentence` instead of `TaggedDocument` makes no difference. My problem is where and how to use Phrases with Doc2Vec. – Majid Aug 17 '16 at 04:06
  • There's nothing special about using Phrases with Doc2Vec; it's just a preprocessing step to change some word pairs into combined `word_pairs`. So I recommend you ignore the Doc2Vec aspect, avoid the specific errors I pointed out in your existing code, and match the way it's done in the examples in the gensim documentation. – gojomo Aug 17 '16 at 06:41
  • I fixed the errors you mentioned, but the main problem is that the output of `LabeledSentence` doesn't work as input to `Phrases`, so I can't construct phrases from `LabeledSentence` objects. The reverse doesn't work either, because `LabeledSentence` can't tag phrases! – Majid Aug 17 '16 at 07:52
  • Right, you should be constructing `TaggedDocument` instances using the *output* of a `Phrases` model, not the other way around. Get the `Phrases` part working, by following the example in the gensim docs. Only *after* you've confirmed that's working as expected, then take the resulting lists-of-tokens and use them as the `words` of `TaggedDocument` instances. – gojomo Aug 17 '16 at 20:33