
I get this error when I load the Google pre-trained word2vec vectors to train a Doc2Vec model with my own data. Here is part of my code:

from gensim.models import doc2vec

model_dm = doc2vec.Doc2Vec(dm=1, dbow_words=1, vector_size=400, window=8, workers=4)
model_dm.build_vocab(document)
model_dm.intersect_word2vec_format('home/xxw/Downloads/GoogleNews-vectors-negative300.bin', binary=True)
model_dm.train(document)

But I got this error:

'Doc2Vec' object has no attribute 'intersect_word2vec_format'

Can you help me with this error? I got the Google model from https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz, and my gensim is the latest version, I think.

Xizi Wei

1 Answer


A recent gensim refactor means Doc2Vec no longer shares a superclass that provides this method. You might be able to call the method on your model_dm.wv object instead, but I'm not sure. Otherwise, you could look at the source and mimic that code to achieve the same effect, if you really need this step.
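
If the method does live on the keyed-vectors object in your gensim version, the call might look like this sketch. That location is only an assumption to verify against your installed release; if it isn't there either, you'd have to port the logic from gensim's source:

# Sketch only: assumes model_dm was created and build_vocab() already ran, as in the question,
# and that this gensim release exposes intersect_word2vec_format() on the KeyedVectors object.
# If it doesn't, this line raises AttributeError.
model_dm.wv.intersect_word2vec_format('home/xxw/Downloads/GoogleNews-vectors-negative300.bin', binary=True)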

But note that Doc2Vec doesn't need word-vectors as input: it can learn everything it needs from your own training data. Whether word-vectors from elsewhere will help will depend on a lot of factors – and the larger your own data is, or the more unique, the less preloaded vectors from elsewhere are likely to help, or even have any residual effect when your own training is done.
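
To illustrate, a plain run with no external vectors at all is just a sketch like the following (the extra train() arguments are covered in the notes below):

# Sketch: plain PV-DM training that relies only on your own corpus
model_plain = doc2vec.Doc2Vec(dm=1, vector_size=400, window=8, workers=4)
model_plain.build_vocab(document)
model_plain.train(document, total_examples=model_plain.corpus_count, epochs=20)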

Other notes on your apparent setup:

  • dbow_words=1 will have no effect in dm=1 mode - that mode already inherently trains word-vectors. (It only has an effect in dm=0 DBOW mode, where it adds extra interleaved word-training, if you need word-vectors. Often plain DBOW, without word-vector training, is a fast and effective option; it's sketched after these notes.)

  • Recent versions of gensim require more arguments to train(), and note that typical published work with this algorithm uses 10-20 (or sometimes more) passes over the data (specified to train() via the epochs argument), rather than the default of 5 in some gensim versions; see the sketch below.
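
As a sketch of both notes above - a plain-DBOW configuration and a train() call with the explicit arguments recent gensim versions expect (epochs=20 is just an example in that 10-20 range):

# Sketch: plain DBOW (dm=0, no dbow_words) is often a fast, effective option
model_dbow = doc2vec.Doc2Vec(dm=0, vector_size=400, workers=4)
model_dbow.build_vocab(document)
# recent gensim wants the corpus size and epoch count passed explicitly
model_dbow.train(document, total_examples=model_dbow.corpus_count, epochs=20)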

gojomo
  • Thank you for your answer, that helps a lot. I changed the code to model_dm.wv.load_word2vec_format, and it works. The reason I want to try pre-trained word2vec is that my training data is very limited: just around 5,000 sentences, each containing 5-20 words. I want to do text classification on this data set using SVM, testing on 600 sentences, but the result is not much better than using tf-idf vectors; accuracy is around 0.65. Do you have any advice on that? Thank you very much. – Xizi Wei May 08 '18 at 19:34
  • I'm not sure `load_word2vec_format()` there will do what you want - it'll clobber whatever was discovered from `build_vocab()` with what was loaded, but the parent `Doc2Vec` model will still expect the original vocabulary size and 400-dimensional vectors, while those GoogleNews vectors are 300-dimensional... so I'm unsure what will happen, and would expect errors or random results. It's definitely an unsupported mode with undefined effects. – gojomo May 09 '18 at 06:42
  • With such a small dataset, I'd also try (1) getting more similar data, even if it's unlabeled; (2) smaller `size` and more `iter`. With your main focus being classification, you might also try original Facebook FastText or StarSpace, in their modes where the word/entity-vectors are tuned to predict classes. But in general such a small dataset isn't great for Word2Vec/Doc2Vec techniques, even if you can maybe get a little benefit over vectors/models trained on larger data. – gojomo May 09 '18 at 06:44