1

There were a similar question here Gensim Doc2Vec Exception AttributeError: 'str' object has no attribute 'words', but it didn't get any helpful answers.

I'm trying to train Doc2Vec on 20newsgroups corpora. Here's how I build the vocab:

from sklearn.datasets import fetch_20newsgroups
    def get_data(subset):
        newsgroups_data = fetch_20newsgroups(subset=subset, remove=('headers', 'footers', 'quotes'))
        docs = []
        for news_no, news in enumerate(newsgroups_data.data):       
            tokens = gensim.utils.to_unicode(news).split() 
            if len(tokens) == 0:
                continue
            sentiment =  newsgroups_data.target[news_no]
            tags = ['SENT_'+ str(news_no), str(sentiment)]
            docs.append(TaggedDocument(tokens, tags))
        return docs

    train_docs = get_data('train')
    test_docs = get_data('test')
    alldocs = train_docs + test_docs

    model = Doc2Vec(dm=dm, size=size, window=window, alpha = alpha, negative=negative, sample=sample, min_count = min_count, workers=cores, iter=passes)
    model.build_vocab(alldocs)

Then I train the model and save the result:

model.train(train_docs, total_examples = len(train_docs), epochs = model.iter)
model.train_words = False
model.train_labels = True
model.train(test_docs, total_examples = len(test_docs), epochs = model.iter)

model.save(output)

The problem appears when I try to load the model: screen

I tried:

  • using LabeledSentence instead of TaggedDocument

  • yielding TaggedDocument instead of appending them to the list

  • setting min_count to 1 so no word would be ignored (just in case)

Also the problem occurs on python2 as well as python3.

Please, help me solve this.

Community
  • 1
  • 1

1 Answers1

1

You've hidden the most important information – the exact code that triggers the error, and the error text itself – in the offsite (imgur) 'screen' link. (That would be the ideal text to cut & paste into the question, rather than other steps that seem to run OK, without triggering the error.)

Looking at that screenshot, there's the line:

model = Doc2Vec("20ng_infer")

...which triggers the error.

Note that none of the arguments as documented for the Doc2Vec() initialization method are a plain string, like the "20ng_infer" argument in the above line – so that's unlikely to do anything useful.

If trying to load a model that was previously saved with model.save(), you should use Doc2Vec.load() – which will take a string describing a local file path from which to load the model. So try:

model = Doc2Vec.load("20ng_infer")

(Note also that larger models might be saved to multiple files, all starting with the string you supplied to save(), and these files must be kept/moved together to again re-load() them in the future.)

gojomo
  • 52,260
  • 14
  • 86
  • 115