Doc2Vec input format

Question

I'm running gensim's Doc2Vec on Ubuntu.

Doc2Vec rejects my input with the error:

AttributeError: 'list' object has no attribute 'words'

    import gensim
    from gensim.models import doc2vec as dtv
    from nltk.corpus import brown
    documents = brown.tagged_sents()
    d2vmodel = dtv.Doc2Vec(documents, size=100, window=1, min_count=1, workers=1)

I have already tried the suggestions from this SO question, and many variations, all with the same result, for example:

documents = [brown.tagged_sents()], and adding a hash function.

If the corpus is a .txt file, I can use

    documents = TaggedLineDocument(documents)

but that is often not possible.

score 1 · Accepted Answer · answered Jun 22 '18 at 18:49

Gensim's Doc2Vec requires each document to be in the form of an object with a `words` property that is a list of string tokens, and a `tags` property that is a list of tags. These tags are usually strings, but expert users with large datasets can save a little memory by using plain ints, starting from 0, instead.

A class `TaggedDocument` is included that is of the right 'shape', and is used in most of the Gensim documentation/tutorial examples, but given Python's 'duck typing', any object with `words` and `tags` properties will do.

But a plain list won't.
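
To make the 'shape' concrete, here is a minimal sketch of the duck typing involved. The class `MyDoc` and the helper `has_doc2vec_shape` are made-up names for illustration, not part of gensim, and the sketch only checks for the two attributes rather than calling Doc2Vec itself.

```python
class MyDoc:
    """Hypothetical stand-in: any object exposing `words` and `tags`."""

    def __init__(self, words, tags):
        self.words = words  # list of string tokens
        self.tags = tags    # list of tags (strings, or plain ints from 0)


def has_doc2vec_shape(obj):
    """Illustrative check only; this is not a gensim function."""
    return hasattr(obj, "words") and hasattr(obj, "tags")


doc = MyDoc(words=["the", "jury", "said"], tags=["sent_0"])
plain = ["the", "jury", "said"]

print(has_doc2vec_shape(doc))    # True
print(has_doc2vec_shape(plain))  # False

# Reading .words from a plain list raises the AttributeError from the question.
try:
    plain.words
except AttributeError as err:
    print(err)
```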

And if I understand correctly, `brown.tagged_sents()` will return lists of (word, part-of-speech-tag) tuples. That isn't even the kind of list-of-word-tokens that would work as `words`, and it doesn't supply any of the full-document tags that Doc2Vec needs as keys to the doc-vectors that get trained.
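
One way to bridge that gap might look like the sketch below. It assumes each sentence should become one document tagged with its position, which is only one possible choice of tag. The helper name `to_tagged_documents` and the tiny `sample_sents` list (hand-written in the shape of `brown.tagged_sents()` output) are illustrative, and the namedtuple fallback is there only so the sketch runs without gensim installed.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Fallback with the same two fields, used only if gensim is missing.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


def to_tagged_documents(tagged_sents):
    """Turn sentences of (word, pos) tuples into TaggedDocument objects."""
    docs = []
    for i, sent in enumerate(tagged_sents):
        # Keep only the word; drop the part-of-speech tag.
        words = [word for word, _pos in sent]
        # Use the sentence index (a plain int from 0) as the document tag.
        docs.append(TaggedDocument(words=words, tags=[i]))
    return docs


# Hand-written sample in the assumed shape of brown.tagged_sents().
sample_sents = [
    [("The", "AT"), ("jury", "NN"), ("said", "VBD")],
    [("It", "PPS"), ("was", "BEDZ"), ("late", "JJ")],
]

documents = to_tagged_documents(sample_sents)
print(documents[0].words)  # ['The', 'jury', 'said']
print(documents[1].tags)   # [1]
```

With NLTK and the Brown corpus available, `to_tagged_documents(brown.tagged_sents())` should produce a list in the form Doc2Vec expects.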

Separately: it is unlikely you'd want to use `min_count=1`. Discarding very-low-frequency words usually makes the retained Word2Vec/Doc2Vec vectors better.
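
To see what a higher threshold discards, here is a rough illustration in plain Python. It only mimics the effect of a frequency cutoff on the vocabulary and is not how gensim implements it; `surviving_vocab` and the toy token lists are made up for the example.

```python
from collections import Counter


def surviving_vocab(token_lists, min_count):
    """Return the words whose total frequency is at least min_count."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok for tok, n in counts.items() if n >= min_count}


toy_corpus = [
    ["the", "jury", "said"],
    ["the", "jury", "left"],
    ["the", "verdict"],
]

# With min_count=1 every word is kept, including one-off words.
print(sorted(surviving_vocab(toy_corpus, min_count=1)))
# With min_count=2 the words seen only once are dropped.
print(sorted(surviving_vocab(toy_corpus, min_count=2)))
```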

thanks @gojomo ; how would a user convert a document from another format to a TaggedDocument if it is not in a .txt file. a link to the current docs would be helpful. — Lcat, Jun 22 '18 at 20:10
It depends on where the data is coming from. The tutorial notebook at shows in cell 3 a function which, while it reads from a file, constructs `TaggedDocument` instances. You could grab whatever your own `words` and `tags` are and similarly create a list-of-TaggedDocuments. — gojomo, Jun 23 '18 at 01:47
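
As a sketch of that idea for data that is not in a .txt file, the class below wraps in-memory records and yields one `TaggedDocument` per record. The record layout (dicts with `id` and `text` keys), the whitespace tokenization, and the name `RecordCorpus` are all assumptions for illustration. It is written as a restartable iterable rather than a one-shot generator, since the corpus gets read more than once (a vocabulary scan, then the training passes). The namedtuple fallback only exists so the sketch runs without gensim installed.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Fallback with the same two fields, used only if gensim is missing.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


class RecordCorpus:
    """Hypothetical wrapper: iterate over records as TaggedDocuments."""

    def __init__(self, records):
        self.records = records

    def __iter__(self):
        # A fresh pass starts each time iteration begins.
        for rec in self.records:
            words = rec["text"].lower().split()  # naive tokenization
            yield TaggedDocument(words=words, tags=[rec["id"]])


# Made-up records, e.g. rows loaded from a database or a JSON file.
records = [
    {"id": "doc_a", "text": "Doc2Vec needs words and tags"},
    {"id": "doc_b", "text": "A plain list will not work"},
]

corpus = RecordCorpus(records)
first_pass = list(corpus)
second_pass = list(corpus)  # iterating again yields the same documents
print(first_pass[0].words)
print(first_pass[1].tags)
```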
