TaggedDocument is an illustrative class representing the objects that Doc2Vec can take as text examples. You don't need to use it; you just need to provide objects that have a words property that is a list of string tokens, and a tags property that is a list of tags to be associated with the document. (That is, you can provide your text examples as any object that is 'shaped' or 'duck-typed' like TaggedDocument.)
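As a rough illustration of that duck-typing, here is a minimal sketch that uses only the standard library. The name MyDoc and the sample texts are made up for the example. The only requirement taken from the description above is that each object exposes words and tags.

```python
from collections import namedtuple

# Hypothetical stand-in, not gensim's own class: any object exposing
# `words` (a list of string tokens) and `tags` (a list of tags) has the shape
# described above.
MyDoc = namedtuple("MyDoc", ["words", "tags"])

# Made-up sample data: document ID -> raw text.
raw_texts = {
    "doc_a": "the quick brown fox",
    "doc_b": "jumps over the lazy dog",
}

# Tokenize on whitespace and tag each document with its ID.
corpus = [MyDoc(words=text.split(), tags=[doc_id])
          for doc_id, text in raw_texts.items()]

for doc in corpus:
    print(doc.tags, doc.words)
```

A list like corpus is the kind of collection you would then hand to Doc2Vec in place of TaggedDocument objects.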
TaggedLineDocument is a utility class that takes a file with one document per line, whose tokens are already whitespace-delimited, and turns it into an iterable collection of TaggedDocument objects, where each doc's only tag is its integer line number. As such, it's a minimal example of streaming texts to Doc2Vec, for the common case of a single doc-per-line text file as input with no need for custom per-doc tags/IDs.
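To make the behaviour just described concrete, here is a sketch of the same idea in plain Python. It is not gensim's actual implementation. The names LineDoc and LinePerDocCorpus are hypothetical, and the temporary file exists only so the demo runs.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in with the same two fields as a TaggedDocument.
LineDoc = namedtuple("LineDoc", ["words", "tags"])


class LinePerDocCorpus:
    """Restartable iterable: every call to __iter__ re-reads the file from the top."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as fh:
            for line_no, line in enumerate(fh):
                # Tokens are assumed to be whitespace-delimited already;
                # the only tag is the zero-based line number.
                yield LineDoc(words=line.split(), tags=[line_no])


# Demo on a throwaway two-line file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first document tokens\nsecond document here\n")
    demo_path = tmp.name

line_corpus = LinePerDocCorpus(demo_path)
first_pass = list(line_corpus)
second_pass = list(line_corpus)  # a second pass yields the same documents
os.remove(demo_path)
```

Reading the file inside __iter__, rather than once in the constructor, means the texts are streamed from disk instead of held in memory.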
If your data is in other formats, you can't use TaggedLineDocument directly, but it might be a useful starting point. If you're OK with simple tags numbered from 0 to the count of documents, you could transform your format to the single file TaggedLineDocument expects.
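One possible way to do that conversion is sketched below. The helper name write_one_doc_per_line and the sample texts are hypothetical. It also assumes that splitting on whitespace is an acceptable tokenization for your data.

```python
import os
import tempfile


def write_one_doc_per_line(texts, out_path):
    """Write each text on its own line, collapsing all internal whitespace
    (including newlines) to single spaces. Returns the number of lines written.

    An empty text still produces an (empty) line, so line numbers keep
    matching the position of each text in the input.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for text in texts:
            out.write(" ".join(text.split()) + "\n")
            count += 1
    return count


# Demo: two made-up documents, one of which spans several lines.
sample_texts = ["A multi-line\ndocument  here", "Another\tone"]

fd, out_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
n_written = write_one_doc_per_line(sample_texts, out_path)

with open(out_path, encoding="utf-8") as fh:
    written_lines = fh.read().splitlines()
os.remove(out_path)

print(written_lines)
```

The resulting file has exactly one document per line, so the line number of each document can serve as its tag.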
Alternatively, and especially if you need to use custom tags, you'd write your own class that turns your data source (whether it's a set of files, a network resource, or a database) into an iterable object that emits one TaggedDocument-like object for each example.
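As a sketch of such a class, the example below treats each file in a directory as one document and uses the file name as a custom tag. The names FileDoc and DirectoryCorpus, the one-file-per-document layout, and whitespace tokenization are all assumptions for illustration. Adapt them to your own data source.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in with the same two fields as a TaggedDocument.
FileDoc = namedtuple("FileDoc", ["words", "tags"])


class DirectoryCorpus:
    """Iterable over a directory: one document per file, tagged by file name."""

    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        # Sorted for a stable order; the directory is re-listed on every pass.
        for name in sorted(os.listdir(self.dir_path)):
            full_path = os.path.join(self.dir_path, name)
            if not os.path.isfile(full_path):
                continue
            with open(full_path, encoding="utf-8") as fh:
                words = fh.read().split()
            yield FileDoc(words=words, tags=[name])


# Demo on a throwaway directory holding two small files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for file_name, text in [("b.txt", "second file text"),
                            ("a.txt", "first file text")]:
        with open(os.path.join(tmp_dir, file_name), "w", encoding="utf-8") as fh:
            fh.write(text)
    dir_docs = list(DirectoryCorpus(tmp_dir))

for doc in dir_docs:
    print(doc.tags, doc.words)
```

Writing this as a class with __iter__, rather than as a one-shot generator, matters because training typically makes more than one pass over the corpus. A plain generator would be exhausted after the first pass.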