
I am new to doc2vec and I wish to classify a set of texts using it.

I am confused about TaggedDocument and TaggedLineDocument.

1) What is the difference between the two? Is it that TaggedLineDocument is a collection of TaggedDocuments?

2) If I have a directory containing all the files, how do I generate feature vectors for them? Should I create a new file in which each line contains the text of one file from the directory?

dfault
  • Better to ask two separate questions and link them if necessary. "How to work with files in a directory?" says nothing about what you want to accomplish. – vfclists Jul 12 '17 at 04:55

1 Answer


TaggedDocument is an illustrative class to represent objects that Doc2Vec can take as text examples. You don't need to use it – you just need to provide objects that have a words property that is a list of string tokens, and a tags property that's a list of tags to be associated with the document. (That is, you can provide your text examples as an object that is 'shaped' or 'duck-typed' like TaggedDocument.)
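To make the "duck-typed" point concrete, here is a minimal sketch that builds such an object with only the standard library. The name `MyDoc` and the sample tokens and tag are made up for illustration; the only thing taken from the description above is the shape: a `words` list of string tokens and a `tags` list.

```python
from collections import namedtuple

# Hypothetical stand-in for TaggedDocument: any object exposing `words`
# (a list of string tokens) and `tags` (a list of tags) has the same shape.
MyDoc = namedtuple("MyDoc", ["words", "tags"])

doc = MyDoc(words=["doc2vec", "learns", "document", "vectors"], tags=["doc_0"])

print(doc.words)
print(doc.tags)
```

In real code you would more likely just import TaggedDocument from gensim; the sketch only shows that nothing beyond those two properties is required.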

TaggedLineDocument is a utility class for taking a file that has one document per line, whose word tokens are already whitespace-delimited, and turning it into an iterable collection of TaggedDocuments, where each doc has as its only tag its integer line number. As such, it's a minimal example of streaming texts to Doc2Vec, for the common case of a single doc-per-line text file as input, and no need for custom per-doc tags/IDs.
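The behaviour described above can be imitated in a few lines. This is a rough standard-library sketch of the idea, not gensim's actual implementation; `LineDoc`, `iter_line_docs`, and the sample file contents are names and data invented for the example.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical TaggedDocument-shaped record.
LineDoc = namedtuple("LineDoc", ["words", "tags"])


def iter_line_docs(path):
    """Yield one document per line, tagged with its 0-based line number."""
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle):
            # Tokens are assumed to be whitespace-delimited already.
            yield LineDoc(words=line.split(), tags=[line_no])


# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    corpus_path = os.path.join(tmp, "corpus.txt")
    with open(corpus_path, "w", encoding="utf-8") as out:
        out.write("the first document\nthe second document\n")
    line_docs = list(iter_line_docs(corpus_path))

for d in line_docs:
    print(d.tags, d.words)
```

Note that a plain generator like this can only be consumed once, so it would need to be wrapped in a re-iterable object before being used for anything that makes several passes over the data.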

If your data is in other formats, you can't use TaggedLineDocument directly, but it might be a useful starting point. If you're OK with simple tags numbered from 0 to the count of documents, you could transform your format to the single file TaggedLineDocument expects.
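For the directory-of-files case from the question, one possible way to produce that single file is sketched below. It assumes each file is UTF-8 text that should become exactly one line; `write_one_doc_per_line` and the file names are hypothetical. The function returns the file names in line order so that a line-number tag can be mapped back to its source file.

```python
import os
import tempfile


def write_one_doc_per_line(src_dir, dest_path):
    """Write each file in src_dir as one line of dest_path.

    Returns the file names in the order written, so index i in the
    returned list corresponds to line i (and integer tag i).
    """
    names = sorted(
        name for name in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, name))
    )
    with open(dest_path, "w", encoding="utf-8") as out:
        for name in names:
            with open(os.path.join(src_dir, name), encoding="utf-8") as src:
                # Collapse all whitespace (including newlines) to single spaces.
                out.write(" ".join(src.read().split()) + "\n")
    return names


# Small demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    docs_dir = os.path.join(tmp, "docs")
    os.mkdir(docs_dir)
    with open(os.path.join(docs_dir, "a.txt"), "w", encoding="utf-8") as f:
        f.write("hello\nworld\n")
    with open(os.path.join(docs_dir, "b.txt"), "w", encoding="utf-8") as f:
        f.write("second   file")
    flat_path = os.path.join(tmp, "all_docs.txt")
    order = write_one_doc_per_line(docs_dir, flat_path)
    with open(flat_path, encoding="utf-8") as f:
        flattened_lines = f.read().splitlines()

print(order)
print(flattened_lines)
```

The resulting file is in the one-document-per-line layout that TaggedLineDocument expects, with tag 0 referring to `order[0]`, tag 1 to `order[1]`, and so on.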

Alternatively, and especially if you need to use custom tags, you'd write your own class that turns your data source – whether it's a set of files or a network resource or a database – into an iterable object which emits one TaggedDocument-like object for each example.
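A minimal sketch of such a class for a directory of text files follows. Everything here is an assumption made for illustration: the names `DirectoryCorpus` and `FileDoc`, the choice of the file name as the custom tag, and plain whitespace splitting as the tokenizer. Swap in whatever tags and tokenization suit your data.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical TaggedDocument-shaped record.
FileDoc = namedtuple("FileDoc", ["words", "tags"])


class DirectoryCorpus:
    """Re-iterable corpus: each file in a directory becomes one document."""

    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        # A fresh pass starts every time iteration begins, so the corpus
        # can be read more than once.
        for name in sorted(os.listdir(self.dir_path)):
            path = os.path.join(self.dir_path, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf-8") as handle:
                # Custom tag: the file name. Tokenizer: whitespace split.
                yield FileDoc(words=handle.read().split(), tags=[name])


# Small demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.txt", "hello world"), ("b.txt", "second file here")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)
    corpus = DirectoryCorpus(tmp)
    first_pass = [d.tags[0] for d in corpus]
    second_pass = [d.tags[0] for d in corpus]
    first_doc = next(iter(corpus))

print(first_pass, second_pass)
```

Using a class with `__iter__` rather than a bare generator is deliberate: the object can be iterated repeatedly, which matters when a consumer reads the corpus once to build a vocabulary and then again for each training pass.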

gojomo