
I am using the Doc2Vec class of gensim in Python to convert a document to a vector.

An example of usage:

model = Doc2Vec(documents, size=100, window=8, min_count=5, workers=4)

How should I interpret the size parameter? I know that if I set size = 100, the length of the output vector will be 100, but what does that mean? For instance, if I increase size to 200, what is the difference?

– mamatv

2 Answers


Word2Vec learns a distributed representation of a word, which essentially means that multiple neurons capture a single concept (a concept can be word meaning, sentiment, part of speech, etc.), and a single neuron also contributes to multiple concepts.

These concepts are learnt automatically rather than pre-defined, hence you can think of them as latent/hidden. For the same reason, the word vectors can be used for multiple applications.

The larger the size parameter, the greater the capacity of your neural network to represent these concepts, but the more data will be required to train the vectors (since they are initialised randomly). In the absence of a sufficient number of sentences or of computing power, it's better to keep the size small.

Doc2Vec uses a slightly different neural network architecture than Word2Vec, but the meaning of size is analogous.

– kampta
  • Hello, you mean ``size`` is the number of neurons in the neural network Doc2Vec uses to train and output the vector? – mommomonthewind Jun 16 '16 at 14:21
  • The number of neurons in each layer of the neural network will depend on the architecture, whether DBOW or DM. Check out the paper (mentioned in the answer) – kampta Jun 17 '16 at 02:53

The difference is the level of detail that the model can capture. Generally, the more dimensions you give Word2Vec, the better the model - up to a certain point.

Normally the size is between 100 and 300. You always have to consider that more dimensions also mean more memory is needed.
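To make the memory point concrete, here is a back-of-the-envelope estimate (the vocabulary size and float32 storage are my assumptions, not from the answer): the embedding matrix alone stores one vector per vocabulary item.

```python
# Rough memory estimate for the embedding matrix alone, assuming
# float32 weights (4 bytes) and one vector per vocabulary item.
def embedding_mb(vocab_size: int, dims: int, bytes_per_float: int = 4) -> float:
    """Megabytes needed to store vocab_size vectors of length dims."""
    return vocab_size * dims * bytes_per_float / (1024 ** 2)

# A hypothetical 100k-word vocabulary:
print(round(embedding_mb(100_000, 100), 1))  # 100 dims -> ~38.1 MB
print(round(embedding_mb(100_000, 300), 1))  # 300 dims -> ~114.4 MB
```

Tripling the dimensions triples the memory (and the per-vector compute), which is why size is usually kept in the low hundreds.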

– Saytiras
  • Hello, thank you very much for your comment. But my question is, what does the model "capture"? For instance, in a TF model, if I set size = 100, it will return the 100 most frequent words - that's easy to understand. But in Doc2Vec, I do not really understand. – mamatv Jan 23 '16 at 12:50
  • The problem is that you simply can't say what effects more dimensions will have. You have to look at it in a different way. When you have 100 dimensions, you have only 100 variables to model the relationships of a word. But with 300 dimensions you have 300. So in theory it can capture more detail, because it has more variables to play with during training. Or short: Tweet vs Book, where would you find a more detailed overview over a topic? :D – Saytiras Jan 24 '16 at 23:07
  • Hello @Saytiras, I totally understand it :), but my question is, what does "100" mean. For instance, as I said, in TF model, 100 means 100 most frequent words in the text, so if I change the parameter to 200, it will return me 200 most frequent words. But in Doc2Vec, what does it really mean, in technical language? – mamatv Jan 25 '16 at 11:54
  • 4
    A size of 100 means the vector representing each document will contain 100 elements - 100 values. The vector maps the document to a point in 100 dimensional space. A size of 200 would map a document to a point in 200 dimensional space. The more dimensions, the more differentiation between documents. Image you only had a size of 2. This would map all documents into a 2D plane. It would soon get very crowded and would not provide a meaningful representation between each document. – John Wakefield Feb 15 '16 at 00:02
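The "crowded plane" intuition above can be sketched in plain Python (the vectors here are made up for illustration, not learned by Doc2Vec): document vectors are just points, and cosine similarity compares their directions. With only 2 dimensions, distinct documents quickly end up pointing almost the same way; extra dimensions leave room to tell them apart.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two different documents squeezed into 2-D look nearly identical:
print(round(cosine([1.0, 0.9], [0.9, 1.0]), 3))

# The same first two coordinates, but two extra dimensions now
# separate the documents clearly:
print(round(cosine([1.0, 0.9, 0.0, 0.0], [0.9, 1.0, 1.0, -1.0]), 3))
```

The 2-D pair comes out almost perfectly similar, while the 4-D pair is noticeably less so: the added dimensions give the model more variables in which to encode differences, which is exactly what a larger size buys.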