For example, if I choose two window sizes, 5 and 50, and train a word2vec model with each, will the 50 one take more time to train? Will the embeddings from the 50 one concentrate more on the semantics of the text, and those from the 5 one more on single words? BTW, the two questions above are just examples of what I am looking for. My real question is just the title: "How does the window size affect word2vec, and how do we choose the window size for different tasks?"
1 Answer
A larger window will take longer to train.
A larger window will have a stronger effect on runtime in 'skip-gram' mode, where a larger window means more individual center-word predictions & error-backpropagations. It'll have a milder effect on runtime in 'CBOW' mode, where it just means more averaging of input-vectors and fan-out of the final effects for each prediction/backpropagation.
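To make the runtime difference concrete, here is a rough sketch that counts how many training examples a single 100-token sentence produces at window sizes 5 and 50. The function name `count_training_examples` is made up for illustration, and the sketch is simplified: it assumes the full window is always used, whereas real word2vec implementations randomly shrink the effective window per position and may also downsample frequent words, so actual counts will be lower.

```python
def count_training_examples(sentence_len: int, window: int) -> tuple[int, int]:
    """Count training examples for one sentence (illustrative helper, not a library API).

    Returns (skipgram_pairs, cbow_predictions), assuming the full window
    is used at every position (no random window shrinking, no subsampling).
    """
    skipgram_pairs = 0
    cbow_predictions = 0
    for center in range(sentence_len):
        left = max(0, center - window)
        right = min(sentence_len - 1, center + window)
        n_context = right - left  # neighbours in range, excluding the center word
        # Skip-gram: one prediction + backpropagation per (center, context) pair.
        skipgram_pairs += n_context
        # CBOW: one prediction per position, over the averaged context vectors.
        if n_context > 0:
            cbow_predictions += 1
    return skipgram_pairs, cbow_predictions


for w in (5, 50):
    pairs, preds = count_training_examples(100, w)
    print(f"window={w}: skip-gram pairs={pairs}, CBOW predictions={preds}")
# window=5: skip-gram pairs=970, CBOW predictions=100
# window=50: skip-gram pairs=7450, CBOW predictions=100
```

In this simplified count, the number of skip-gram pairs grows with the window while the number of CBOW predictions stays fixed; CBOW still does more work per prediction with a larger window, because more input vectors are averaged and updated.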
For how it affects the character of the resulting word-vectors, there's some discussion & a related research paper in a prior answer: "Word2Vec: Effect of window size used".
Generally, you'd optimize the window value the same as any other tunable parameter, by devising some repeatable way to score the final word-vectors on your real task (or a close/correlated simulation), then trying a range of values to see which scores best on your evaluation.
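That tuning loop could be sketched roughly as follows. Everything named here (`sweep_window`, `toy_train`, `toy_score`) is hypothetical: in practice `train_fn` would train a word2vec model with the given window (for example via the `window` parameter of gensim's `Word2Vec`), and `score_fn` would run your own repeatable task-specific evaluation. The toy stand-ins below exist only so the sketch runs on its own.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def sweep_window(
    candidates: Iterable[int],
    train_fn: Callable[[int], Any],
    score_fn: Callable[[Any], float],
) -> Tuple[int, Dict[int, float]]:
    """Train once per candidate window and return (best_window, all_scores)."""
    scores: Dict[int, float] = {}
    for window in candidates:
        model = train_fn(window)          # assumed: trains word-vectors with this window
        scores[window] = score_fn(model)  # assumed: higher score means better on your task
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins (NOT real training or evaluation): the "model" is just the
# window value, and the "score" is an arbitrary curve that peaks at 10.
def toy_train(window: int) -> int:
    return window


def toy_score(model: int) -> float:
    return -float((model - 10) ** 2)


best, scores = sweep_window([2, 5, 10, 20, 50], toy_train, toy_score)
print(best, scores)
```

Because word2vec training is randomized, it may be worth averaging the score over a few runs per window value before picking a winner.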
