You can think of word 'vectors', numerically, as just points. It's not really significant that they all 'start' at the origin ([0.0, 0.0, 0.0, ..., 0.0]).
The 'center' of any such vector is just its midpoint, which is also a vector with the same 'directionality' but half the magnitude. Often, but not always, word-vectors are compared only in terms of raw direction, not magnitude, via 'cosine similarity', which is essentially an angle-of-difference calculation that's oblivious to length/magnitude. (So, cosine_similarity(a, b) will be the same as cosine_similarity(a/2, b) or cosine_similarity(a, b*4), etc.) So this 'center'/half-length vector you've asked about is usually less meaningful with word-vectors than in other vector models. And in general, as long as you're using cosine-similarity as your main method of comparing vectors, scaling them closer to the origin point is irrelevant. So, in that framing, the origin point doesn't really have a distinct meaning.
Caveat with regard to magnitudes: the actual raw vectors created by word2vec training do in fact have a variety of magnitudes. Some have observed that these magnitudes sometimes correlate with interesting word differences – for example, highly polysemous words (with many alternate meanings) can often be lower-magnitude than words with one dominant meaning – as the need to "do something useful" in alternate contexts tugs the vector between extremes during training, leaving it more "in the middle". And while word-to-word comparisons usually ignore these magnitudes for the purely angular cosine-similarity, sometimes downstream uses, such as text classification, may do incrementally better keeping the raw magnitudes.
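If you want to peek at those raw magnitudes yourself, here's a rough sketch assuming a gensim-4.x-style Word2Vec model (the toy corpus is far too small for the magnitude patterns to mean anything; it just shows where the numbers live):

```python
import numpy as np
from gensim.models import Word2Vec

# tiny illustrative corpus, repeated so training has something to chew on
sentences = [
    ['the', 'bank', 'approved', 'the', 'loan'],
    ['the', 'river', 'bank', 'was', 'muddy'],
    ['photosynthesis', 'converts', 'light', 'into', 'energy'],
] * 50
kv = Word2Vec(sentences, vector_size=32, min_count=1, epochs=10, seed=1).wv

for word in ['bank', 'photosynthesis']:
    raw = kv[word]                    # raw trained vector, magnitude intact
    unit = raw / np.linalg.norm(raw)  # what cosine-similarity effectively compares
    print(word, np.linalg.norm(raw), np.linalg.norm(unit))  # second value is always ~1.0
```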
Caveat with regard to the origin point: At least one paper, "All-but-the-Top: Simple and Effective Postprocessing for Word Representations" by Mu, Bhat, & Viswanath, has observed that the 'average' of all word-vectors is often not the origin point, but significantly biased in one direction – which (in my stylized understanding) sort-of leaves the whole space imbalanced, in terms of whether it's using 'all angles' to represent contrasts-in-meaning. (Also, in my experiments, the extent of this imbalance seems to be a function of how many negative examples are used in negative-sampling.) They found that postprocessing the vectors to recenter them improved performance on some tasks, but I've not seen many other projects adopt this as a standard step. (They also suggest some other postprocessing transformations to essentially 'increase contrast in the most valuable dimensions'.)
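If you want to try that recentering yourself, here's a rough numpy sketch of the paper's idea (recenter, then strip the top principal components); the function name and the n_components default are my own choices, and the paper suggests a number of components on the order of the dimensionality divided by 100:

```python
import numpy as np

def all_but_the_top(vectors, n_components=2):
    # step 1: recenter so the average vector sits at the origin
    centered = vectors - vectors.mean(axis=0)
    # step 2: find the dominant directions via SVD (equivalent to PCA here)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]                    # (n_components, dim)
    # step 3: remove each vector's projection onto those dominant directions
    return centered - centered @ top.T @ top

# e.g. on a fake (vocab_size, dim) array of vectors with a deliberate offset:
vectors = np.random.randn(1000, 50) + 0.5
processed = all_but_the_top(vectors)
print(np.abs(processed.mean(axis=0)).max())    # average is now ~0 in every dimension
```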
Regarding your "IIUC", yes, words are given starting vectors - but these are random, and then constantly adjusted via backprop-nudges, repeatedly after trying each training example in turn, to make those 'input word' vectors ever-so-slightly better as inputs to the neural network that's trying to predict nearby 'target/center/output' words. Both the network's 'internal'/'hidden' weights are adjusted, and the input vectors themselves, which are essentially 'projection weights' – from a one-hot representation of a single vocabulary word 'to' the M different internal hidden-layer nodes. That is, each 'word vector' is essentially a word-specific subset of the neural network's internal weights.
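To make that 'projection weights' framing concrete, here's a stylized sketch of the shapes involved (not gensim's or any other library's actual internals):

```python
import numpy as np

vocab_size, hidden_size = 10000, 300  # hidden_size is the M hidden-layer nodes above

# the 'input' projection weights: one row per vocabulary word
W_in = np.random.randn(vocab_size, hidden_size) * 0.01

word_index = 42                 # some word's position in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

hidden = one_hot @ W_in                       # the hidden-layer activation for that word...
assert np.allclose(hidden, W_in[word_index])  # ...is literally that word's row of weights

# so 'the word's vector' and 'that word's slice of the network's weights' are the same thing,
# and each backprop nudge to that row is a nudge to the word-vector itself
```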