
I need to run word2vec multiple times over a period of time. For example, I will be running word2vec once every month. To reduce the computing workload I would like to run word2vec only on the data that accumulated during the last month. My problem stems from the fact that, for further processing, I also require the embeddings from the models I ran in previous months.

I know, also from reading other posts, that if the individual word2vec models are trained on different samples, each of which is not a representative sample of an overarching corpus, then the resulting word embeddings are not comparable. I have a similar problem: I am analysing network data which evolves over time (effectively doing a kind of graph2vec, but analysing node behaviour).

Yet I've been wondering whether comparable embeddings could be achieved using PCA as follows:

  • all models create "node" embeddings of length x
  • for each model:
    • run PCA on the "node" embeddings and retain all x principal components, with whitening enabled
    • transform the individual "node" embeddings to their corresponding PCA coordinates
  • since the individual samples used to train the individual models share a high proportion of nodes (existing nodes tend to stay and new ones are likely to be added), do the following:
    • append all PCA-transformed embeddings into one database
    • for each nodeID, calculate the mean PCA-transformed embedding

This would only work if the PCA transformation of the embeddings of each model ensures that the resulting embeddings measure the "same thing". For example, the first principal component of each PCA should capture the same kind of information, and so on. And that's what I'm not sure about.
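For concreteness, here is a rough sketch of what I have in mind, using gensim and scikit-learn (the model file names are just placeholders for my monthly models, and the "words" in each model are node IDs):

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.decomposition import PCA

    # One model per period, each trained only on that period's data.
    monthly_models = [Word2Vec.load(p) for p in ["model_2023_01.w2v", "model_2023_02.w2v"]]

    # Collect the PCA-whitened embeddings of every node across all periods.
    transformed = {}  # node_id -> list of PCA-transformed vectors
    for model in monthly_models:
        node_ids = model.wv.index_to_key                      # the "words" are node IDs
        vectors = model.wv[node_ids]                          # shape: (n_nodes, x)
        pca = PCA(n_components=model.wv.vector_size, whiten=True)
        coords = pca.fit_transform(vectors)                   # keep all x components
        for node_id, coord in zip(node_ids, coords):
            transformed.setdefault(node_id, []).append(coord)

    # Per nodeID, average the PCA coordinates collected from the individual models.
    mean_embedding = {nid: np.mean(vecs, axis=0) for nid, vecs in transformed.items()}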

1 Answer


I don't think you could count on the principal components from separate runs aligning like that.

With an "anchor set" of correlated words, you can learn a useful transformation from one word2vec space into another, to project words only known in one to another. This is most commonly used in language translation, but could be used for period-to-period transformations, where some set of common words for which you are willing to say "these should be considered 100% equivalent" would be the anchors. See the translation_matrix.ipynb demo notebook, bundled in the gensim docs/notebooks directory or viewable online here, for an example of using the TranslationMatrix utility class.

For the strongest model, you will want to occasionally re-train a model on all data, from all periods. It's only words that are co-trained, in an interleaved fashion, that wind up in fully comparable positions. (Any 'cheating'/optimization by training on just a subset will tend to pull the words that appear in that subset to positions optimal for just that subset, and away from other words whose full range of uses only appears in the full corpus. Compare the idea of 'catastrophic forgetting'.)

Without knowing for sure that it would work, I might try a mixed approach including:

  • an initial big training with as much data as possible, to start with the strongest possible model and largest possible vocabulary

  • if a full-data retraining isn't possible each period, instead create a corpus with (a) all new data; and (b) some random sample, as large as can be managed, of older data. Train a model on this corpus, but pre-initialize the model with as many sharable vectors/weights from "the big model" as possible, so that most shared/frequent words start in a useful, and compatible, alignment (a rough sketch of this pre-initialization follows the list)

  • after the per-period training, take the "new words" (first appearing in this period), and perhaps any "highly influenced words" (where this period contributes a large share of a word's total appearances), from the "new model", and, holding all other words as anchors, translate the new-model words back into the big-model space. Now they should be comparable to all the old words. (And, to the extent that non-anchor words appear in both models but back-project to somewhere different, that might be an interesting indicator of usage/meaning drift.)
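For the pre-initialization step above, a minimal gensim sketch (assuming gensim 4.x; big_model and period_corpus are placeholders for the full model and the new period's training data) might look like:

    from gensim.models import Word2Vec

    big_model = Word2Vec.load("big_model.w2v")        # occasionally fully retrained model
    period_corpus = [["node_a", "node_b", "node_c"]]  # placeholder: new + sampled-old sentences

    period_model = Word2Vec(vector_size=big_model.wv.vector_size, min_count=1)
    period_model.build_vocab(period_corpus)

    # Seed the new model's input vectors with the big model's vectors for every
    # shared word, so shared/frequent words start in a compatible alignment.
    # (Hidden-layer weights still start fresh; this is only a partial transfer.)
    for word in period_model.wv.index_to_key:
        if word in big_model.wv:
            period_model.wv.vectors[period_model.wv.get_index(word)] = big_model.wv[word]

    period_model.train(period_corpus,
                       total_examples=period_model.corpus_count,
                       epochs=period_model.epochs)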

(This problem of growing/adapting word-vectors over time has some similarity to attempts to "fine-tune" more generic/public word-vectors to match some other corpus, so searching for published work about "word-vector fine-tuning" may turn up more relevant ideas.)

gojomo