In short, I need to run multiple word2vec models over a period of time. For example, I will run word2vec once every month. To reduce the computing workload I would like to train each month's model only on the data that was accumulated during that month. My problem stems from the fact that, for further processing, I also require the embeddings from the models I ran in previous months.
I know, also from reading other posts, that if individual word2vec models are trained on different samples, none of which is a representative sample of an overarching corpus, the resulting word embeddings are not directly comparable. I face a similar problem: I am analysing network data that evolves over time (effectively doing a kind of graph2vec, but analysing node behaviour).
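For context, here is a minimal sketch of the monthly setup, assuming gensim (version 4 or later) is used for word2vec; the corpus variable and file name are just illustrative placeholders:

```python
from gensim.models import Word2Vec

# Toy stand-in for one month's newly accumulated "sentences" (illustrative only).
monthly_sentences = [["node_a", "node_b", "node_c"], ["node_b", "node_d"]]

# One independent model per month; as noted above, the vectors of different
# months' models live in different latent spaces and are not directly comparable.
model = Word2Vec(sentences=monthly_sentences, vector_size=64, window=5,
                 min_count=1, workers=1, seed=1)
model.save("word2vec_month_01.model")

# Node embeddings of this month's model, keyed by nodeID.
embeddings_by_node = {w: model.wv[w] for w in model.wv.index_to_key}
```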
Yet I've been wondering if comparable embeddings can be achieved using PCA as follows:
- all models create "node" embeddings of length x
- for each model:
  - run PCA on the "node" embeddings and retain all x principal components, with whitening enabled
  - transform the individual "node" embeddings into their corresponding PCA coordinates
- since the samples used to train the individual models share a high proportion of nodes (existing nodes tend to stay and new ones are likely to be added), do the following:
  - append all PCA-transformed embeddings into one database
  - calculate the mean PCA-transformed embedding per nodeID (a rough sketch of the whole procedure follows this list)
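Here is a rough sketch of what I have in mind, assuming each monthly model's node embeddings are available as a dict mapping nodeID to vector (the function names and the pandas-based bookkeeping are just illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def pca_transform_model(embeddings_by_node, dim):
    """Whiten one model's node embeddings with PCA, keeping all `dim` components."""
    node_ids = list(embeddings_by_node.keys())
    X = np.vstack([embeddings_by_node[n] for n in node_ids])
    pca = PCA(n_components=dim, whiten=True)
    return pd.DataFrame(pca.fit_transform(X), index=node_ids)

def combine_models(monthly_models, dim):
    """Append the PCA-transformed embeddings of all models and average per nodeID."""
    combined = pd.concat([pca_transform_model(m, dim) for m in monthly_models])
    return combined.groupby(level=0).mean()

# Toy example with two "months" sharing most of their nodes (illustrative only).
rng = np.random.default_rng(0)
month_1 = {f"node_{i}": rng.normal(size=8) for i in range(20)}
month_2 = {f"node_{i}": rng.normal(size=8) for i in range(5, 25)}
mean_embeddings = combine_models([month_1, month_2], dim=8)
print(mean_embeddings.shape)  # (25, 8): one averaged embedding per nodeID
```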
This would only work if the PCA transformation of each model's embeddings ensured that the resulting coordinates measure the "same thing", i.e. the first principal component of every model should capture the same kind of information, the second one likewise, and so on. And that is exactly what I'm not sure about.