There's no official support in Gensim, so any approach would involve a lot of custom research-like innovation.
Neural models like the word2vec algorithm (but not Gensim's implementation) have been trained in a very-distributed/parallel fashion – see for example 'Hogwild' & related followup work on asynchronous SGD. Very roughly, many separate simultaneous processes train separately & asynchronously, but keep updating each other intermittently, even without locking – & it works OK. (See more links in this prior answer: https://stackoverflow.com/a/66283392/130288.)
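For a flavor of that lock-free idea (entirely outside Gensim), here's a toy sketch – the `params` array, the fake shards, & the update rule are all made up for illustration:

```python
import threading
import numpy as np

# Toy Hogwild-style sketch: several threads apply sparse SGD-like updates to
# one shared parameter vector with *no* locking; because updates mostly touch
# different indices, the occasional clobbered write does little harm.
params = np.zeros(1000)   # shared parameters, updated in place by all threads
lr = 0.01

def worker(examples):
    for idx, grad in examples:          # each "example" is (param index, gradient)
        params[idx] -= lr * grad        # unsynchronized in-place update

rng = np.random.default_rng(0)
shards = [[(int(rng.integers(0, 1000)), float(rng.normal())) for _ in range(10_000)]
          for _ in range(4)]            # 4 fake per-worker data shards
threads = [threading.Thread(target=worker, args=(shard,)) for shard in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
```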
But:
- still, this is usually done for performance, & within a highly-connected datacenter – not for the sake of keeping separate data sources private between institutions that may be less connected/trusting, or where the shards of data might in fact be very different in vocabulary/word-senses
- there's never been support in Gensim for this - though many years ago, in an older version of Gensim, someone whipped up a kinda sorta demo that purported to do such scatter/merge training via Spark – see https://github.com/dirkneumann/deepdist.
So: it's something a project could try to simulate, or test in practice, though the extra lags/etc of cross-"institution" updates might make it impractical or ineffective. (And, they'd still have to initially agree on a shared vocabulary, which without due care would leak aspects of each institution's data.)
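If someone did want to simulate that scatter/merge style with Gensim models, one rough, untested sketch would be: give every participant the same consensus vocabulary, let each train locally for an epoch, then average the weight arrays. The toy corpora & the choice to average both the input vectors & the negative-sampling output weights here are just illustrative assumptions:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy 'scatter/merge' rounds: each participant trains 1 epoch on its own
# private sentences, then all parameters are averaged. This assumes every
# model was built from the same consensus counts, so weight rows line up.
shared_counts = {"alpha": 50, "beta": 40, "gamma": 30}                    # consensus vocab counts
private_corpora = [[["alpha", "beta"]] * 200, [["beta", "gamma"]] * 200]  # one shard per institution

models = []
for _ in private_corpora:
    m = Word2Vec(vector_size=50, min_count=1, seed=1)
    m.build_vocab_from_freq(shared_counts)   # identical vocabulary everywhere
    models.append(m)

for sync_round in range(5):
    for m, sentences in zip(models, private_corpora):          # local, parallelizable step
        m.train(sentences, total_examples=len(sentences), epochs=1)
    avg_in = np.mean([m.wv.vectors for m in models], axis=0)   # merge step: average input vectors...
    avg_out = np.mean([m.syn1neg for m in models], axis=0)     # ...and negative-sampling output weights
    for m in models:
        m.wv.vectors[:] = avg_in
        m.syn1neg[:] = avg_out
```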
As you note, you could consider an approach where each trains one shared model in serial turns, which could very closely simulate a single training, albeit with the overhead of passing the interim model around, and no parallelism. Roughly:
- share word counts to reach a single consensus vocabulary
- for each intended training epoch, each institution would train one pass on its whole dataset, then pass the model to the next institution
- the calls to `.train()` would manually manage item counts & `alpha`-related values to simulate one single SGD run (see the sketch after this list)
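A minimal sketch of those serial turns, from one institution's perspective – the placeholder data, the `receive_model`/`send_model` transport steps, & the particular alpha schedule are assumptions for illustration, not an established recipe:

```python
from gensim.models import Word2Vec

# One institution's side of the 'serial turns' scheme, written for the
# institution whose turn-order is MY_RANK among N_INSTITUTIONS.
shared_counts = {"word": 100}              # placeholder: consensus {word: count} from step 1
local_sentences = [["word", "word"]] * 50  # placeholder: this institution's private corpus

N_INSTITUTIONS, MY_RANK = 4, 0
EPOCHS = 5
ALPHA, MIN_ALPHA = 0.025, 0.0001
TOTAL_TURNS = EPOCHS * N_INSTITUTIONS      # the 'virtual' single SGD run is this many passes

model = Word2Vec(vector_size=100, min_count=5, workers=4)
model.build_vocab_from_freq(shared_counts)  # identical vocabulary at every institution

for epoch in range(EPOCHS):
    # model = receive_model_from_previous_institution()   # hypothetical transport step
    turn = epoch * N_INSTITUTIONS + MY_RANK                # where this pass sits in the virtual run
    model.train(
        local_sentences,
        total_examples=len(local_sentences),
        epochs=1,
        # manually decay alpha across the whole multi-institution schedule
        start_alpha=ALPHA - (ALPHA - MIN_ALPHA) * turn / TOTAL_TURNS,
        end_alpha=ALPHA - (ALPHA - MIN_ALPHA) * (turn + 1) / TOTAL_TURNS,
    )
    # send_model_to_next_institution(model)                # hypothetical transport step
```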
Note that there'd still be some hints of each institution's relative co-occurrences of terms, which would leak some info about their private datasets – perhaps most clearly on rare terms.
Still, if you weren't in a rush, that'd best simulate a single integrated model training.
I'd be tempted to try to fix the sharing concerns with some other trust-creating process or intermediary. (Is there a 3rd party that each could trust with their data, temporarily? Could a single shared training system be created which could only stream the individual datasets in for training, with no chance of saving/summarizing the full data? Might 4 cloud hosts, each under a separate institution's sole management but physically in a shared facility, effect the above 'serial turns' approach with hardly any overhead?)
There's also the potential to map one model into another: taking a number of shared words as reference anchor points, learning a projection from one model to the other, which allows other non-reference-point words to be moved from one coordinate space to the other. This has been mentioned as a tool for either extending a vocabulary with vectors from elsewhere (eg section 2.2 of the Kiros et al 'Skip-Thought Vectors' paper) or doing language translation (Mikolov et al 'Exploiting Similarities among Languages for Machine Translation' paper).
Gensim includes a `TranslationMatrix` class for learning such projections. Conceivably the institutions could pick one common dataset, or the one institution with the largest dataset, as the creator of some canonical starting model. Then each institution creates its own model based on private data. Then, based on some set of 'anchor words' (that are assumed to have stable meaning across all models, perhaps because they are very common), each of these followup models is projected into the canonical space – allowing words that are unique to one model to be moved into the shared model, or words that vary a lot across models to be projected to contrasting points in the same space (which it might then make sense to average together).
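A rough sketch of that last idea with `TranslationMatrix` – the model filenames, the tiny anchor list, & the use of `.translate()` to surface nearest-neighbors in the canonical space are all just illustrative; check the current Gensim docs for the exact parameters:

```python
from gensim.models import Word2Vec
from gensim.models.translation_matrix import TranslationMatrix

# Hypothetical filenames: the agreed 'canonical' model & one institution's model
canonical = Word2Vec.load("canonical.model")
local = Word2Vec.load("institution_a.model")

# Anchor words assumed to have stable meaning in both models (e.g. very common terms)
anchor_words = ["time", "people", "system", "day", "work"]    # illustrative list only
word_pairs = [(w, w) for w in anchor_words
              if w in local.wv.key_to_index and w in canonical.wv.key_to_index]

# Learn a linear projection from the local space into the canonical space
transmat = TranslationMatrix(local.wv, canonical.wv, word_pairs=word_pairs)
transmat.train(word_pairs)

# For a word known only to the local model, list its nearest neighbors after
# projection into the canonical space
print(transmat.translate(["localonlyterm"], topn=5))
```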