
Imagine that, based on some criteria, we have three vectors (vec1, vec2, vec3) for the word king, and we call these the local vectors for king. Which method is sufficient to generate a global (single, unique) vector for the word king from these three local vectors (vec1, vec2, vec3) that can be used in a downstream task? There are three possible options:

Concat(vec1, vec2, vec3) 
average(vec1, vec2, vec3) 
sum(vec1, vec2, vec3) 
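For concreteness, the three combination strategies can be sketched in NumPy (the vector values here are made up, and equal dimensionality is assumed):

```python
import numpy as np

# Hypothetical local vectors for "king", all of the same dimensionality
vec1 = np.array([0.1, 0.2, 0.3])
vec2 = np.array([0.4, 0.5, 0.6])
vec3 = np.array([0.7, 0.8, 0.9])

concat = np.concatenate([vec1, vec2, vec3])  # shape (9,): keeps every coordinate
avg = np.mean([vec1, vec2, vec3], axis=0)    # shape (3,): element-wise average
total = np.sum([vec1, vec2, vec3], axis=0)   # shape (3,): element-wise sum
```

Note that sum and average differ only by a constant factor, so for cosine-similarity-based downstream tasks they are effectively interchangeable; concatenation is the only option that changes the dimensionality.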

Are they sufficient? Why, or why not? Any references?

1 Answer

You haven't stated how those 3 vectors were created, and that matters. If the way they were created means they all share, in some important sense, the "same coordinate system", then it might be appropriate to add or average them.

But if they're derived in unrelated ways, so that their individual coordinates aren't part of the same self-consistent, comparable system, then concatenation makes more sense: it preserves their individual information, forwarding it all to downstream steps without any assumptions about what's more important, and without allowing any 'cancelling-out' of positional info from the random/arbitrary interaction of unrelated coordinate systems.

Also, if vec1, vec2, and vec3 are of different dimensionalities, concatenation always works, but sum/average won't.
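A quick illustration of that last point, with made-up vectors of mismatched sizes:

```python
import numpy as np

# Vectors of different dimensionalities: concatenation still works,
# but element-wise sum/average is undefined.
vec1 = np.array([0.1, 0.2])        # 2-dimensional
vec2 = np.array([0.3, 0.4, 0.5])   # 3-dimensional

combined = np.concatenate([vec1, vec2])  # shape (5,)

try:
    vec1 + vec2  # incompatible 1-d shapes cannot be broadcast together
except ValueError as e:
    print("sum fails:", e)
```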

(I could possibly give more reasoning if you added more concrete information about the different sources of vec1, vec2, vec3.)

gojomo
  • Thank you. Imagine we have three datasets d1, d2, d3. The word king appears in all three. We want to track the semantic change of the word king across these three datasets. To that end, we generate three different semantic vectors for the term king, based on the context words co-occurring with king in each dataset. So we assign the same initial random vector, of a fixed dimension, to v1, v2, v3, and update each of these three vectors by scanning through d1, d2, d3, respectively. – sezar sampaio Aug 26 '20 at 21:25
  • If you train those 3 datasets separately, without any effort to keep them "in alignment", the various sources of randomization could send the `king` vectors to arbitrarily-different ending locations, even if in truth it has essentially or identically the same meaning. (You'll see this even repeating training with the exact same data. Ensuring identical initialization might offset this somewhat, but in an unquantifiable way, such that I'd not want to count on it.) In such a case, you might want to instead learn transformations between the independent spaces, perhaps based on some choice of… – gojomo Aug 26 '20 at 22:10
  • …'anchor words' that you have good reason to assume *don't* change in meaning between datasets. There's a class `TranslationMatrix` in gensim, with an example notebook, that can learn such a mapping, which may also be useful for machine translation. Or, given that word2vec models are data-hungry and you expect most words to be similar across datasets, you might train everything in one model, but for words that may have different senses in different examples, sometimes use an alternate dataset-specific token. Some ideas along this line in prior answers: – gojomo Aug 26 '20 at 22:21
  • https://stackoverflow.com/questions/57392103/word-embeddings-for-the-same-word-from-two-different-texts/57400356#57400356 and https://stackoverflow.com/questions/59084092/how-calculate-distance-between-2-node2vec-model/59095246#59095246 – gojomo Aug 26 '20 at 22:21
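The anchor-word alignment idea can be sketched in plain NumPy via orthogonal Procrustes (this is the underlying technique, not gensim's class; all names and data here are made up for illustration):

```python
import numpy as np

def learn_alignment(src_anchors, tgt_anchors):
    """Orthogonal Procrustes: find the rotation W minimizing
    ||src_anchors @ W - tgt_anchors||_F, where the rows are vectors for
    the same anchor words in two independently trained spaces."""
    u, _, vt = np.linalg.svd(src_anchors.T @ tgt_anchors)
    return u @ vt

# Toy example: pretend the second training run produced a space that is
# just a 90-degree rotation of the first.
rng = np.random.default_rng(0)
src = rng.normal(size=(10, 2))              # 10 anchor words, 2-d vectors
rot = np.array([[0.0, -1.0], [1.0, 0.0]])   # hidden "true" rotation
tgt = src @ rot

W = learn_alignment(src, tgt)
# After mapping, source-space vectors land on their target-space counterparts,
# so vectors for the *same* word become directly comparable across datasets.
print(np.allclose(src @ W, tgt))  # True
```

With a learned W, you could map each dataset's king vector into one shared space and only then compare, average, or sum them.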