0

Does anyone have experience with, or an idea of, the best way to train a W2V model while enriching it with geo-location context (using the Gensim library)?

  1. I have a dataset of scripted conversations from different English-speaking countries.
  2. I would like to train the model to understand the relations between words, but also to consider the location in which the conversations took place.
  3. So when I'm "questioning" the model, I can give it the context of a certain country and potentially improve its relevance.

What I have in mind is to inject a geo-location ID into every phrase, as a (fake) word.

Example -

p1 [us, the, lion, king, is, a, great, movie, us]

p2 [uk, king, charles, ascended, the, throne, uk]

The desired result should be something along the lines of:

vec("us") + vec("king") --> vec("lion")

vec("uk") + vec("king") --> vec("charles")
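Roughly what I imagine in Gensim terms (the parameters and the most_similar probe are just my guess at how this might be trained and queried, not something I've validated):

```python
from gensim.models import Word2Vec

# tokenized phrases with a (fake) geo-location word injected at both ends
sentences = [
    ["us", "the", "lion", "king", "is", "a", "great", "movie", "us"],
    ["uk", "king", "charles", "ascended", "the", "throne", "uk"],
    # ... many more conversations per country in the real dataset
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# "question" the model with a country token as extra context
print(model.wv.most_similar(positive=["us", "king"], topn=5))
print(model.wv.most_similar(positive=["uk", "king"], topn=5))
```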

Does anyone have a more structured idea of how to do that, while still sticking to the Gensim library?

ssbbaa
  • 1
  • 1

1 Answer

0

I've not seen any proven techniques for your need.

But, it is a bit similar to how people try to track the drift in word meanings over different eras. There's been some published work like HistWords from Stanford on that task.

In past answers, I've also suggested that people working on the eras-drift task try probabilistically replacing words whose sense may vary with alternate, context-labeled tokens. That is, if king is one of the words that you expect to vary based on your geography contexts, expand your training corpus to sometimes replace king in UK contexts with king_UK, and in US contexts with king_US. (In some cases, you might even repeat your texts to do this.) Then, at the end of training, you'll have separate (but close) vectors for all of king, king_UK, & king_US – and the subtle differences between them may be reflective of what you're trying to study/capture.
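A minimal sketch of that corpus-preprocessing idea (the word list, replacement probability, & Word2Vec parameters here are placeholders you'd have to tune, not tested values):

```python
import random

from gensim.models import Word2Vec

REGIONAL_WORDS = {"king"}   # words whose sense you expect to vary by geography
REPLACE_PROB = 0.5          # fraction of occurrences to regionalize (needs tuning)

def regionalize(tokens, region, words=REGIONAL_WORDS, prob=REPLACE_PROB):
    """Sometimes swap sense-varying words for region-labeled variants."""
    return [
        f"{tok}_{region}" if tok in words and random.random() < prob else tok
        for tok in tokens
    ]

# (tokens, region) pairs – the real corpus would be far larger
corpus = [
    (["the", "lion", "king", "is", "a", "great", "movie"], "US"),
    (["king", "charles", "ascended", "the", "throne"], "UK"),
]

train_sentences = [regionalize(tokens, region) for tokens, region in corpus]
model = Word2Vec(train_sentences, vector_size=100, window=5, min_count=1)

# after training on a realistic corpus, 'king', 'king_UK', & 'king_US' each
# get their own (related) vectors; their differences may reflect the
# geography-specific senses
```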

You can see other discussion of related ideas in previous answers:

https://stackoverflow.com/a/57400356/130288

https://stackoverflow.com/a/59095246/130288

I'm not sure how well this approach might work, nor (if it does) optimal ways to transform the corpus to capture all the geography-flavored meaning-shifts.

I suspect the extreme approach of transforming every word in a UK context to its UK-specific token (& the same for other contexts) would work less well than only sometimes transforming the tokens – because a total transformation would mean each region's tokens only get trained with each other, never with the shared (non-regionalized) words that help 'anchor' variant meanings in the same shared overall context. But that hunch would need to be tested.

(This simple "replace-some-tokens" strategy has the advantage that it can be done entirely via corpus preprocessing, with no change to the algorithms. If willing/able to perform big changes to the library, another approach could be more fasttext-like: treat every instance of king as a sum of both a generic king_en vector and a region king_UK (etc) vector. Then every usage example would update both.)

gojomo
  • 52,260
  • 14
  • 86
  • 115