Vector representation for token and compound word

Question

I have a corpus of sentences. Each of them may contain marked compound words. For example:

This is an example_sentence followed by another awesome_paragraph

I want to get embedding vectors for all tokens and compound words:

(this, is, an, example, sentence, followed, by, another, awesome, paragraph, example_sentence, awesome_paragraph)

Can I do this with gensim, or should I use a different library?

Yes, you can. According to the docs (https://radimrehurek.com/gensim/models/word2vec.html), `gensim.models.Word2Vec` takes a `sentences` parameter: an iterable of sentences, each given as a list of tokens. The tokens are usually the words of the sentence, but you can define tokens however you like. In your case, write a small function that produces both the individual and the compound tokens, then pass its output to `gensim.models.Word2Vec`. — JARS, May 16 '18 at 07:56
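A minimal sketch of the small function that comment describes, under one possible reading: each marked compound is kept as a token and immediately followed by its parts. The helper names `expand_compounds` and `train_word2vec` are hypothetical, lowercasing and whitespace splitting are assumptions, and `min_count=1` is chosen only so that a toy corpus keeps every token.

```python
def expand_compounds(sentence, sep="_"):
    """Tokenize on whitespace; keep each compound and append its parts."""
    tokens = []
    for token in sentence.lower().split():
        tokens.append(token)
        if sep in token:
            # "example_sentence" also contributes "example" and "sentence".
            tokens.extend(part for part in token.split(sep) if part)
    return tokens


def train_word2vec(raw_sentences):
    """Train a model on the expanded tokens (hypothetical helper)."""
    # Imported lazily so the tokenizer above works without gensim installed.
    from gensim.models import Word2Vec

    tokenized = [expand_compounds(s) for s in raw_sentences]
    # min_count=1 is an assumption for tiny corpora; real data may want more.
    return Word2Vec(sentences=tokenized, min_count=1)


corpus = ["This is an example_sentence followed by another awesome_paragraph"]
tokenized = [expand_compounds(s) for s in corpus]
```

With gensim installed, `train_word2vec(corpus).wv["example_sentence"]` should return the vector for the compound, and `.wv["example"]` the vector for one of its parts. Note that placing the parts next to the compound changes the context windows, so whether this layout suits a given corpus would need testing.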
Thanks, @JARS. Do you mean that each sentence should be generated twice, once with and once without compound tokens? In that case, the frequency of tokens outside the compounds would be doubled. Is that a problem? — Brody, May 16 '18 at 09:08
Gensim just takes whatever tokenization you provide. If you want to supply the same sentence once with compound words and once without, that'd work fine. The effect of doubling the frequencies would have to be tested against your data and goals - it might help or hurt. (But it probably wouldn't hurt much - it's much like adding an extra training pass on the recurring words. And the downsampling controlled by the `sample` parameter already dampens the excess training of very common words, which could help offset any overweighting they're getting.) — gojomo, May 16 '18 at 21:10
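The two-version approach discussed in these comments can be sketched as follows. The helper names `two_versions` and `train` are hypothetical, and lowercasing and whitespace splitting are assumptions. The `Counter` only makes the doubling visible: words outside compounds are counted twice, while each compound and each of its parts is counted once.

```python
from collections import Counter


def two_versions(sentence, sep="_"):
    """Return the sentence tokenized with compounds kept and with them split."""
    words = sentence.lower().split()
    split_words = [part for w in words for part in w.split(sep) if part]
    return [words, split_words]


def train(sentences, sample=1e-3):
    """Train on both versions (hypothetical helper)."""
    # Imported lazily so the counting above works without gensim installed.
    from gensim.models import Word2Vec

    # `sample` sets the downsampling threshold for very frequent words;
    # min_count=1 is an assumption that only suits a toy corpus.
    return Word2Vec(sentences=sentences, min_count=1, sample=sample)


corpus = ["This is an example_sentence followed by another awesome_paragraph"]
training_sentences = [v for s in corpus for v in two_versions(s)]
counts = Counter(tok for sent in training_sentences for tok in sent)
```

A sentence without any compound yields two identical token lists, so it is simply seen twice. Comparing models trained with a few different `sample` values on real data would be one way to check whether the doubling matters.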

0 Answers