text2vec word embeddings: compound some tokens but not all

Question

I am using {text2vec} word embeddings to build a dictionary of similar terms pertaining to a certain semantic category.

Is it OK to compound some tokens in the corpus, but not all? For example, I want to find terms similar to “future generation” or “rising generation”, but these collocations of course occur as separate tokens in the original corpus. I am wondering whether it is bad practice to gsub "rising generation" --> "rising_generation" without also compounding all other terms that frequently occur together, such as “climate change”.

Thanks!

Have you already tried to read what is described in this question? https://datascience.stackexchange.com/questions/22572/how-can-i-get-semantic-word-embneddings-for-compound-terms — Elidor00, Oct 04 '20 at 13:44

score 0 · Accepted Answer · answered Oct 05 '20 at 04:08

Yes, it's fine. It may or may not work exactly the way you want, but it's worth trying.

You might want to look at the Collocations model in text2vec, which can automatically detect and join phrases for you. You can certainly join additional phrases by hand on top of that if you want. In Python, I would use Gensim's Phrases model for the same thing.
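To make the manual joining step concrete, here is a minimal sketch in plain Python (standard library only, no Gensim) of compounding a hand-picked list of phrases while leaving everything else alone. The phrase list, the `compound_phrases` helper, and the two sample documents are made up for illustration; in R, the equivalent is the gsub substitution described in the question.

```python
import re

# Hypothetical list of phrases to compound; all other tokens are left as they are.
PHRASES = ["future generation", "rising generation"]


def compound_phrases(text, phrases=PHRASES):
    """Replace each listed phrase with a single underscore-joined token."""
    # Process longer phrases first so a shorter phrase cannot split a longer one.
    for phrase in sorted(phrases, key=len, reverse=True):
        # Allow any run of whitespace between the words and require word boundaries.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
        text = re.sub(pattern, phrase.replace(" ", "_"), text)
    return text


# Illustrative documents, lowercased before compounding.
docs = [
    "The rising generation worries about climate change.",
    "Climate change will affect every future generation.",
]
compounded = [compound_phrases(d.lower()) for d in docs]

for line in compounded:
    print(line)
```

Here "climate change" stays as two tokens, which is exactly the "some but not all" situation from the question. Note that the word boundary means the plural "future generations" is not joined, so inflected forms would need their own entries in the list.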

Given that training word vectors usually doesn't take too long, it's best to try different techniques and see which one works better for your goal.
