Comparing corpora in Word2Vec

Question

For my master's thesis I am using the Word2Vec implementation in gensim to analyze the language of scientific publications in biology and philosophy.

To compare the two disciplines, I used a method suggested in an answer to a different question: "Can we compare word vectors from different models using transfer learning?"

Based on this answer, I identified the words that occur in both corpora (after cleaning and stemming) and added a corpus-specific tag to the most frequently used ones.
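A minimal sketch of this tagging step, assuming each corpus is already a list of cleaned, stemmed token lists. The function name `tag_shared_words`, the tag suffixes, the toy documents, and the choice to rank shared words by combined frequency are all illustrative assumptions, not necessarily what the linked answer prescribes:

```python
from collections import Counter


def tag_shared_words(corpus_a, corpus_b, tag_a, tag_b, top_n):
    """Tag the top_n most frequent words that occur in both corpora.

    Each corpus is a list of documents; each document is a list of tokens.
    Returns the combined, tagged corpus and the sorted list of tagged words.
    """
    counts_a = Counter(tok for doc in corpus_a for tok in doc)
    counts_b = Counter(tok for doc in corpus_b for tok in doc)

    # Words that appear in both corpora
    shared = set(counts_a) & set(counts_b)

    # Assumption: rank by combined frequency, ties broken alphabetically
    ranked = sorted(shared, key=lambda w: (-(counts_a[w] + counts_b[w]), w))
    targets = set(ranked[:top_n])

    # Replace each target word with a corpus-specific variant, e.g. "brain_bio"
    tagged_a = [[f"{t}_{tag_a}" if t in targets else t for t in doc] for doc in corpus_a]
    tagged_b = [[f"{t}_{tag_b}" if t in targets else t for t in doc] for doc in corpus_b]
    return tagged_a + tagged_b, sorted(targets)


# Hypothetical toy data standing in for the stemmed abstracts
bio_docs = [["conscious", "neuron", "brain"], ["conscious", "brain", "cell"]]
phil_docs = [["conscious", "mind", "brain"], ["conscious", "qualia"]]

combined, targets = tag_shared_words(bio_docs, phil_docs, "bio", "phil", top_n=1)
```

The combined list could then be passed as `sentences` to gensim's `Word2Vec`, so that the tagged variants of a word get separate vectors in one shared space while the untagged words act as common context.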

Interestingly, the results of this method differ from those of another method for comparing corpora (Compass-Aligned Distributional Embeddings, CADE, as developed by Bianchi et al., 2020, see https://federicobianchi.io/cade/).

Could someone refer me to any work that uses the tagging method, so that I can compare its pros and cons with CADE (and possibly other) methodologies?

With the tagging method, the Word2Vec (cosine) similarity for the overlapping words is very low in abstracts published on consciousness.

CADE, on the other hand, shows the reverse: a higher between-discipline similarity (even compared to the similarity within disciplines).
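For reference, the cosine similarity mentioned above can be sketched as follows. The vectors here are hypothetical toy values, not output from either model; for a trained gensim model, `model.wv.similarity(w1, w2)` returns the same quantity for two words in its vocabulary.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical vectors: orthogonal directions vs. the same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

A value near 0 means the two vectors point in unrelated directions, and a value near 1 means they point the same way, which is why a low score between the two tagged variants of a word reads as the disciplines using that word in different contexts.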
