24

I am trying to train a word2vec model on very short phrases (5 grams). Since each sentence or example is very short, I believe the window size I can use can atmost be 2. I am trying to understand what the implications of such a small window size are on the quality of the learned model, so that I can understand whether my model has learnt something meaningful or not. I tried training a word2vec model on 5-grams but it appears the learnt model does not capture semantics etc very well.

I am using the following test to evaluate the accuracy of model: https://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt

I used gensim.Word2Vec to train a model and here is a snippet of my accuracy scores (using a window size of 2)

[{'correct': 2, 'incorrect': 304, 'section': 'capital-common-countries'},
 {'correct': 2, 'incorrect': 453, 'section': 'capital-world'},
 {'correct': 0, 'incorrect': 86, 'section': 'currency'},
 {'correct': 2, 'incorrect': 703, 'section': 'city-in-state'},
 {'correct': 123, 'incorrect': 183, 'section': 'family'},
 {'correct': 21, 'incorrect': 791, 'section': 'gram1-adjective-to-adverb'},
 {'correct': 8, 'incorrect': 544, 'section': 'gram2-opposite'},
 {'correct': 284, 'incorrect': 976, 'section': 'gram3-comparative'},
 {'correct': 67, 'incorrect': 863, 'section': 'gram4-superlative'},
 {'correct': 41, 'incorrect': 951, 'section': 'gram5-present-participle'},
 {'correct': 6, 'incorrect': 1089, 'section': 'gram6-nationality-adjective'},
 {'correct': 171, 'incorrect': 1389, 'section': 'gram7-past-tense'},
 {'correct': 56, 'incorrect': 936, 'section': 'gram8-plural'},
 {'correct': 52, 'incorrect': 705, 'section': 'gram9-plural-verbs'},
 {'correct': 835, 'incorrect': 9973, 'section': 'total'}]

I also tried running the demo-word-accuracy.sh script outlined here with a window size of 2 and get poor accuracy as well:

Sample output:
    capital-common-countries:
    ACCURACY TOP1: 19.37 %  (98 / 506)
    Total accuracy: 19.37 %   Semantic accuracy: 19.37 %   Syntactic accuracy: -nan % 
    capital-world:
    ACCURACY TOP1: 10.26 %  (149 / 1452)
    Total accuracy: 12.61 %   Semantic accuracy: 12.61 %   Syntactic accuracy: -nan % 
    currency:
    ACCURACY TOP1: 6.34 %  (17 / 268)
    Total accuracy: 11.86 %   Semantic accuracy: 11.86 %   Syntactic accuracy: -nan % 
    city-in-state:
    ACCURACY TOP1: 11.78 %  (185 / 1571)
    Total accuracy: 11.83 %   Semantic accuracy: 11.83 %   Syntactic accuracy: -nan % 
    family:
    ACCURACY TOP1: 57.19 %  (175 / 306)
    Total accuracy: 15.21 %   Semantic accuracy: 15.21 %   Syntactic accuracy: -nan % 
    gram1-adjective-to-adverb:
    ACCURACY TOP1: 6.48 %  (49 / 756)
    Total accuracy: 13.85 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 6.48 % 
    gram2-opposite:
    ACCURACY TOP1: 17.97 %  (55 / 306)
    Total accuracy: 14.09 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 9.79 % 
    gram3-comparative:
    ACCURACY TOP1: 34.68 %  (437 / 1260)
    Total accuracy: 18.13 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 23.30 % 
    gram4-superlative:
    ACCURACY TOP1: 14.82 %  (75 / 506)
    Total accuracy: 17.89 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 21.78 % 
    gram5-present-participle:
    ACCURACY TOP1: 19.96 %  (198 / 992)
    Total accuracy: 18.15 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 21.31 % 
    gram6-nationality-adjective:
    ACCURACY TOP1: 35.81 %  (491 / 1371)
    Total accuracy: 20.76 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 25.14 % 
    gram7-past-tense:
    ACCURACY TOP1: 19.67 %  (262 / 1332)
    Total accuracy: 20.62 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 24.02 % 
    gram8-plural:
    ACCURACY TOP1: 35.38 %  (351 / 992)
    Total accuracy: 21.88 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 25.52 % 
    gram9-plural-verbs:
    ACCURACY TOP1: 20.00 %  (130 / 650)
    Total accuracy: 21.78 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 25.08 % 
    Questions seen / total: 12268 19544   62.77 % 

However the word2vec site claims its possible to obtain an accuracy of ~60% on these tasks. Hence I would like to gain some insights into the effect of these hyperparameters like window size and how they affect quality of learnt models.

vkmv
  • 1,345
  • 1
  • 14
  • 24

2 Answers2

41

Very low scores on the analogy-questions are more likely due to limitations in the amount or quality of your training data, rather than mistuned parameters. (If your training phrases are really only 5 words each, they may not capture the same rich relations as can be discovered from datasets with full sentences.)

You could use a window of 5 on your phrases – the training code trims the window to what's available on either side – but then every word of each phrase affects all of the other words. That might be OK: one of the Google word2vec papers ("Distributed Representations of Words and Phrases and their Compositionality", https://arxiv.org/abs/1310.4546) mentions that to get the best accuracy on one of their phrase tasks, they used "the entire sentence for the context". (On the other hand, on one English corpus of short messages, I found a window size of just 2 created the vectors that scored best on the analogies-evaluation, so larger isn't necessarily better.)

A paper by Levy & Goldberg, "Dependency-Based Word Embeddings", speaks a bit about the qualitative effect of window-size:

https://levyomer.files.wordpress.com/2014/04/dependency-based-word-embeddings-acl-2014.pdf

They find:

Larger windows tend to capture more topic/domain information: what other words (of any type) are used in related discussions? Smaller windows tend to capture more about word itself: what other words are functionally similar? (Their own extension, the dependency-based embeddings, seems best at finding most-similar words, synonyms or obvious-alternatives that could drop-in as replacements of the origin word.)

Manuel Alves
  • 3,885
  • 2
  • 30
  • 24
gojomo
  • 52,260
  • 14
  • 86
  • 115
  • 3
    Great answer - Just a note that sometimes 'obvious alternatives' can be antonyms. I want to **confirm** my reservation / I want to **cancel** my reservation. – Doug T. Apr 30 '19 at 23:54
  • @gojomo I'm just wondering if I can use the entire sentence as the window size. If yes then could you please tell me how to do it? – Abu Shoeb May 21 '20 at 19:17
  • 3
    There's no setting for that, but choosing a `window` far larger than your largest sentence has essentially the same effect. – gojomo May 21 '20 at 23:00
  • 1
    Yap, I was considering the same. Thanks very much. – Abu Shoeb May 23 '20 at 20:12
14

To your question: "I am trying to understand what the implications of such a small window size are on the quality of the learned model".

For example "stackoverflow great website for programmers" with 5 words (suppose we save the stop words great and for here) if the window size is 2 then the vector of word "stackoverflow" is directly affected by the word "great" and "website", if the window size is 5 "stackoverflow" can be directly affected by two more words "for" and "programmers". The 'affected' here means it will pull the vector of two words closer.

So it depends on the material you are using for training, if the window size of 2 can capture the context of a word, but 5 is chosen, it will decrease the quality of the learnt model, and vise versa.

Leonid Glanz
  • 1,261
  • 2
  • 16
  • 36
michaeltang
  • 2,850
  • 15
  • 18